Concurrency Issues on Data Downloading [BUG]

SamuelGong commented 3 years ago

Describe the bug: Datasets that are not shipped with torch and therefore have to be downloaded manually (e.g., cinic10, multimodal_base, pascal_voc, and tiny_imagenet) are currently downloaded as a whole (i.e., both the training and the testing datasets) in the constructors of their respective DataSource instances.

While this design may work well in a testing environment where the server and all the clients are colocated on one machine, it may run into severe concurrency issues in other situations, such as the one described in Deploying a Plato Federated Learning Server in a Production Environment, which Plato also aims to support.

To see that, consider the two cases separately:

For the former case, it is always the server that calls its configure() method first, and only when the call returns does the server spawn clients on the same machine. As a result, when the clients call their own configure() independently, none of them needs to download the dataset again, because it has already been prepared as a whole during the initialization of the server.
For the latter case, however, the server may not be colocated with the clients. If a remote machine (where there is no server) hosts multiple clients and these clients are initialized concurrently, then the current design makes it possible that these clients all (1) conclude that the desired data is not ready in local storage, and thus (2) download and preprocess (at least "unzip") the data concurrently. If this is the case,
1. network bandwidth, CPU cycles, and memory will be wasted on redundant work,
2. program runtime will be prolonged for the same reason, and, more importantly,
3. unexpected stalls or faults may be caused by concurrent creation of the dataset on the file system.
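
The check-then-download race described above can be illustrated with a small, self-contained sketch. It is not Plato code: threads stand in for client processes, `simulate_concurrent_init` and the `dataset` directory are hypothetical names, and a barrier deliberately forces the worst-case interleaving in which every client finishes its existence check before any client starts preparing the data.

```python
import os
import tempfile
import threading


def simulate_concurrent_init(num_clients: int) -> int:
    """Return how many simulated clients decide to download the dataset."""
    downloads = 0
    counter_lock = threading.Lock()
    # The barrier makes every client finish its check before any client
    # starts "downloading", i.e., the worst-case interleaving.
    all_checked = threading.Barrier(num_clients)

    with tempfile.TemporaryDirectory() as root:
        data_path = os.path.join(root, "dataset")  # hypothetical location

        def client_init() -> None:
            nonlocal downloads
            missing = not os.path.exists(data_path)  # step (1): check
            all_checked.wait()
            if missing:  # step (2): act on a check that may be stale
                with counter_lock:
                    downloads += 1
                # Stand-in for the real download and unzip.
                os.makedirs(data_path, exist_ok=True)

        threads = [threading.Thread(target=client_init)
                   for _ in range(num_clients)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

    return downloads
```

With this forced interleaving, `simulate_concurrent_init(4)` returns 4: every client sees the data as missing and repeats the work, which is the redundancy listed in points 1 and 2. The stalls or faults of point 3 would come from the real download and unzip steps writing to the same files, which this sketch does not model.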

To Reproduce: This bug should make sense conceptually. We can provide the steps for reproducing it later, if necessary.

Additional context: We spotted this bug while developing a new feature, FEMNIST. Since the solution looks like a non-trivial design problem, we would prefer to seek the authors' help before working out an immature solution.

SamuelGong commented 3 years ago

One possible solution would be to reformat the online resources so that each client's dataset can be fetched separately. That way, even if clients fetch data concurrently, there will be no concurrency issue at all. Such a solution is feasible for pre-partitioned datasets such as FEMNIST (one can see an external tutorial for more details), but we are not sure whether it also applies to other types of datasets (e.g., the ones we mentioned above).
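
A minimal sketch of the per-client idea, under stated assumptions: the `client_<id>.json` layout, the function names, and the injected `download` callable are all hypothetical and are not taken from Plato or from the FEMNIST resources. The point is only that each client reads and writes its own path, so concurrent clients never touch the same file.

```python
import os
import tempfile
from typing import Callable, List


def client_shard_path(root: str, client_id: int) -> str:
    """Hypothetical layout: one file per client under a shared root."""
    return os.path.join(root, f"client_{client_id}.json")


def fetch_client_shard(root: str, client_id: int,
                       download: Callable[[int], bytes]) -> str:
    """Fetch only this client's shard and return its local path."""
    os.makedirs(root, exist_ok=True)  # safe even if several clients call it
    path = client_shard_path(root, client_id)
    if not os.path.exists(path):
        # Write to a private temporary file, then rename it into place so
        # a partially written shard is never visible under the final name.
        tmp_path = f"{path}.tmp.{os.getpid()}"
        with open(tmp_path, "wb") as f:
            f.write(download(client_id))
        os.replace(tmp_path, path)
    return path


def demo() -> List[str]:
    """Fetch shards for three clients using a fake downloader."""
    with tempfile.TemporaryDirectory() as root:
        for cid in (1, 2, 3):
            fetch_client_shard(root, cid, lambda c: f"data-{c}".encode())
        return sorted(os.listdir(root))
```

Here `download` is a placeholder for whatever actually retrieves one client's data. Whether the other datasets mentioned above can be split this way is left open, as noted.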

baochunli commented 3 years ago

Thanks for the heads-up and for raising this issue. This has been a known issue for a long time, but it is low in priority, so it has not yet been addressed. A potential solution is to raise an error when a client finds that the dataset has not yet been downloaded, with a friendly prompt asking the user to run a suitable command to download the dataset before running the full experiment. I will probably try to get this issue resolved using this idea first, before something better comes up.
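
A rough sketch of this fail-fast idea, assuming a dataset counts as ready when its directory exists and is non-empty. `DatasetNotReadyError` and `require_dataset` are hypothetical names, not Plato's API, and the actual download command is left unspecified here.

```python
import os
import tempfile
from typing import Tuple


class DatasetNotReadyError(RuntimeError):
    """Raised when a client starts before the dataset has been downloaded."""


def require_dataset(data_path: str, dataset_name: str) -> str:
    """Return data_path if the dataset is present, else raise with a prompt."""
    # Assumption: "ready" means the directory exists and is not empty.
    if not os.path.isdir(data_path) or not os.listdir(data_path):
        raise DatasetNotReadyError(
            f"Dataset '{dataset_name}' was not found at {data_path}. "
            "Please download it once on this machine before starting the "
            "clients, then rerun the experiment."
        )
    return data_path


def demo() -> Tuple[bool, bool]:
    """Check both outcomes against a temporary directory."""
    with tempfile.TemporaryDirectory() as root:
        try:
            require_dataset(root, "cinic10")
            failed_when_empty = False
        except DatasetNotReadyError:
            failed_when_empty = True
        # Simulate a finished download by creating one placeholder file.
        open(os.path.join(root, "train.bin"), "wb").close()
        ok_when_present = require_dataset(root, "cinic10") == root
    return failed_when_empty, ok_when_present
```

Because clients only ever check and never download, concurrent initialization cannot trigger redundant or conflicting downloads. The cost is one extra manual step per machine.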

TL-System / plato

Concurrency Issues on Data Downloading [BUG] #102