TL-System / plato

A federated learning framework to support scalable and reproducible research
Apache License 2.0

Concurrency Issues on Data Downloading [BUG] #102

Closed. SamuelGong closed this issue 3 years ago

SamuelGong commented 3 years ago

Describe the bug

Datasets that are not shipped with torch and thus have to be downloaded manually (e.g., cinic10, multimodal_base, pascal_voc, and tiny_imagenet) are currently downloaded as a whole (i.e., the entire training and testing datasets) in the constructors of the respective DataSource instances.

While this design may work well in a testing environment where servers and all the clients are colocated on one machine, it may run into severe concurrency issues in other situations, such as the one described in Deploying a Plato Federated Learning Server in a Production Environment, which Plato also aims to support.

To see that, consider the two cases separately:

To Reproduce

This bug should make sense conceptually. We can provide steps to reproduce it later, if necessary.

Additional context

We spotted this bug while developing a new feature, FEMNIST. Since the solution looks like a non-trivial design problem, we prefer to seek the authors' help before working out an immature solution of our own.

SamuelGong commented 3 years ago

One possible solution would be reformatting the online resources so that each client's dataset can be fetched separately. This way, even if clients fetch data concurrently, there will be no concurrency issue at all. Such a solution is feasible for pre-partitioned datasets such as FEMNIST (one can see an external tutorial for more details), but we are not sure whether it also applies to other types of datasets (e.g., the ones we mentioned above).
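
To make the idea concrete, here is a minimal sketch of per-client fetching. It is not Plato code: the URL layout (`<base_url>/<client_id>.zip`), the helper names `client_archive_url` and `fetch_client_data`, and the per-client directory layout are all assumptions made for illustration.

```python
import os
import urllib.request


def client_archive_url(base_url: str, client_id: int) -> str:
    """Build the URL of one client's archive.

    Assumed (hypothetical) layout: <base_url>/<client_id>.zip
    """
    return f"{base_url.rstrip('/')}/{client_id}.zip"


def fetch_client_data(base_url, client_id, data_root,
                      fetch=urllib.request.urlretrieve):
    """Download only this client's partition into its own directory.

    `fetch(url, filename)` defaults to urllib's urlretrieve and can be
    swapped out, e.g. for testing without network access.
    """
    # Each client gets a separate directory, so two clients never
    # write to the same file.
    client_dir = os.path.join(data_root, str(client_id))
    os.makedirs(client_dir, exist_ok=True)

    target = os.path.join(client_dir, "data.zip")
    if not os.path.exists(target):
        # Download under a temporary name, then rename, so a partially
        # downloaded file is never mistaken for a complete one.
        partial = target + ".part"
        fetch(client_archive_url(base_url, client_id), partial)
        os.replace(partial, target)
    return target
```

Because every client only touches `data_root/<client_id>/`, concurrent clients on the same machine would not overwrite each other's downloads under this layout.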

baochunli commented 3 years ago

Thanks for raising this issue. This has been a known issue for a long time, but it is low in priority and so has not yet been addressed. A potential solution is to raise an error when a client finds that the dataset has not yet been downloaded, with a friendly prompt asking the user to run a suitable command to download the dataset before running the full experiment. I will probably try to resolve this issue with this idea first, until something better comes up.
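
A rough sketch of what such a check could look like is below. This is only an illustration, not the eventual fix: the function name `require_dataset` and the way the download command is passed in are assumptions, and the actual command would be whatever Plato ends up providing.

```python
import os


def require_dataset(data_path: str, download_command: str) -> None:
    """Raise a descriptive error if the dataset has not been downloaded.

    `download_command` is a placeholder for whatever command the user
    should run; it is only echoed in the error message.
    """
    # Treat a missing or empty directory as "not downloaded yet".
    if not os.path.isdir(data_path) or not os.listdir(data_path):
        raise FileNotFoundError(
            f"Dataset not found at '{data_path}'. "
            f"Please download it first by running: {download_command}"
        )
```

A DataSource constructor could call this check instead of downloading, so that no client ever starts a download on its own during an experiment.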