TL-System / plato

A federated learning framework to support scalable and reproducible research
Apache License 2.0

[BUG] NonIID partition issues #104

Closed · wizard1203 closed this issue 3 years ago

wizard1203 commented 3 years ago

For the Non-IID Dirichlet partition, it seems that each client gets a WeightedRandomSampler whose weights follow a Dirichlet distribution and samples data from the whole dataset, instead of a genuinely partitioned local dataset.

This can cause a problem: a client may see new data samples that it has never seen before. This contradicts the Non-IID setting, in which a client never sees other clients' samples.
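
For illustration, a minimal sketch of the sampling pattern described above, using toy labels and PyTorch's WeightedRandomSampler; this is not Plato's actual code, and all names are made up for the example:

```python
import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

# Toy labels standing in for the full (unpartitioned) training set.
num_classes, num_samples = 10, 1000
labels = np.random.default_rng(0).integers(0, num_classes, num_samples)

def client_sampler(client_id, concentration=0.5, partition_size=128):
    """Weighted sampler over the WHOLE dataset, biased by a Dirichlet draw."""
    rng = np.random.default_rng(client_id)            # seed tied to the client ID
    class_proportions = rng.dirichlet([concentration] * num_classes)
    # Each sample is weighted by the Dirichlet proportion of its class.
    sample_weights = torch.from_numpy(class_proportions[labels])
    return WeightedRandomSampler(sample_weights, partition_size, replacement=False)

# Both clients draw indices from the same underlying dataset, so their
# index sets are not guaranteed to be disjoint.
indices_a = set(client_sampler(client_id=1))
indices_b = set(client_sampler(client_id=2))
print("samples shared by clients 1 and 2:", len(indices_a & indices_b))
```

Because both samplers draw from the same pool of indices, nothing prevents the two clients' index sets from overlapping.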

baochunli commented 3 years ago

From a performance point of view, this is the only way to do it. You certainly don't want to partition the data — it's going to be extremely slow if you do it using the file system, and most datasets don't fit into the physical memory.

Conceptually, data for each client can certainly change over multiple iterations over time. It doesn't have to stay the same. However, since the random seed for each client is fixed to its client ID, the dataset each client obtained in Plato will remain the same across iterations.
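
A minimal sketch of that reproducibility argument, assuming the sampler's generator is seeded with the client ID (hypothetical names, not Plato's implementation):

```python
import torch
from torch.utils.data import WeightedRandomSampler

def sample_indices(client_id, weights, num_samples=64):
    # Re-seeding the generator with the client ID every round makes the draw
    # deterministic, so the "local dataset" stays the same across rounds.
    generator = torch.Generator()
    generator.manual_seed(client_id)
    sampler = WeightedRandomSampler(weights, num_samples,
                                    replacement=False, generator=generator)
    return list(sampler)

weights = torch.ones(1000)               # uniform weights, just for illustration
round_1 = sample_indices(client_id=7, weights=weights)
round_2 = sample_indices(client_id=7, weights=weights)
assert round_1 == round_2                # same client ID => same samples each round
```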

wizard1203 commented 3 years ago

> From a performance point of view, this is the only way to do it. You certainly don't want to partition the data — it's going to be extremely slow if you do it using the file system, and most datasets don't fit into the physical memory.
>
> Conceptually, data for each client can certainly change over multiple iterations over time. It doesn't have to stay the same. However, since the random seed for each client is fixed to its client ID, the dataset each client obtained in Plato will remain the same across iterations.

Thanks for your careful explanation.

For the performance point, could we use the same partitioning approach as FedML or other codebases, which gives each client the indices of its local dataset? Then we would not need to worry about the performance problem either.

For the random seed, yes, if it does not change, then clients won't see other clients' data. But it seems there is still a bug: client i and client j may share some common samples in the same iteration, because the WeightedRandomSampler only gives each client a sampling probability over the dataset, not a specific set of indices. It cannot divide the whole dataset into disjoint local datasets. This may also cause accuracy loss, because some samples are never used (if some samples are shared, other samples must be missing).
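
For comparison, a minimal sketch of an index-based Dirichlet partition of the kind described here (similar in spirit to FedML-style partitioning, but not code from either project), where every sample index is assigned to exactly one client:

```python
import numpy as np

def dirichlet_partition(labels, num_clients, concentration=0.5, seed=1):
    """Assign every sample index to exactly one client (disjoint partition)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        class_idx = rng.permutation(np.where(labels == c)[0])
        # Split this class across clients according to a Dirichlet draw.
        proportions = rng.dirichlet([concentration] * num_clients)
        cuts = (np.cumsum(proportions)[:-1] * len(class_idx)).astype(int)
        for client_id, chunk in enumerate(np.split(class_idx, cuts)):
            client_indices[client_id].extend(chunk.tolist())
    return client_indices

# Demo: 1000 samples over 10 classes split among 10 clients; the index sets
# are pairwise disjoint and together cover every sample exactly once.
labels = np.random.default_rng(0).integers(0, 10, 1000)
parts = dirichlet_partition(labels, num_clients=10)
assert sum(len(p) for p in parts) == len(labels)
# Each client's local dataset could then be torch.utils.data.Subset(dataset, parts[i]).
```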

baochunli commented 3 years ago

The Federated EMNIST dataset has been provided to do exactly this: each client will load its own pre-partitioned dataset.

Regarding different clients getting some common samples, this can be the case when the dataset is very small or when the number of clients is large. However, this happens all the time in the real world: you cannot expect all the clients to have precisely partitioned data -- that's just a simulation. If I use a smart keyboard, my next word would very likely be the same as someone else's. If this causes any "accuracy loss," then so be it. That's what happens in the real world. You just cannot get the same kind of accuracy as with carefully curated and precisely partitioned data if the total number of samples in the entire dataset is small and there are a lot of clients. This is actually why academic papers don't really reflect what happens in the real world.
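
For completeness, a rough sketch of the pre-partitioned approach; the file layout and loader below are hypothetical and not Plato's Federated EMNIST code:

```python
import torch
from torch.utils.data import TensorDataset

def load_client_shard(client_id, root="femnist_partitions"):
    # Hypothetical layout: one file per client, e.g. femnist_partitions/client_7.pt,
    # holding a dict with "x" (examples) and "y" (labels) tensors.
    shard = torch.load(f"{root}/client_{client_id}.pt")
    return TensorDataset(shard["x"], shard["y"])

# The client trains only on its own shard (e.g. wrapped in a DataLoader);
# no weighted sampling over the global dataset is involved.
# train_set = load_client_shard(client_id=7)
```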

SamuelGong commented 3 years ago

Glad to see that the FEMNIST dataset was mentioned. Actually, we can plug in more pre-partitioned datasets like FEMNIST, e.g., those used in LEAF or FedScale, in a similar way. Perhaps one can see here for detailed tutorials!

wizard1203 commented 3 years ago

> The Federated EMNIST dataset has been provided to do exactly this: each client will load its own pre-partitioned dataset.
>
> Regarding different clients getting some common samples, this can be the case when the dataset is very small or when the number of clients is large. However, this happens all the time in the real world: you cannot expect all the clients to have precisely partitioned data -- that's just a simulation. If I use a smart keyboard, my next word would very likely be the same as someone else's. If this causes any "accuracy loss," then so be it. That's what happens in the real world. You just cannot get the same kind of accuracy as with carefully curated and precisely partitioned data if the total number of samples in the entire dataset is small and there are a lot of clients. This is actually why academic papers don't really reflect what happens in the real world.

> Glad to see that the FEMNIST dataset was mentioned. Actually, we can plug in more pre-partitioned datasets like FEMNIST, e.g., those used in LEAF or FedScale, in a similar way. Perhaps one can see here for detailed tutorials!

Thanks for your careful explanation. Now let me close this issue.