Closed: wizard1203 closed this issue 3 years ago.

For the non-IID Dirichlet partition, it seems that each client gets a WeightedRandomSampler with Dirichlet-distributed weights to sample data from the whole dataset, instead of a truly partitioned local dataset. This may cause a problem: each client can see data samples that it has never seen before, which may contradict the non-IID setting, in which a client should never see other clients' samples.
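(For context, here is a minimal sketch of the pattern being described, assuming per-sample weights drawn from a Dirichlet prior over class proportions and fed into PyTorch's WeightedRandomSampler. The function name and toy labels are illustrative only, not Plato's actual code.)

```python
import numpy as np
from torch.utils.data import WeightedRandomSampler

def dirichlet_weights(targets, num_classes, concentration, client_id):
    """Per-sample weights from Dirichlet-distributed class proportions."""
    rng = np.random.default_rng(client_id)  # seed fixed to the client ID
    proportions = rng.dirichlet(np.repeat(concentration, num_classes))
    return [float(proportions[t]) for t in targets]

targets = [0, 1, 1, 2, 0, 2]   # toy labels for a 6-sample dataset
weights = dirichlet_weights(targets, num_classes=3, concentration=0.5, client_id=7)
sampler = WeightedRandomSampler(weights, num_samples=4, replacement=False)
print(list(sampler))           # indices drawn from the *whole* dataset
```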
From a performance point of view, this is the only way to do it. You certainly don't want to physically partition the data: doing it through the file system would be extremely slow, and most datasets don't fit in physical memory.
Conceptually, the data for each client can certainly change over multiple iterations over time; it doesn't have to stay the same. However, since the random seed for each client is fixed to its client ID, the dataset each client obtains in Plato will remain the same across iterations.
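(As a minimal sketch of that point, with illustrative names rather than Plato's actual code: fixing the sampler's generator seed to the client ID makes the drawn indices identical every time the sampler is rebuilt.)

```python
import torch
from torch.utils.data import WeightedRandomSampler

def make_sampler(client_id, weights, num_samples):
    gen = torch.Generator()
    gen.manual_seed(client_id)  # the sampler's seed is the client ID
    return WeightedRandomSampler(weights, num_samples, generator=gen)

weights = [1.0] * 100
round_1 = list(make_sampler(client_id=3, weights=weights, num_samples=5))
round_2 = list(make_sampler(client_id=3, weights=weights, num_samples=5))
assert round_1 == round_2       # same client, same samples in every round
```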
Thanks for your careful explanation.
For the performance point: could we use the same partitioning approach as FedML, or some other codebase, and give each client the indices of its local dataset? Then we would not need to worry about the performance problem either (a sketch of this index-based approach follows below).
For the random seed: yes, if it does not change, then clients won't see other clients' data across rounds. But there still seems to be a bug: client i and client j may have some common samples in the same iteration, because the WeightedRandomSampler only gives each client a sampling probability, not a specific set of dataset indices. It cannot divide the whole dataset into disjoint local datasets. This may cause accuracy loss, because some samples are missed entirely (if some samples are shared, there must be other samples that are missed).
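(For what it's worth, here is a rough sketch of such an index-based disjoint Dirichlet split; the names and the concentration default are illustrative, and FedML's actual implementation differs in its details.)

```python
import numpy as np

def dirichlet_partition(targets, num_clients, concentration=0.5, seed=1):
    """Split sample indices into disjoint per-client lists, non-IID by class."""
    rng = np.random.default_rng(seed)
    targets = np.asarray(targets)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(targets):
        idx = rng.permutation(np.where(targets == c)[0])
        # Dirichlet proportions decide how much of class c each client gets.
        proportions = rng.dirichlet(np.repeat(concentration, num_clients))
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for cid, part in enumerate(np.split(idx, cuts)):
            client_indices[cid].extend(part.tolist())
    return client_indices  # disjoint: every index appears exactly once

parts = dirichlet_partition(targets=[0, 1] * 50, num_clients=4)
assert sum(len(p) for p in parts) == 100  # no sample missed, none shared
```

Each client would then wrap its index list with torch.utils.data.Subset(dataset, parts[client_id]) and use an ordinary DataLoader.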
The Federated EMNIST dataset has been provided to do exactly this: each client will load its own pre-partitioned dataset.
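(For illustration, the per-client loading pattern might look like the following. The one-file-per-client .npz layout and the key names are assumptions for this sketch, not Plato's actual Federated EMNIST format; the point is that each client only ever opens its own shard, so partitions are disjoint by construction.)

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset

def load_client_dataset(data_dir, client_id):
    shard = np.load(f"{data_dir}/client_{client_id}.npz")  # assumed layout
    x = torch.from_numpy(shard["x"]).float()
    y = torch.from_numpy(shard["y"]).long()
    return TensorDataset(x, y)
```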
Regarding different clients getting some common samples: this can be the case when the dataset is very small or when the number of clients is large. However, this happens all the time in the real world; you cannot expect all the clients to have precisely partitioned data, as that is just a simulation. If I use a smart keyboard, my next word would very likely be the same as someone else's. If this causes any "accuracy loss," then so be it; that is what happens in the real world. You just cannot get the same kind of accuracy as with carefully curated and precisely partitioned data if the total number of samples in the entire dataset is small and there are a lot of clients. This is actually why academic papers don't really reflect what happens in the real world.
Glad to know that the FEMNIST dataset was brought up. Actually, we can plug in more pre-partitioned datasets like FEMNIST, e.g., those used in LEAF or FedScale, in a similar way. Perhaps one can see here for detailed tutorials!
Thanks for your careful explanation. Now let me close this issue.