Splitting the data is wrong !?

TsingZ0 / PFLlib

We expose this user-friendly algorithm library (with an integrated evaluation platform) for beginners who intend to start federated learning (FL) study

GNU General Public License v2.0


Splitting the data is wrong !? #194

Closed Amrusama closed 1 week ago

Amrusama commented 3 weeks ago

I generated a non-IID version of FashionMNIST for 15 clients using the following command: `python generate_FashionMNIST.py noniid - pat`. The data distribution printed on the terminal is as follows:

Screenshot 2024-07-02 141328

I then printed the number of data points per class for every client and plotted each client's class distribution. Neither matched the output on the terminal.

Screenshot 2024-07-02 140351 Screenshot 2024-07-02 140116
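For anyone reproducing this check, here is a minimal sketch of a per-client class tally. It assumes each client's labels are already loaded into a dict of integer arrays; `per_client_class_counts` and the toy labels are hypothetical and not part of PFLlib.

```python
import numpy as np

def per_client_class_counts(client_labels, num_classes):
    """Map each client id to an array of per-class sample counts.

    client_labels is assumed to be a dict {client_id: sequence of int labels}.
    """
    return {
        cid: np.bincount(np.asarray(labels), minlength=num_classes)
        for cid, labels in client_labels.items()
    }

# Toy labels for three clients (illustrative only, not real FashionMNIST data).
client_labels = {0: [0, 0, 1, 3], 1: [2, 2, 2], 2: [1, 3, 3, 3, 3]}
counts = per_client_class_counts(client_labels, num_classes=4)

for cid, c in counts.items():
    print(f"client {cid}: total={c.sum()} per-class={c.tolist()}")
```

The tally only matches the terminal output if it is run over the same samples the terminal output describes.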

TsingZ0 commented 3 weeks ago

Please differentiate between the "entire set" and the "training set." The training set is almost 75% of the entire set for each client.
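To illustrate the distinction, here is a minimal sketch of a per-client train/test split. The 0.75 fraction follows the "almost 75%" figure above; `split_client`, the seed, and the toy client are assumptions for illustration, not PFLlib's actual code.

```python
import numpy as np

def split_client(labels, train_fraction=0.75, seed=0):
    """Return (train_idx, test_idx) for one client's samples (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(labels))
    n_train = int(train_fraction * len(labels))
    return order[:n_train], order[n_train:]

# Toy client holding 400 samples of class 0 and 600 samples of class 1.
client_labels = np.array([0] * 400 + [1] * 600)
train_idx, test_idx = split_client(client_labels)

# Per-class counts for the entire set, the training part, and the test part.
entire = np.bincount(client_labels, minlength=2)
train = np.bincount(client_labels[train_idx], minlength=2)
test = np.bincount(client_labels[test_idx], minlength=2)

print("entire:", entire.tolist(), "train:", train.tolist(), "test:", test.tolist())
```

Counting only the training part gives per-class numbers smaller than those of the entire set, which would explain a mismatch like the one reported. Adding the training and test counts recovers the entire-set numbers.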

Amrusama commented 3 weeks ago

@TsingZ0 Thank you for the explanation. I suggest including this information in the dataset generation section. For instance, FashionMNIST comprises 70,000 samples in total (60,000 training and 10,000 test). My intention was for PFLlib to provide a non-IID version of the entire training set.

TsingZ0 commented 2 weeks ago

The test set can be reshuffled after it has been split.
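One reading of this suggestion is to shuffle a client's test samples after the split, applying the same permutation to features and labels so each sample keeps its label. The sketch below assumes NumPy arrays; `reshuffle` and the toy data are hypothetical and not a PFLlib function.

```python
import numpy as np

def reshuffle(x, y, seed=0):
    """Shuffle features and labels with one shared permutation (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(y))
    return x[perm], y[perm]

# Five toy test samples with two features each; sample i has label i.
x_test = np.arange(10).reshape(5, 2)
y_test = np.array([0, 1, 2, 3, 4])

x_shuf, y_shuf = reshuffle(x_test, y_test)
print(y_shuf.tolist())
```

Using a single permutation for both arrays is the point of the design: shuffling `x` and `y` independently would break the pairing between samples and labels.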