@wizard1203 Hi, in the FL community, researchers prefer to fix the non-IID distribution for a fair comparison, so our current data loaders all align with previous publications. Please refer to our white paper for details (the benchmark section). If you publish a paper that uses a different distribution, you have to rerun all the baselines and then demonstrate the advantages of your algorithm, which makes it hard for reviewers to confirm its authenticity.
In terms of flexibility, I agree with you that we should allow users to customize the distribution. This is also a direction in which FedML should improve. If it's not urgent, I will ask another engineer/volunteer to finish this development. Please stay tuned.
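A customizable non-IID split could be exposed as a single function. The sketch below is only an illustration of the usual Dirichlet (LDA) label-skew partition, not FedML's actual API; the function name and parameters are made up for this example, with `alpha` controlling the degree of heterogeneity and `client_num_in_total` taken from the user's input rather than from the dataset.

```python
import numpy as np

def dirichlet_partition(labels, client_num_in_total, alpha=0.5, seed=0):
    """Split sample indices into non-IID shards using a Dirichlet(alpha) prior.

    labels: 1-D numpy array of class labels for the whole training set.
    Returns: dict mapping client_id -> list of sample indices.
    """
    rng = np.random.default_rng(seed)
    num_classes = int(labels.max()) + 1
    client_indices = {cid: [] for cid in range(client_num_in_total)}

    for c in range(num_classes):
        # indices of all samples of class c, shuffled
        idx_c = np.where(labels == c)[0]
        rng.shuffle(idx_c)
        # sample each client's share of class c
        proportions = rng.dirichlet(alpha * np.ones(client_num_in_total))
        # cumulative split points inside idx_c
        split_points = (np.cumsum(proportions) * len(idx_c)).astype(int)[:-1]
        for cid, shard in enumerate(np.split(idx_c, split_points)):
            client_indices[cid].extend(shard.tolist())

    return client_indices
```

With a large `alpha` the split approaches IID; with a small `alpha` each client sees only a few classes, which is the usual heterogeneity knob in FedAvg-style benchmarks.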
@chaoyanghe Thanks for your detailed explanation. Yes, it will be more convenient for researchers to conduct experiments without rerunning the baselines. Furthermore, I have the following concerns (not about FedML itself, but about federated learning):
- For FedAvg, a setting with 50 computing clients (selection ratio = 0.5) can be handled by <= 10 GPUs when using ResNet-20. But if we use larger models, I doubt many people have that many GPU devices. So maybe in the future we can split non-IID ImageNet in a more flexible way. For example, the number of joining clients could be set to 2, 4, 8, 10, 16, 20, 32, 50, 64, 100, etc., and 10, 32, 64, and 100 could serve as four good baselines for researchers to compare against. If you are interested in this, I would like to split ImageNet and run some benchmark experiments with you. I have enough GPU resources (4 x 16 2080 Ti). Maybe this could be a new paper?
- In decentralized training and asynchronous centralized training, all the models are different, so the number of models must equal the number of non-IID parts of the dataset, which may mean huge memory consumption, especially when the model is large. Maybe this is inevitable (one possible workaround is sketched below).
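On the memory point, one workaround that works in simulation (this is just a sketch of the general pattern, not FedML's actual trainer; the function name and hyperparameters are made up here) is to keep every client's weights as a CPU state_dict and reuse a single GPU-resident model for whichever client is currently training:

```python
import torch
import torch.nn.functional as F

def train_clients_sequentially(global_model, client_loaders, local_steps=10,
                               lr=0.01, device="cuda"):
    """Simulate many clients on one GPU: only one model lives on the GPU,
    while each client's weights are kept as a CPU state_dict between turns."""
    # one CPU-side weight copy per client (decentralized / async-centralized setting)
    client_states = [
        {k: v.detach().cpu().clone() for k, v in global_model.state_dict().items()}
        for _ in client_loaders
    ]
    model = global_model.to(device)

    for cid, loader in enumerate(client_loaders):
        model.load_state_dict(client_states[cid])   # restore this client's weights
        optimizer = torch.optim.SGD(model.parameters(), lr=lr)
        for step, (x, y) in enumerate(loader):
            if step >= local_steps:
                break
            optimizer.zero_grad()
            loss = F.cross_entropy(model(x.to(device)), y.to(device))
            loss.backward()
            optimizer.step()
        # stash the updated weights back on the CPU
        client_states[cid] = {k: v.detach().cpu() for k, v in model.state_dict().items()}
    return client_states
```

The price is wall-clock time, since the clients run sequentially, but GPU memory stays at roughly one model plus one batch no matter how many non-IID partitions there are.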
Yes. I am also considering developing FedML + ImageNet in the near future. If you are interested in this research, we can collaborate. We can add some research ideas to improve this work.
It seems that the number of joining clients (not the number of computing clients) is fixed in fedml_api/data_preprocessing/**/data_loader and cannot be changed, except for the CIFAR10 dataset.
What I mean is that the total number of clients seems to be decided by the dataset, rather than by the input from run_fedavg_distributed_pytorch.sh.
https://github.com/FedML-AI/FedML/blob/3d9fda8d149c95f25ec4898e31df76f035a33b5d/fedml_api/data_preprocessing/MNIST/data_loader.py#L112
Specifically, the variable client_num_in_total is overwritten here by the data_loader, depending on the dataset. So I have two questions:
1. How should we partition the whole dataset into the client_num_in_total parts defined by the input?
2. Maybe there is a better way: write some APIs that split the data from the original dataset rather than building a new dataset? That would be more convenient and flexible, and users could download the original datasets and then use the data-split API, saving disk space.
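For question 2, something along these lines (purely a sketch of the idea, not an existing FedML API; the function name and the CIFAR10 choice are just for illustration) would let users download the original dataset once and build per-client loaders from an index partition, with no new dataset written to disk:

```python
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms

def build_client_loaders(root, client_indices, batch_size=32):
    """Wrap the original dataset with index-based views, one DataLoader per client.

    client_indices: dict client_id -> list of sample indices
                    (e.g. produced by a Dirichlet partition of the labels).
    """
    train_set = datasets.CIFAR10(root, train=True, download=True,
                                 transform=transforms.ToTensor())
    client_loaders = {}
    for cid, indices in client_indices.items():
        # Subset is only a view over the original data: no extra copy on disk
        client_loaders[cid] = DataLoader(Subset(train_set, indices),
                                         batch_size=batch_size, shuffle=True)
    return client_loaders
```

With this, client_num_in_total stays whatever the user passes on the command line; the index partition is the only thing that changes per experiment.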