diaoenmao / HeteroFL-Computation-and-Communication-Efficient-Federated-Learning-for-Heterogeneous-Clients

[ICLR 2021] HeteroFL: Computation and Communication Efficient Federated Learning for Heterogeneous Clients
MIT License

Non-iid test fail #5

Closed ypwang61 closed 2 years ago

ypwang61 commented 2 years ago

Hello! I'm trying to reproduce your paper, but it fails in the non-iid case on CIFAR10. I ran the command you recommend:

python train_classifier_fed.py --data_name CIFAR10 --model_name resnet18 --control_name 1_10_0.1_non-iid-2_dynamic_a1-e1_bn_0_0

But I get only 50% local accuracy and 10% accuracy after more than 400 epochs. Is there a problem? The output looks like this:

Model: 0_CIFAR10_label_resnet18_1_10_0.1_non-iid-2_dynamic_a1-e1_bn_0_0 Train Epoch: 441(0%) Local-Loss: 0.7467 Local-Accuracy: 50.1040 ID: 8(1/1) Learning rate: 0.010000000000000002 Rate: 0.0625 Epoch Finished Time: 0:00:00 Experiment Finished Time: 8:28:55
Model: 0_CIFAR10_label_resnet18_1_10_0.1_non-iid-2_dynamic_a1-e1_bn_0_0 Test Epoch: 441(100%) Local-Loss: 2.3063 Local-Accuracy: 10.0000 Global-Loss: 2.3063 Global-Accuracy: 10.0000

diaoenmao commented 2 years ago

Thanks for your interest in our work. There was something wrong with the TensorBoard logger, and I have updated the source code. Here is what I get. Please let me know if you need any other assistance.

Model: 0_CIFAR10_label_resnet18_1_10_0.1_non-iid-2_dynamic_a1-e1_bn_1_1 Train Epoch: 2(0%) Local-Loss: 0.3907 Local-Accuracy: 83.5240 ID: 4(1/1) Learning rate: 0.1 Rate: 1.0 Epoch Finished Time: 0:00:00 Experiment Finished Time: 10:13:44
Model: 0_CIFAR10_label_resnet18_1_10_0.1_non-iid-2_dynamic_a1-e1_bn_1_1 Test Epoch: 2(100%) Local-Loss: 2.5939 Local-Accuracy: 47.9500 Global-Loss: 6.8864 Global-Accuracy: 17.4800

ypwang61 commented 2 years ago

Thanks, but have you updated the logger for the conv model? I tried running your code with conv on CIFAR10, but I also always get 10% global accuracy.

Like this:
Model: 0_CIFAR10_label_conv_1_10_0.4_non-iid-2_fix_a4-b3-c3_bn_1_1 Train Epoch: 31(0%) Local-Loss: 0.7120 Local-Accuracy: 50.2400 ID: 2(1/4) Learning rate: 0.1 Rate: 1.0 Epoch Finished Time: 0:01:29.412888 Experiment Finished Time: 1 day, 1:29:27.412888
Model: 0_CIFAR10_label_conv_1_10_0.4_non-iid-2_fix_a4-b3-c3_bn_1_1 Train Epoch: 31(50%) Local-Loss: 0.7127 Local-Accuracy: 50.1840 ID: 0(3/4) Learning rate: 0.1 Rate: 1.0 Epoch Finished Time: 0:00:28.520539 Experiment Finished Time: 1 day, 0:22:37.520539
Model: 0_CIFAR10_label_conv_1_10_0.4_non-iid-2_fix_a4-b3-c3_bn_1_1 Test Epoch: 31(100%) Local-Loss: 0.6738 Local-Accuracy: 52.6316 Global-Loss: 2.4238 Global-Accuracy: 10.0000

diaoenmao commented 2 years ago

The logger is the same for all model architectures. This is what I get from the same experiment for the first three epochs:

Model: 0_CIFAR10_label_conv_1_10_0.4_non-iid-2_fix_a4-b3-c3_bn_1_1 Test Epoch: 1(100%) Local-Loss: 1.5772 Local-Accuracy: 54.9684 Global-Loss: 5.4600 Global-Accuracy: 18.4600
Model: 0_CIFAR10_label_conv_1_10_0.4_non-iid-2_fix_a4-b3-c3_bn_1_1 Test Epoch: 2(100%) Local-Loss: 0.6301 Local-Accuracy: 70.2421 Global-Loss: 3.5569 Global-Accuracy: 18.3900
Model: 0_CIFAR10_label_conv_1_10_0.4_non-iid-2_fix_a4-b3-c3_bn_1_1 Test Epoch: 3(100%) Local-Loss: 0.5531 Local-Accuracy: 72.0842 Global-Loss: 3.0413 Global-Accuracy: 22.6000

ypwang61 commented 2 years ago

Thank you for your help. But I wonder what client settings you used in your paper for the non-iid cases with the resnet model and the CIFAR10 dataset? I tried the default setting but get a low global accuracy (I should have used masked entropy) after 800 epochs:

Model: 0_CIFAR10_label_resnet18_1_10_0.4_non-iid-2_fix_a4-b3-c3_bn_1_1 Train Epoch: 800(0%) Local-Loss: 0.0205 Local-Accuracy: 99.6120 ID: 5(1/4) Learning rate: 0.0010000000000000002 Rate: 0.5 Epoch Finished Time: 0:05:12.298046 Experiment Finished Time: 0:05:12.298046
Model: 0_CIFAR10_label_resnet18_1_10_0.4_non-iid-2_fix_a4-b3-c3_bn_1_1 Train Epoch: 800(50%) Local-Loss: 0.0147 Local-Accuracy: 99.7093 ID: 3(3/4) Learning rate: 0.0010000000000000002 Rate: 1.0 Epoch Finished Time: 0:01:44.859544 Experiment Finished Time: 0:01:44.859544
Model: 0_CIFAR10_label_resnet18_1_10_0.4_non-iid-2_fix_a4-b3-c3_bn_1_1 Test Epoch: 800(100%) Local-Loss: 2.4535 Local-Accuracy: 70.7200 Global-Loss: 4.1188 Global-Accuracy: 19.3200

diaoenmao commented 2 years ago

I use make.py to generate my bash scripts. In the paper, I use 100 users and an active rate of 0.1. The 10-user example in readme.md is just for demonstration. If you have 10 clients and each of them has only two classes of data, I suggest reducing the number of local epochs in utils.py from 5 to 1, or even trying FedSGD. It is known that global performance on non-iid data degrades when clients train for multiple local epochs.
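For readers following the thread, here is a toy sketch of the two knobs discussed in this comment: how many clients are active per round, and how many local epochs each one runs before the server averages. It is not the repository's training loop. The function names are hypothetical, a least-squares model stands in for the network, one "local epoch" is a single full-batch gradient step, and rounding the active count up with `ceil` is an assumption.

```python
import math

import numpy as np


def num_active_clients(num_users, frac):
    """Clients sampled per round; assumes ceil(frac * num_users)."""
    # round() guards against floating-point noise in frac * num_users
    return max(1, math.ceil(round(frac * num_users, 6)))


def sample_active(num_users, frac, rng):
    """Pick the active client ids for one round, without replacement."""
    k = num_active_clients(num_users, frac)
    return rng.choice(num_users, size=k, replace=False).tolist()


def local_update(w, X, y, local_epochs, lr):
    """Run `local_epochs` full-batch gradient steps of least squares on one client."""
    w = w.copy()
    for _ in range(local_epochs):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w


def fed_round(w_global, clients, active_ids, local_epochs, lr):
    """One round: every active client trains locally, the server averages the weights."""
    updates = [
        local_update(w_global, *clients[i], local_epochs, lr) for i in active_ids
    ]
    return np.mean(updates, axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Four hypothetical clients with equally sized local datasets
    clients = [(rng.normal(size=(8, 3)), rng.normal(size=8)) for _ in range(4)]
    w = np.zeros(3)
    active = sample_active(num_users=4, frac=0.5, rng=rng)
    w = fed_round(w, clients, active, local_epochs=1, lr=0.1)
    print("active clients:", active, "weights:", w)
```

In this toy setting, with `local_epochs=1` and equally sized clients, averaging the client weights is the same as one gradient step on the pooled data of the active clients, which is the FedSGD behaviour suggested above. With more local epochs each client moves further toward its own optimum before averaging, which is where non-iid clients start to pull the global model apart. The active counts also match the logs in this thread: 10 users at 0.1 gives one active client per round (`1/1`), and 10 users at 0.4 gives four (`1/4`).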

ypwang61 commented 2 years ago

Thank you! Then if I use make.py to generate the bash script for the non-iid case, does each user get 2 classes of data (non-iid-2), as in your paper?

diaoenmao commented 2 years ago

Yes. The code for data partitioning is in data.py.
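As an illustration of what a "2 classes per user" split means, the sketch below partitions sample indices by label so that every user holds exactly `classes_per_user` classes. This is not the code in data.py. The function name `split_non_iid` and the round-robin assignment of classes to users are assumptions made for the example, and it requires `classes_per_user` to be no larger than the number of classes.

```python
import numpy as np


def split_non_iid(labels, num_users, classes_per_user, seed=0):
    """Return {user_id: [sample indices]} with `classes_per_user` classes per user."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    rng = np.random.default_rng(seed)

    # User u holds `classes_per_user` consecutive classes, wrapping around.
    user_classes = [
        [classes[(u * classes_per_user + j) % len(classes)]
         for j in range(classes_per_user)]
        for u in range(num_users)
    ]

    user_indices = {u: [] for u in range(num_users)}
    for c in classes:
        holders = [u for u in range(num_users) if c in user_classes[u]]
        if not holders:
            continue  # no user was assigned this class
        # Shuffle the samples of class c and share them evenly among its holders.
        idx = rng.permutation(np.flatnonzero(labels == c))
        for u, chunk in zip(holders, np.array_split(idx, len(holders))):
            user_indices[u].extend(chunk.tolist())
    return user_indices


if __name__ == "__main__":
    # Hypothetical label vector: 10 classes, 20 samples each
    labels = np.repeat(np.arange(10), 20)
    parts = split_non_iid(labels, num_users=10, classes_per_user=2)
    for u, idx in parts.items():
        print(u, sorted(set(labels[idx].tolist())), len(idx))
```

With 10 users and 10 classes, each class ends up shared by two users, so every client sees only a fifth of the label space. That is the setting in which the earlier advice about fewer local epochs applies.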