How the results in Table 1 were obtained?

1170300714 commented 5 months ago

Thanks for your great work and great code!

I reproduce the experiments on MNIST under NIID Partition Strategy : Sharding with FedAvg and Fedprox. I modify the dataset name to mnist in both fedavg.json and fedprox.json, and the local_epochs is modified to 3. I got the [Server Stat] Acc = 86 at round 67 for FedAvg and got [Server Stat] Acc = 90 at round 67 for Fedprox, which is obviously higher than their reported in Tab.1 of your paper.

So my question is how the results in Table 1 were obtained? Were the results obtained by averaging the performance of each round instead of selecting the best acc in a certain round?

In fact, I am new to federated learning and not very familiar with the methods of performance calculation. I appreciate your guidance.

Lee-Gihun commented 5 months ago

Thank you for engaging with our research!

Indeed, the performance in federated learning can vary significantly from one round to another due to numerous factors including the level of data heterogeneity, learning rates, the ratio of participating clients, and the number of epochs completed locally in each round.

In our FedNTD paper, we addressed these fluctuations by (1) applying an exponential moving average (EMA) with a smoothing factor of 0.6 to the accuracy across rounds and report the final EMA value, and (2) conducting ablation study across various learning factors.

I think it's common in FL research to report the averaged accuracy at the final round over multiple runs with different seeds. While we utilized the EMA approach, any consistent and transparent method of reporting, including the strategy you mentioned, is generally acceptable, provided it consistently reflects reliable results.

(+ If you use W&B, they provides the EMA function in the dashboard.)

1170300714 commented 5 months ago

收到，感谢您的来件

1170300714 commented 5 months ago

Thanks for your answer!

I have finished the training of FedAvg on mnist.

The last several lines of logs are reports below:

[Round 199/200] Elapsed 49.6s (Current Time: 00:56:02)
[Local Stat (Train Acc)]: [0.9736, 0.9953, 0.9115, 0.9435, 0.9905, 0.9295, 0.9698, 0.9032, 1.0, 0.9852], Avg - 0.96 (std 0.03)
[Local Stat (Test Acc)]: [0.5191, 0.3573, 0.4135, 0.2917, 0.5635, 0.4424, 0.3003, 0.5095, 0.1082, 0.3856], Avg - 0.39 (std 0.13)
[Server Stat] Acc - 0.94

[Round 200/200] Elapsed 49.5s (Current Time: 00:56:52)
[Local Stat (Train Acc)]: [1.0, 0.9857, 0.9798, 0.9362, 0.8648, 0.9937, 0.9868, 0.9831, 0.9845, 0.9787], Avg - 0.97 (std 0.04)
[Local Stat (Test Acc)]: [0.3482, 0.4239, 0.4297, 0.7615, 0.4015, 0.5942, 0.472, 0.5408, 0.4242, 0.2409], Avg - 0.46 (std 0.14)
[Server Stat] Acc - 0.94

wandb: WARNING Saving files without folders. If you want to preserve sub directories pass base_path to wandb.save, i.e. wandb.save("/mnt/folder/file.h5", bas e_path="/mnt")
wandb: Waiting for W&B process to finish... (success).
wandb:
wandb: Run history:
wandb: in_dist_acc_mean ▁▄▃▃▅▇▇▇█▇▇▆▆▇▇▇▆▇▇████████▇█▇█▇▇█▇█████
wandb: in_dist_acc_prop ▁▄▃▃▅▇▇▇█▇▇▆▆▇▇▇▆▇▇████████▇█▇█▇▇█▇█████
wandb: in_dist_acc_std █▃▆▄▂▁▂▂▁▂▂▂▄▁▁▂▃▁▂▁▁▁▁▁▁▁▁▂▁▃▁▃▂▁▂▁▁▁▁▂
wandb: in_dout_acc_mean ▁▁▁▁▁▁▁▁▂▁▂▁▂▁▂▂▂▃▂▂▃▂▃▃▄▃▆▅▄▄▇▅▅▄▆▆▄█▇▇
wandb: local_test_acc ▁▁▁▁▁▂▁▁▂▂▂▁▂▁▂▂▂▃▂▃▃▂▃▃▄▃▆▅▄▄▇▅▅▄▆▆▄█▇▇
wandb: local_train_acc ▁▃▃▃▄▆▇▆▆▆▆▅▇▇▇▇▆▇▇▇█▇▇▇▇▇█▆▇▇█▇▇█▇██▇██
wandb: out_dist_acc ▁▁▁▁▁▁▁▁▂▁▁▁▂▁▂▂▂▃▂▂▃▂▃▃▄▃▆▅▄▄▇▅▅▄▆▆▄█▇▇
wandb: server_test_acc ▂▁▃▂▄▅▄▆▆▆▆▇▆▅▇▇▇█▆█▇▇██▇████▆█▇████████
wandb:
wandb: Run summary:
wandb: in_dist_acc_mean 0.99216
wandb: in_dist_acc_prop 0.9924
wandb: in_dist_acc_std 0.01795
wandb: in_dout_acc_mean 0.40496
wandb: local_test_acc 0.46369
wandb: local_train_acc 0.96933
wandb: out_dist_acc 0.40771
wandb: server_test_acc 0.9407

It seems that the "server_test_acc" is 94.07, which is significantly higher than the results reported in your Tab. 1. So, this is the result without EMA, right? (That is, the result from the last round?) Do we need to manually operate Wandb to get the results with EMA?

And the following is config corresponding to this training. I would really appreciate it if you could help me check if there are any issues with this config~

{'data_setups': {'batch_size': 50,
'dataset_name': 'mnist',
'n_clients': 100,
'partition': {'method': 'sharding', 'shard_per_user': 2},
'root': './data'},
'train_setups': {'algo': {'name': 'fedavg', 'params': {}},
'model': {'name': 'fedavg_mnist', 'params': {}},
'optimizer': {'params': {'lr': 0.01,
'momentum': 0.9,
'weight_decay': 1e-05}},
'scenario': {'device': 'cuda:0',
'local_epochs': 3,
'n_rounds': 200,
'sample_ratio': 0.1},
'scheduler': {'enabled': True,
'name': 'step',
'params': {'gamma': 0.99, 'step_size': 1}},
'seed': 2022},
'wandb_setups': {'group': 'fedavg',
'name': 'fedavg',
'project': 'NeurIPS2022'}}

1170300714 commented 5 months ago

Hello，this is the results after EMA with 0.6. It seems that the "server_test_acc" after EMA is 93.42, which is also significantly higher than the results reported in your Tab. 1.

Lee-Gihun commented 5 months ago

Apologies, I think I misunderstood your earlier question. You were asking about the performance gap between the official code and the reported values in the paper. This gap arises from using the CutOut technique in our experiment.

For more details, please check the discussion on MNIST performance in the official NeurIPS2022 Review - Performance of FedAvg on MNIST, where we discussed about the MNIST performance with the reviewer sAmP.

We chose to omit the cutout technique from the MNIST dataset in the released code to make the results clearer and more straightforward, as a typical training setup achieves over 94% accuracy, similar to what you observed. (+The configuration seems okay, and you are free to adjust the configuration settings as you see fit, including changing the number of local epochs to 3.)

Lee-Gihun commented 5 months ago

According to our previous experiment, FedAvg with Sharding (s=2) with the number of local epochs 5 achieves a 95.8% server test accuracy, which appears to be quite consistent with your results.

1170300714 commented 5 months ago

Thank you! My confusion is resolved~

Lee-Gihun / FedNTD

How the results in Table 1 were obtained? #4