alibaba / FederatedScope

An easy-to-use federated learning platform
https://www.federatedscope.io
Apache License 2.0

Issue with federate.method set to global #771

Open KKNakkav2 opened 4 months ago

KKNakkav2 commented 4 months ago

Hello,

I launched the experiment with the following command:

python federatedscope/main.py --cfg federatedscope/cv/baseline/fedavg_convnet2_on_cifar10.yaml federate.client_num 1 federate.sample_client_rate 1.0 federate.method global

However, it looks like the model used for evaluation is never updated. The test accuracy stays at 11% in every round, while the training accuracy keeps improving.

2024-04-29 16:42:48,817 (client:357) INFO: {'Role': 'Client #1', 'Round': 0, 'Results_raw': {'train_avg_loss': 1.446278, 'train_total': 50000, 'train_acc': 0.48616, 'train_correct': 24308.0, 'train_loss': 72313.922022}}
2024-04-29 16:42:48,820 (server:344) INFO: Server: Starting evaluation at the end of round 0.
2024-04-29 16:42:50,443 (context:296) WARNING: No val_data or val_loader in the trainer, will skip evaluation.If this is not the case you want, please check whether there is typo for the name
2024-04-29 16:42:50,445 (server:960) INFO: {'Role': 'Server #', 'Round': 1, 'Results_raw': {'test_avg_loss': 2.301025, 'test_total': 10000, 'test_acc': 0.11, 'test_correct': 1100.0, 'test_loss': 23010.249565}}
2024-04-29 16:42:50,445 (server:350) INFO: ----------- Starting a new training round (Round #1) -------------
2024-04-29 16:42:59,204 (client:357) INFO: {'Role': 'Client #1', 'Round': 1, 'Results_raw': {'train_avg_loss': 1.096129, 'train_total': 50000, 'train_acc': 0.6146, 'train_correct': 30730.0, 'train_loss': 54806.453947}}
2024-04-29 16:42:59,206 (server:344) INFO: Server: Starting evaluation at the end of round 1.
2024-04-29 16:43:00,697 (context:296) WARNING: No val_data or val_loader in the trainer, will skip evaluation.If this is not the case you want, please check whether there is typo for the name
2024-04-29 16:43:00,697 (server:960) INFO: {'Role': 'Server #', 'Round': 2, 'Results_raw': {'test_avg_loss': 2.301025, 'test_total': 10000, 'test_acc': 0.11, 'test_correct': 1100.0, 'test_loss': 23010.249565}}
2024-04-29 16:43:00,697 (server:350) INFO: ----------- Starting a new training round (Round #2) -------------
2024-04-29 16:43:09,473 (client:357) INFO: {'Role': 'Client #1', 'Round': 2, 'Results_raw': {'train_avg_loss': 0.959477, 'train_total': 50000, 'train_acc': 0.66432, 'train_correct': 33216.0, 'train_loss': 47973.853004}}
2024-04-29 16:43:09,474 (server:344) INFO: Server: Starting evaluation at the end of round 2.
2024-04-29 16:43:11,000 (context:296) WARNING: No val_data or val_loader in the trainer, will skip evaluation.If this is not the case you want, please check whether there is typo for the name
2024-04-29 16:43:11,000 (server:960) INFO: {'Role': 'Server #', 'Round': 3, 'Results_raw': {'test_avg_loss': 2.301025, 'test_total': 10000, 'test_acc': 0.11, 'test_correct': 1100.0, 'test_loss': 23010.249565}}
2024-04-29 16:43:11,000 (server:350) INFO: ----------- Starting a new training round (Round #3) -------------
2024-04-29 16:43:19,756 (client:357) INFO: {'Role': 'Client #1', 'Round': 3, 'Results_raw': {'train_avg_loss': 0.867314, 'train_total': 50000, 'train_acc': 0.6992, 'train_correct': 34960.0, 'train_loss': 43365.681585}}
2024-04-29 16:43:19,757 (server:344) INFO: Server: Starting evaluation at the end of round 3.
2024-04-29 16:43:21,245 (context:296) WARNING: No val_data or val_loader in the trainer, will skip evaluation.If this is not the case you want, please check whether there is typo for the name
2024-04-29 16:43:21,245 (server:960) INFO: {'Role': 'Server #', 'Round': 4, 'Results_raw': {'test_avg_loss': 2.301025, 'test_total': 10000, 'test_acc': 0.11, 'test_correct': 1100.0, 'test_loss': 23010.249565}}
2024-04-29 16:43:21,246 (server:350) INFO: ----------- Starting a new training round (Round #4) -------------
2024-04-29 16:43:29,954 (client:357) INFO: {'Role': 'Client #1', 'Round': 4, 'Results_raw': {'train_avg_loss': 0.794839, 'train_total': 50000, 'train_acc': 0.72466, 'train_correct': 36233.0, 'train_loss': 39741.947819}}
2024-04-29 16:43:29,956 (server:344) INFO: Server: Starting evaluation at the end of round 4.
2024-04-29 16:43:31,466 (context:296) WARNING: No val_data or val_loader in the trainer, will skip evaluation.If this is not the case you want, please check whether there is typo for the name
2024-04-29 16:43:31,466 (server:960) INFO: {'Role': 'Server #', 'Round': 5, 'Results_raw': {'test_avg_loss': 2.301025, 'test_total': 10000, 'test_acc': 0.11, 'test_correct': 1100.0, 'test_loss': 23010.249565}}
2024-04-29 16:43:31,467 (server:350) INFO: ----------- Starting a new training round (Round #5) -------------
2024-04-29 16:43:40,151 (client:357) INFO: {'Role': 'Client #1', 'Round': 5, 'Results_raw': {'train_avg_loss': 0.730278, 'train_total': 50000, 'train_acc': 0.74824, 'train_correct': 37412.0, 'train_loss': 36513.923683}}
2024-04-29 16:43:40,153 (server:344) INFO: Server: Starting evaluation at the end of round 5.
2024-04-29 16:43:41,618 (context:296) WARNING: No val_data or val_loader in the trainer, will skip evaluation.If this is not the case you want, please check whether there is typo for the name
2024-04-29 16:43:41,619 (server:960) INFO: {'Role': 'Server #', 'Round': 6, 'Results_raw': {'test_avg_loss': 2.301025, 'test_total': 10000, 'test_acc': 0.11, 'test_correct': 1100.0, 'test_loss': 23010.249565}}
2024-04-29 16:43:41,619 (server:350) INFO: ----------- Starting a new training round (Round #6) -------------
2024-04-29 16:43:50,265 (client:357) INFO: {'Role': 'Client #1', 'Round': 6, 'Results_raw': {'train_avg_loss': 0.671751, 'train_total': 50000, 'train_acc': 0.77108, 'train_correct': 38554.0, 'train_loss': 33587.532097}}
2024-04-29 16:43:50,266 (server:344) INFO: Server: Starting evaluation at the end of round 6.
2024-04-29 16:43:51,771 (context:296) WARNING: No val_data or val_loader in the trainer, will skip evaluation.If this is not the case you want, please check whether there is typo for the name
2024-04-29 16:43:51,771 (server:960) INFO: {'Role': 'Server #', 'Round': 7, 'Results_raw': {'test_avg_loss': 2.301025, 'test_total': 10000, 'test_acc': 0.11, 'test_correct': 1100.0, 'test_loss': 23010.249565}}
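
As a rough sanity check on the log above: the constant test_avg_loss is very close to ln(10), the cross-entropy of a uniform guess over the 10 CIFAR-10 classes, and the test accuracy sits near the 10% chance level. That is what one would expect from a model that is still at (or near) its initial weights. The snippet below only redoes this arithmetic on the logged values; it assumes the reported loss is a mean cross-entropy and does not call FederatedScope.

```python
import math

num_classes = 10  # CIFAR-10 has 10 classes

# Cross-entropy of a classifier that assigns equal probability to every class
uniform_loss = math.log(num_classes)

# Values copied from the server log lines above
logged_test_avg_loss = 2.301025
logged_test_acc = 1100.0 / 10000  # test_correct / test_total

# Chance-level accuracy for a balanced 10-class problem
chance_acc = 1.0 / num_classes

gap = abs(uniform_loss - logged_test_avg_loss)
print(f"uniform loss = {uniform_loss:.6f}, logged loss = {logged_test_avg_loss:.6f}, gap = {gap:.6f}")
print(f"chance accuracy = {chance_acc:.2f}, logged accuracy = {logged_test_acc:.2f}")
```
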

KKNakkav2 commented 4 months ago

I think I have found one reason for this behaviour.

When federate.method is set to global, no model_para is broadcast (see workers/server.py) to the single client (worker index 1, I think) where local training happens. In addition, because merge_test_data and make_global_eval are both set to True, evaluation runs on the server (worker index 0), which never receives the updated model.

If the method is set to global, I think we should probably not enable merge_test_data or make_global_eval. Please correct me if I am wrong. The same reasoning applies when federate.method is set to local, since there is no broadcast in that setting either.
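
To illustrate the suspected mechanism, here is a minimal toy sketch. It is not FederatedScope code: ToyWorker, run, and sync_back are hypothetical names made up for this example. The server and the client each hold their own copy of the weights; if nothing is exchanged between them, the error measured on the server's copy stays constant even though the client keeps training.

```python
import copy


class ToyWorker:
    """Hypothetical stand-in for a server or a client (not a FederatedScope class)."""

    def __init__(self, weights):
        # Each worker owns an independent copy of the model weights
        self.weights = copy.deepcopy(weights)

    def train_step(self, target, lr=0.5):
        # Stand-in for local training: move each weight part of the way to the target
        self.weights = [w + lr * (t - w) for w, t in zip(self.weights, target)]

    def error(self, target):
        # Stand-in for evaluation: squared distance between weights and target
        return sum((w - t) ** 2 for w, t in zip(self.weights, target))


def run(rounds, sync_back):
    """Train on the client, evaluate on the server, optionally syncing weights."""
    init, target = [0.0, 0.0], [1.0, -1.0]
    server, client = ToyWorker(init), ToyWorker(init)
    server_errors = []
    for _ in range(rounds):
        client.train_step(target)
        if sync_back:
            # The step suspected to be missing: the server gets the trained weights
            server.weights = copy.deepcopy(client.weights)
        server_errors.append(server.error(target))
    return server_errors


stale = run(3, sync_back=False)   # server evaluates weights it never received
synced = run(3, sync_back=True)   # server evaluates the client's trained weights
print("no sync:", stale)
print("sync:   ", synced)
```

Without the sync step the server-side error is identical in every round, which mirrors the flat test metrics in the log. With it, the error shrinks each round. Whether the real fix belongs in the server's evaluation path or in the config defaults is a separate question for the maintainers.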