Closed — wangzhenzhou2020 closed this issue 2 years ago
Has anybody encountered this problem?
Please comment out or delete lines 62-64 in FedAvgClientManager.py and lines 59-63 in FedAvgServerManager.py. This will fix your problem.
Thanks. I solved that problem two weeks ago, just as you said. But I've encountered another problem: at the final round, FedAvgServerManager.py calls FedAVGAggregator.py (the red arrow).
But in FedAvgAggregator.py, in test_on_server_for_all_clients(), the last line, logging.info(stats), sometimes fails to log test_acc and test_loss.
That last line is where test_acc and test_loss are logged; the "here 1" ... "here 6" markers are lines I added to help me debug.
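(For reference, here is a rough sketch of what this function looks like in my copy, with the debug logs I added. The helper name `_test_one_client` and the metric math are approximations for illustration, not the exact FedML source.)

```python
# Rough sketch of FedAVGAggregator.test_on_server_for_all_clients with my
# "here 1" ... "here 6" debug logs. Names and metric math are approximate,
# not the exact FedML source.
import logging

def test_on_server_for_all_clients(self, round_idx):
    logging.info("here 1")                      # entered the function
    test_num_samples, test_tot_corrects, test_losses = [], [], []
    for client_idx in range(self.client_num):
        logging.info("here 2")                  # before testing this client
        tot_correct, num_sample, loss = self._test_one_client(client_idx)
        logging.info("here 3")                  # after testing this client
        test_tot_corrects.append(tot_correct)
        test_num_samples.append(num_sample)
        test_losses.append(loss)
    logging.info("here 4")                      # all clients tested
    test_acc = sum(test_tot_corrects) / sum(test_num_samples)
    test_loss = sum(test_losses) / sum(test_num_samples)
    logging.info("here 5")
    stats = {"test_acc": test_acc, "test_loss": test_loss, "round": round_idx}
    logging.info("here 6")                      # right before the final log
    logging.info(stats)                         # this last line sometimes never appears
```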
Now, first, I'll show the normal way the code ends:
You can see the test_acc and test_loss (in FedAvgAggregator.py) and __finish (in FedAvgServerManager.py).
But sometimes the code ends strangely:
Or:
It may stop somewhere between "here 1" ... "here 6" and never log test_acc and test_loss.
Do you have the same problem?
How do I fix this? The line numbers don't match up. Which directory is this file under?
The code line numbers don't match.
It's still stuck.
"Please comment out or delete lines 62-64 in FedAvgClientManager.py and lines 59-63 in FedAvgServerManager.py. This will fix your problem" — this still doesn't work for me; it gets stuck partway through.
I want to ask how you ran run_fedavg_distributed_pytorch.sh successfully. I tried to run it on a single machine with a single GPU, but it always fails with the error "mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated."
@wangzhenzhou2020 I have a question about the distributed run: why is GPU usage so uneven? CUDA0 is the fastest and CUDA3 is the slowest, with a 10x difference in speed, and the slower GPUs eventually run out of memory.
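(One thing that can cause this kind of imbalance is every MPI rank defaulting to cuda:0 instead of being spread across the GPUs. A minimal sketch of pinning each worker process to its own GPU by rank — this is a generic workaround, not FedML's gpu_mapping.yaml mechanism:)

```python
# Minimal sketch: pin each MPI worker process to its own GPU by rank,
# so the processes do not all pile onto cuda:0.
# This is a generic workaround, not FedML's gpu_mapping.yaml mechanism.
import torch
from mpi4py import MPI

rank = MPI.COMM_WORLD.Get_rank()
gpu_count = torch.cuda.device_count()

if gpu_count > 0:
    device = torch.device(f"cuda:{rank % gpu_count}")  # round-robin over GPUs
    torch.cuda.set_device(device)
else:
    device = torch.device("cpu")

print(f"rank {rank} -> {device}")
```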
@rG223 @jackdoll @897123856413
Please check our new examples at: https://github.com/FedML-AI/FedML/tree/master/python/examples
We've upgraded our library a lot in recent versions. Here is a brief introduction: https://medium.com/@FedML/fedml-ai-platform-releases-the-worlds-federated-learning-open-platform-on-public-cloud-with-an-8024e68a70b6
I found the problem. For every worker (process),
it eventually gets stuck at:
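(If it helps anyone else locate where each rank hangs, here is a small debugging sketch, independent of FedML, that makes every worker process dump its Python stack on demand so you can see exactly which call it is blocked in:)

```python
# Small debugging aid, independent of FedML: register a handler so that sending
# SIGUSR1 to a hung worker process (kill -USR1 <pid>) makes it dump the Python
# stack of every thread, showing the exact call it is blocked in.
import faulthandler
import os
import signal

faulthandler.register(signal.SIGUSR1, all_threads=True)
print(f"worker pid {os.getpid()}: run `kill -USR1 {os.getpid()}` to dump stacks")
```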