FedML-AI / FedML

FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI jobs on any GPU cloud or on-premise cluster. Built on this library, TensorOpera AI (https://TensorOpera.ai) is your generative AI platform at scale.
https://TensorOpera.ai
Apache License 2.0

fedml_experiments/distributed/fedavg run_fedavg_distributed_pytorch.sh stuck #126

Closed. wangzhenzhou2020 closed this issue 2 years ago.

wangzhenzhou2020 commented 3 years ago

I found the problem. For every worker (process): (screenshot)

Eventually, it gets stuck at: (screenshot)
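
A minimal diagnostic sketch for hangs like this (not FedML code, just Python's standard faulthandler module): register a signal handler or a periodic timer that dumps every thread's stack, so you can see exactly which call each worker is blocked in.

```python
# Diagnostic sketch (not FedML code): dump Python stack traces from a stuck
# MPI worker so you can see which call it is blocked in.
# Add near the top of the training entry script.
import faulthandler
import signal

# Dump all thread stacks when the process receives SIGUSR1,
# e.g. run `kill -USR1 <pid>` against each stuck worker.
faulthandler.register(signal.SIGUSR1)

# Or dump all stacks every 10 minutes, so a long hang leaves a trace in the logs.
faulthandler.dump_traceback_later(timeout=600, repeat=True)
```

If the dumps show a rank sitting inside an MPI send or receive, the peer it is waiting for is usually the process that died or never entered the handshake.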

wangzhenzhou2020 commented 3 years ago

Has anybody else encountered this problem?

zhang-tuo-pdf commented 3 years ago

Please comment out or delete lines 62-64 in FedAvgClientManager.py and lines 59-63 in FedAvgServerManager.py. This will fix your problem.

wangzhenzhou2020 commented 3 years ago

Thanks. I solved that problem two weeks ago, just as you said. But I've run into another problem: at the final round, FedAvgServerManager.py calls FedAVGAggregator.py (the red arrow in the screenshot). (screenshot)

But in FedAVGAggregator.py, in test_on_server_for_all_clients(), the last line, logging.info(stats), sometimes fails to log test_acc and test_loss. (screenshot)

At the last line it should log test_acc and test_loss; the markers here 1 ... here 6 are extra log statements I added to help with debugging.

First, here is how the code normally ends:

(screenshot) You can see test_acc and test_loss (in FedAVGAggregator.py) and __finish (in FedAvgServerManager.py).

But sometimes the code ends strangely: (screenshot)

Or: (screenshot)

It may end anywhere between here 1 and here 6 without logging test_acc and test_loss.

Do you have the same problem?
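
One plausible cause (a guess, not something confirmed in the FedML code): the MPI job is torn down right after the last round, before the server's final log records are flushed, so whether test_acc and test_loss appear depends on timing. A minimal defensive sketch, assuming mpi4py and Python's logging module (the function name below is hypothetical, not a FedML API):

```python
# Sketch only: flush logs and synchronize ranks before shutting down.
# `log_final_stats` is a hypothetical helper, not a FedML API.
import logging

from mpi4py import MPI


def log_final_stats(stats):
    # stats is expected to look like {"test_acc": ..., "test_loss": ...}
    logging.info(stats)

    # Push any buffered records to disk/console before the process can exit.
    for handler in logging.getLogger().handlers:
        handler.flush()

    # Keep every rank alive until the server has finished logging,
    # so no rank triggers shutdown while the last record is in flight.
    MPI.COMM_WORLD.Barrier()
```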

897123856413 commented 3 years ago

How do I fix this? The line numbers don't match up, and which directory is this file under?

897123856413 commented 3 years ago

The line numbers don't match the code; it's still stuck.

897123856413 commented 3 years ago

"Please comment out or delete lines 62-64 in FedAvgClientManager.py and lines 59-63 in FedAvgServerManager.py. This will fix your problem" still doesn't work; it still gets stuck in the middle.

jackdoll commented 2 years ago

I want to ask how you ran run_fedavg_distributed_pytorch.sh successfully. I tried to run it on a single computer with a single GPU, but it always fails with the error "mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated."
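
The mpirun message only reports that some rank exited with a non-zero status; the actual exception is printed by that rank itself and is easy to miss. A small wrapper like the sketch below (the `main()` name is a stand-in for the script's real entry point) makes each rank print its rank number and full traceback, which usually reveals the real cause, for example an out-of-memory error when several worker processes share a single GPU.

```python
# Sketch only: make each MPI rank report its own traceback before mpirun
# kills the job. `main()` is a placeholder for the real training entry point.
import sys
import traceback

from mpi4py import MPI


def main():
    ...  # the actual FedAvg training code


if __name__ == "__main__":
    rank = MPI.COMM_WORLD.Get_rank()
    try:
        main()
    except Exception:
        # Show which rank failed and why, so the "non-zero status" message
        # from mpirun can be traced back to a concrete Python exception.
        print(f"rank {rank} crashed:", file=sys.stderr)
        traceback.print_exc()
        # Abort the whole job instead of leaving the other ranks hanging.
        MPI.COMM_WORLD.Abort(1)
```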

rG223 commented 2 years ago

@wangzhenzhou2020 I have a question about the distributed run: GPU usage is very uneven. CUDA0 is the fastest and CUDA3 is the slowest, with about a 10-fold difference in speed, and the slow GPUs eventually run out of memory.
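
A common cause of this kind of imbalance is a GPU mapping that assigns more worker processes to some devices than to others, so the crowded GPUs are both slower and the first to run out of memory. A minimal round-robin assignment sketch, assuming PyTorch and mpi4py (a generic pattern, not FedML's own device-assignment code):

```python
# Sketch only: spread MPI worker processes evenly across visible GPUs.
import torch
from mpi4py import MPI


def pick_device() -> torch.device:
    rank = MPI.COMM_WORLD.Get_rank()
    n_gpus = torch.cuda.device_count()
    if n_gpus == 0:
        return torch.device("cpu")
    # rank 0 -> cuda:0, rank 1 -> cuda:1, ..., wrapping around so that no
    # single GPU hosts a disproportionate share of the processes.
    return torch.device(f"cuda:{rank % n_gpus}")
```

If the script reads a GPU-mapping configuration, it is also worth checking that it matches the number of GPUs actually present on the machine.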

chaoyanghe commented 2 years ago

@rG223 @jackdoll @897123856413

please check our new examples at: https://github.com/FedML-AI/FedML/tree/master/python/examples

We've upgraded the library a lot in recent versions. Here is a brief introduction: https://medium.com/@FedML/fedml-ai-platform-releases-the-worlds-federated-learning-open-platform-on-public-cloud-with-an-8024e68a70b6