FedML-AI / FedML

FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI jobs on any GPU cloud or on-premise cluster. Built on this library, TensorOpera AI (https://TensorOpera.ai) is your generative AI platform at scale.
Apache License 2.0

fedml_experiments/distributed/fedavg run_fedavg_distributed_pytorch.sh stuck in one computer with one GPU #166

Closed jackdoll closed 1 year ago

jackdoll commented 2 years ago

I have only one computer with one GPU, and I want to run three processes on that GPU to simulate one server and two clients. I set gpu_mapping.yaml to "mapping_default: ChaoyangHe-GPU-RTX1080Ti: [3]" and ran "sh run_fedavg_distributed_pytorch.sh 2 2 resnet56 homo 1 1 64 0.001 cifar10 "./../../../data/cifar10" sgd 1". It always fails with the error "mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated." How can I solve this problem? Or can you tell me how to run distributed FL when I only have one computer with one GPU?

KOUDA-AMINE commented 2 years ago

Did you change 'ChaoyangHe-GPU-RTX1080Ti' to your own machine's hostname?
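
For reference, a minimal sketch of generating that entry with the local hostname filled in. It assumes the two-level layout quoted in the question (mapping key, then hostname, then a list) and assumes each list element is the number of processes placed on one GPU. The helper names `render_gpu_mapping` and `total_processes` are made up for illustration and are not part of FedML.

```python
import socket


def render_gpu_mapping(hostname, processes_per_gpu, key="mapping_default"):
    """Return a gpu_mapping.yaml entry shaped like the one quoted in the question."""
    counts = ", ".join(str(n) for n in processes_per_gpu)
    return f"{key}:\n    {hostname}: [{counts}]\n"


def total_processes(processes_per_gpu):
    """Sum the per-GPU process counts (assumed meaning of the list)."""
    return sum(processes_per_gpu)


if __name__ == "__main__":
    # One GPU hosting 1 server + 2 clients, as described in the question.
    per_gpu = [3]
    assert total_processes(per_gpu) == 1 + 2
    # socket.gethostname() returns the name this machine reports for itself.
    print(render_gpu_mapping(socket.gethostname(), per_gpu), end="")
```

Running it prints an entry you can paste into gpu_mapping.yaml. Whether that hostname string is the one the launcher compares against is worth checking on your own machine.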

chaoyanghe commented 2 years ago

@jackdoll Is this issue solved in the latest examples? https://github.com/FedML-AI/FedML/tree/master/python/examples

fedml-dimitris commented 1 year ago

Closing due to inactivity.