FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI jobs on any GPU cloud or on-premise cluster. Built on this library, TensorOpera AI (https://TensorOpera.ai) is your generative AI platform at scale.
I have only one computer with one GPU, and I want to run three processes on that GPU to simulate one server and two clients. I set gpu_mapping.yaml to "mapping_default: ChaoyangHe-GPU-RTX1080Ti: [3]" and run "sh run_fedavg_distributed_pytorch.sh 2 2 resnet56 homo 1 1 64 0.001 cifar10 "./../../../data/cifar10" sgd 1". However, it always fails with the error "mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated." How can I solve this problem? Alternatively, can you tell me how to run distributed FL if I only have one computer with one GPU?
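For reference, here is a minimal sketch of the kind of gpu_mapping.yaml entry being described, assuming FedML's convention that each mapping key maps hostnames to a list of per-GPU process counts, and that the hostname key must match the machine's actual hostname; the hostname "my-machine" below is a placeholder, not a value from the original question.

```yaml
# Hypothetical gpu_mapping.yaml entry for 1 machine with 1 GPU running
# 3 MPI processes (1 server + 2 clients).
# The hostname key must match the output of `hostname` on your machine;
# "my-machine" is a placeholder.
mapping_default:
    my-machine: [3]   # assign all 3 processes to GPU 0
```

With 2 workers plus 1 server, mpirun launches 3 processes in total, so the per-GPU counts in the mapping should sum to 3.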