FedML-AI / FedML


Running processes are suspended for some unknown reason. #69

Closed. iuserea closed this issue 4 years ago

iuserea commented 4 years ago

I run the FedGKT algorithm with the following command:

sh run_FedGKT.sh 8 cifar10 homo 10 20 1 Adam 0.001 1 0 resnet56 fedml_resnet56_homo_cifar10 "./../../../data/cifar10" 64

The processes are often suspended for some reason. I have obtained the result successfully only once.

One cause I have figured out is a connection error between the process and wandb. After solving the connection problem, there are still other potential causes. How can I track them down?
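To rule out the wandb connection as a cause of the hang, one option is to run wandb in offline mode so that no network access is needed during training. This is a minimal sketch, assuming the training script calls wandb.init after this code has run; the helper name use_offline_wandb is hypothetical and is not part of FedML.

```python
import os


def use_offline_wandb(env=os.environ):
    """Ask wandb to log locally instead of contacting its server.

    WANDB_MODE is read by wandb when wandb.init() is called, so this
    must run before the training script initializes wandb.
    """
    env["WANDB_MODE"] = "offline"
    return env


# Hypothetical placement: at the top of the training entry point.
use_offline_wandb()
```

Runs logged this way stay on disk and can be uploaded later with the wandb sync command once the connection works again.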

chaoyanghe commented 4 years ago

You can press Ctrl+C to interrupt the process; the resulting traceback shows where it was stuck.
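As an alternative to interrupting by hand, Python's standard faulthandler module can dump the traceback of every thread once a timeout expires, which is useful when a process hangs unattended. The sketch below is illustrative only: the helper name enable_hang_dump, the 0.2 second timeout, and the sleep that stands in for a blocked training loop are all assumptions, not FedML code.

```python
import faulthandler
import tempfile
import time


def enable_hang_dump(timeout_s, log_file):
    """Dump all thread tracebacks to log_file if still running after timeout_s."""
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=log_file)


# The log file must have a real file descriptor; a temporary file does.
log_file = tempfile.TemporaryFile(mode="w+")
enable_hang_dump(0.2, log_file)

time.sleep(0.5)  # stand-in for a training step that never returns

# Cancel the watchdog once the (simulated) work is done.
faulthandler.cancel_dump_traceback_later()

log_file.seek(0)
dump = log_file.read()
print(dump)
```

In a real run the timeout would be set well above the expected duration of one round, and each process would write to its own log file so that the stuck one can be identified.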

chaoyanghe commented 4 years ago

Have you figured out the problem?

hosytuyen commented 2 years ago

Hi @chaoyanghe @iuserea, I also ran into this problem after the first epochs. Did you solve it?

Thank you


hosytuyen commented 2 years ago

Thank you, I have fixed the problem.

The cause is in fedml_api/distributed/fedgkt/GKTServerTrainer.py at line 117, where self.args is written twice:

epochs_server = self.args.self.args.epochs_server --> epochs_server = self.args.epochs_server
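The effect of the duplicated attribute chain can be reproduced in isolation. The sketch below assumes the arguments object is an argparse.Namespace, as is typical for command-line scripts; the class ServerTrainerSketch and the value 5 are hypothetical stand-ins, not the actual FedML code. How the resulting exception leads to the other processes appearing suspended is not shown here; presumably they keep waiting for a server that has already failed.

```python
import argparse


class ServerTrainerSketch:
    """Hypothetical stand-in for the server trainer; only holds args."""

    def __init__(self, args):
        self.args = args

    def epochs_buggy(self):
        # Original line: looks up an attribute named "self" on args.
        return self.args.self.args.epochs_server

    def epochs_fixed(self):
        # Corrected line from the comment above.
        return self.args.epochs_server


args = argparse.Namespace(epochs_server=5)  # 5 is an arbitrary example value
trainer = ServerTrainerSketch(args)

try:
    trainer.epochs_buggy()
    buggy_error = None
except AttributeError as exc:
    buggy_error = str(exc)

fixed_value = trainer.epochs_fixed()
print(buggy_error)
print(fixed_value)
```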