FedML-AI / FedML

FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI jobs on any GPU cloud or on-premise cluster. Built on this library, TensorOpera AI (https://TensorOpera.ai) is your generative AI platform at scale.
Apache License 2.0

Training loss becomes NaN when running the Shakespeare experiment in a distributed environment #60

Closed. iuserea closed this issue 4 years ago

iuserea commented 4 years ago

Code: I just inserted one line of code in the file FedML/fedml_experiments/distributed/fedavg/main_fedavg.py, as shown in Fig. 1 below; everything else is based on the latest origin/master. [screenshot]

Cmd: sh run_fedavg_distributed_pytorch.sh 10 10 1 8 rnn hetero 100 10 10 0.8 shakespeare "./../../../data/shakespeare" 0

Result: [screenshot]

wandb: [screenshot]

Question: From the wandb screenshot we can also see that train/acc and test/acc are not very high. Is this normal? Do you have any advice on how to resolve it? Thanks for your attention.
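For reference, a NaN training loss with an RNN on character-level data like Shakespeare usually points to exploding gradients or an overly large learning rate. A minimal sketch of a guard that could be added to a local training loop (the function and names here are hypothetical, not FedML's actual trainer API):

```python
import torch
import torch.nn as nn

def train_one_epoch(model, loader, optimizer, device, clip_norm=5.0):
    """Hypothetical local training loop, not FedML's actual trainer code."""
    criterion = nn.CrossEntropyLoss()
    model.train()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        # Stop early if the loss has already diverged.
        if torch.isnan(loss) or torch.isinf(loss):
            raise RuntimeError("loss is NaN/Inf -- try a smaller learning rate")
        loss.backward()
        # Clip gradients: character-level RNNs are prone to exploding
        # gradients, which is a common cause of NaN losses.
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
        optimizer.step()
```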

chaoyanghe commented 4 years ago

@iuserea It seems your non-IID version has lower accuracy. In this case, you need to tune your hyperparameters. You can use grid search to find a better learning rate, batch size, number of local epochs, and number of rounds; a sketch is shown below. Normally, IID and non-IID settings need quite different hyperparameters.
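A minimal sketch of how such a grid search could be scripted (the search space and the run_fedavg() stub below are hypothetical placeholders, not part of FedML):

```python
from itertools import product
import random

def run_fedavg(lr, batch_size, local_epochs, rounds):
    """Placeholder for launching one distributed FedAvg run and reading back
    the final test accuracy; it returns a random number so the sketch runs."""
    return random.random()

# Hypothetical search space; sensible ranges depend on the dataset and model.
grid = product([0.03, 0.1, 0.3, 0.8],  # learning rate
               [10, 20, 50],           # batch size
               [1, 5, 10],             # local epochs
               [100, 200])             # communication rounds

best_acc, best_cfg = -1.0, None
for lr, bs, ep, rounds in grid:
    acc = run_fedavg(lr, bs, ep, rounds)
    if acc > best_acc:
        best_acc, best_cfg = acc, (lr, bs, ep, rounds)

print("best accuracy", best_acc,
      "with (lr, batch_size, local_epochs, rounds) =", best_cfg)
```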

chaoyanghe commented 4 years ago

@iuserea have you resolved your issue?

iuserea commented 4 years ago

@chaoyanghe I can't thank you enough! After trying grid search and random search, the accuracy reached 40%.

chaoyanghe commented 4 years ago

@iuserea Sounds good. Can you reach the results reported in this paper: https://arxiv.org/pdf/2003.00295.pdf ?

iuserea commented 4 years ago

@chaoyanghe It should be possible to reach the results from that paper; I've read it before. But the accuracy reported there is about 57%, which is higher than my 40%.