FedML-AI / FedML

FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI jobs on any GPU cloud or on-premise cluster. Built on this library, TensorOpera AI (https://TensorOpera.ai) is your generative AI platform at scale.

[fednlp] when running "bash run_simulation.sh 5", error "MPI_ABORT was invoked on......." #594

Open stayt1 opened 2 years ago

stayt1 commented 2 years ago

Hi Chaoyang! @chaoyanghe @yuchenlin I am trying to reproduce some results from FedNLP: Benchmarking Federated Learning Methods for Natural Language Processing Tasks. I followed the instructions step by step but ran into trouble with the backend: using the default hyperparameters in text_classification/config/fedml_config_mpi.yaml, each client trains on its data and the loss drops as expected, but at the end of aggregation there is a fatal error:

   [FedML-Server(0) @device-id-0] [Sat, 01 Oct 2022 18:30:34] [ERROR] [mlops_runtime_log.py:34:handle_exception] Uncaught exception
   Traceback (most recent call last):
   File "torch_main.py", line 57, in <module>
    fedml_runner.run()
   File "/home/tao/anaconda3/envs/fednlp/lib/python3.8/site-packages/fedml/runner.py", line 123, in run
    self.runner.run()
   File "/home/tao/anaconda3/envs/fednlp/lib/python3.8/site-packages/fedml/simulation/simulator.py", line 67, in run
    self.fl_trainer.train()
   File "/home/tao/anaconda3/envs/fednlp/lib/python3.8/site-packages/fedml/simulation/sp/fedopt/fedopt_api.py", line 115, in train
    w = client.train(w_global)
   File "/home/tao/anaconda3/envs/fednlp/lib/python3.8/site-packages/fedml/simulation/sp/fedopt/client.py", line 29, in train
    self.model_trainer.train(self.local_training_data, self.device, self.args)
   File "/home/tao/anaconda3/envs/fednlp/lib/python3.8/site-packages/fedml/ml/trainer/my_model_trainer_classification.py", line 39, in train
    for batch_idx, (x, labels) in enumerate(train_data):
   ValueError: too many values to unpack (expected 2)
      -------------------------------------------------------
      Primary job  terminated normally, but 1 process returned
      a non-zero exit code.. Per user-direction, the job has been aborted.
      -------------------------------------------------------
      --------------------------------------------------------------------------
      mpirun detected that one or more processes exited with non-zero status, thus causing
      the job to be terminated. The first process to do so was:

  Process name: [[18613,1],5]
  Exit code:    1

I also tried git-cloning the previous FedNLP repository and rolling the fednlp and fedml packages back to the versions from two months ago; that version still has the problem that the model trained under MPI cannot be [saved](https://github.com/FedML-AI/FedNLP/issues/35), and it fails with the same MPI_ABORT error.

I would appreciate your help. Thanks.

iseesaw commented 2 years ago

Hello, did you solve this problem? I'm running into it as well :(

stayt1 commented 2 years ago

> Hello, did you solve this problem? I'm running into it as well :(

Not yet. I've sent you an email.

sylee0124 commented 2 years ago

This is caused by the MPI.COMM_WORLD.Abort() call in FedMLCommManager.finish(). Change it to self.com_manager.stop_receive_message(), like the other backends. This means you will have to kill the MPI processes manually, or you can keep Abort() for the server process only so it can gracefully end the MPI job. To save the parameters, place torch.save before self.finish() in handle_message_receive_model_from_client(self, msg_params) of FedOptServerManager.
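
For reference, a rough sketch of the two changes described above. The surrounding class/method bodies are simplified placeholders, and names like `is_last_round`, `global_model_params`, and the output path are illustrative, not the actual identifiers in FedML:

```python
import torch

# --- inside FedMLCommManager (sketch) ---------------------------------------
def finish(self):
    # Old behavior: MPI.COMM_WORLD.Abort() kills every rank immediately,
    # which can cut off a torch.save that is still in progress.
    # New behavior: stop the receive loop and let the process exit on its own.
    self.com_manager.stop_receive_message()

# --- inside FedOptServerManager (sketch) ------------------------------------
def handle_message_receive_model_from_client(self, msg_params):
    ...  # existing code: collect the client update and aggregate

    if self.is_last_round:  # placeholder for the manager's own round check
        # Save the aggregated weights BEFORE self.finish(), so the model is
        # on disk before the communication layer shuts down.
        torch.save(global_model_params, "fedopt_global_model.pt")  # placeholder variable / path
        self.finish()
```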

chaoyanghe commented 2 years ago

@stayt1 Hi, for the FedNLP project we only support distributed computing with MPI and real cross-silo training. We will retire SP and NCCL soon.

You can save the checkpoint in test_all() in python/app/fednlp/text_classification/trainer/classification_aggregator.py. As for the MPI.COMM_WORLD.Abort() issue, you can temporarily work around it as @sylee0124 suggested. In our latest version we will address it with a handshaking protocol.
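
A minimal sketch of such a checkpoint save inside test_all(); the method signature, the `self.model` attribute, and the output path are placeholders to adapt to the actual classification_aggregator.py in your version:

```python
import os
import torch

# Inside the aggregator class in classification_aggregator.py (sketch only).
def test_all(self, train_data_local_dict, test_data_local_dict, device, args):
    ...  # existing evaluation over the clients

    # Persist the current global model here, before the MPI job is torn down.
    save_dir = getattr(args, "output_dir", ".")  # placeholder: reuse whatever output dir the config provides
    os.makedirs(save_dir, exist_ok=True)
    torch.save(self.model.state_dict(), os.path.join(save_dir, "global_model.pt"))
```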