[fednlp]when run "bash run_simulation.sh 5" , error "MPI_ABORT was invoked on......."

stayt1 commented 2 years ago

Hi Chaoyang! @chaoyanghe @yuchenlin I try to reproduce some results in FedNLP: Benchmarking Federated Learning Methods for Natural Language Processing Tasks. I followed this instruction step by step. I met some troubles： Backend: when using default hyperparameters in text_classificationconfig/fedml_config_mpi.yaml The data of each client is learned and the loss can be reduced to low. But at the end of the aggregation, there will be and an fatal error：

backend is MPI, it will output:

--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
with errorcode 0.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

as the same with https://github.com/FedML-AI/FedNLP/issues/35 This will lead to kill the MPI and can not save the model.

backend is sp, it will output:

   [FedML-Server(0) @device-id-0] [Sat, 01 Oct 2022 18:30:34] [ERROR] [mlops_runtime_log.py:34:handle_exception] Uncaught exception
   Traceback (most recent call last):
   File "torch_main.py", line 57, in <module>
    fedml_runner.run()
   File "/home/tao/anaconda3/envs/fednlp/lib/python3.8/site-packages/fedml/runner.py", line 123, in run
    self.runner.run()
   File "/home/tao/anaconda3/envs/fednlp/lib/python3.8/site-packages/fedml/simulation/simulator.py", line 67, in run
    self.fl_trainer.train()
   File "/home/tao/anaconda3/envs/fednlp/lib/python3.8/site-packages/fedml/simulation/sp/fedopt/fedopt_api.py", line 115, in train
    w = client.train(w_global)
   File "/home/tao/anaconda3/envs/fednlp/lib/python3.8/site-packages/fedml/simulation/sp/fedopt/client.py", line 29, in train
    self.model_trainer.train(self.local_training_data, self.device, self.args)
   File "/home/tao/anaconda3/envs/fednlp/lib/python3.8/site-packages/fedml/ml/trainer/my_model_trainer_classification.py", line 39, in train
    for batch_idx, (x, labels) in enumerate(train_data):
   ValueError: too many values to unpack (expected 2)
      -------------------------------------------------------
      Primary job  terminated normally, but 1 process returned
      a non-zero exit code.. Per user-direction, the job has been aborted.
      -------------------------------------------------------
      --------------------------------------------------------------------------
      mpirun detected that one or more processes exited with non-zero status, thus causing
      the job to be terminated. The first process to do so was:

  Process name: [[18613,1],5]
  Exit code:    1

backend is NCCL, it will output:

ModuleNotFoundError: No module named 'fedml.simulation.nccl'
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

 Process name: [[16906,1],0]
 Exit code:    1
--------------------------------------------------------------------------

I try to git clone the previous fednlp repository, and then git rolling back the fednlp and fedml packages to the two months ago version, which still has the problem that the model trained by MPI_ABORT cannot be [saved],(https://github.com/FedML-AI/FedNLP/issues/35) which is also the same error.

I would appreciate your help. Thanks.

iseesaw commented 2 years ago

Hello, do you solve this problem? I meet it as well :(

stayt1 commented 2 years ago

Hello, do you solve this problem? I meet it as well :(

not yet. I shoot a mail to you.

sylee0124 commented 2 years ago

This is due to MPI.COMM_WORLD.Abort() in FedMLCommManager.finish(). change it to self.com_manager.stop_receive_message() like other backends. This will require you to manually kill mpi processes or you can use abort for server process to gracefully end the mpi processes. To save the parameters place torch.save before self.finish() in func handle_message_receive_model_from_client(self, msg_params) of FedOptServerManager.

chaoyanghe commented 2 years ago

@stayt1 Hi for FedNLP project, we only support distributed computing with MPI and real cross-silo training. We will retire SP and NCCL soon.

You can save the checking point at test_all() in python/app/fednlp/text_classification/trainer/classification_aggregator.py. For the MPI.COMM_WORLD.Abort() issue. You can temporarily address it as @sylee0124 suggested. In our latest version, we will address it with a handshaking protocol.

FedML-AI / FedML

[fednlp]when run "bash run_simulation.sh 5" , error "MPI_ABORT was invoked on......." #594