stayt1 opened this issue 2 years ago (Open)
Hello, did you solve this problem? I am running into it as well :(
Not yet. I have sent you an email.
This is due to `MPI.COMM_WORLD.Abort()` in `FedMLCommManager.finish()`. Change it to `self.com_manager.stop_receive_message()` like the other backends. This will require you to kill the MPI processes manually, or you can keep `Abort()` for the server process only to end the MPI processes.
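A minimal sketch of what that change could look like inside `FedMLCommManager.finish()`; the surrounding method body is assumed from typical FedML releases and may not match your installed copy exactly:

```python
# Sketch only: FedMLCommManager.finish() differs a bit between FedML releases.
def finish(self):
    if self.backend == "MPI":
        # Original behaviour: hard-abort every MPI rank at once, which kills the
        # worker processes before any checkpoint can be written.
        #   from mpi4py import MPI
        #   MPI.COMM_WORLD.Abort()

        # Suggested behaviour: stop the receive loop, as the other backends do,
        # and let each rank return from its run loop on its own. You may then
        # have to clean up lingering mpi processes by hand.
        self.com_manager.stop_receive_message()
    else:
        self.com_manager.stop_receive_message()
```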
To save the parameters, place `torch.save` before `self.finish()` in `handle_message_receive_model_from_client(self, msg_params)` of `FedOptServerManager`.
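Roughly like this; the attribute names (`self.aggregator`, `get_global_model_params`, `self.round_num`) are assumptions based on typical FedML server managers, so adjust them to whatever your `FedOptServerManager` actually exposes:

```python
import torch

# Inside FedOptServerManager (sketch; the surrounding logic is paraphrased).
def handle_message_receive_model_from_client(self, msg_params):
    # ... existing receive / aggregation logic ...
    if self.args.round_idx == self.round_num:
        # Persist the aggregated global model BEFORE self.finish(); with the MPI
        # backend, finish() can abort all ranks and lose anything unsaved.
        torch.save(
            self.aggregator.get_global_model_params(),  # accessor name assumed
            f"fedopt_global_model_round_{self.args.round_idx}.pt",
        )
        self.finish()
        return
    # ... otherwise start the next round as before ...
```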
@stayt1 Hi, for the FedNLP project we only support distributed computing with MPI and real cross-silo training. We will retire SP and NCCL soon.
You can save the checkpoint in `test_all()` in python/app/fednlp/text_classification/trainer/classification_aggregator.py. As for the `MPI.COMM_WORLD.Abort()` issue, you can temporarily address it as @sylee0124 suggested. In our latest version we will address it with a handshaking protocol.
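A sketch of such a checkpoint inside `test_all()`; the method signature, the `get_model_params()` accessor, and the `output_dir` argument are assumptions based on typical FedML aggregators, and the output path is only an example:

```python
import os
import torch

def test_all(self, train_data_local_dict, test_data_local_dict, device, args=None):
    # ... existing global evaluation over the clients ...

    # Write the current global model to disk on every server-side evaluation,
    # so a later MPI abort cannot throw away the trained weights.
    ckpt_dir = getattr(args, "output_dir", ".")   # output_dir is an assumed arg
    os.makedirs(ckpt_dir, exist_ok=True)
    torch.save(self.get_model_params(),           # accessor name assumed
               os.path.join(ckpt_dir, "global_model_latest.pt"))
```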
Hi Chaoyang! @chaoyanghe @yuchenlin I am trying to reproduce some results from FedNLP: Benchmarking Federated Learning Methods for Natural Language Processing Tasks. I followed the instructions step by step, but I ran into trouble with the backend when using the default hyperparameters in text_classification/config/fedml_config_mpi.yaml. Each client's data is learned and the loss drops to a low value, but at the end of aggregation there is a fatal error:

- When the backend is MPI, it outputs the same error as https://github.com/FedML-AI/FedNLP/issues/35, which kills MPI and means the model cannot be saved.
- When the backend is sp, it outputs:
- When the backend is NCCL, it outputs:

I also tried git cloning the previous FedNLP repository and rolling the fednlp and fedml packages back to the versions from two months ago, but the model trained there still cannot be [saved](https://github.com/FedML-AI/FedNLP/issues/35) because of MPI_ABORT, which is the same error.
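For reference, the backend switch mentioned above sits in the communication section of the config. This is only a rough sketch based on a generic FedML fedml_config.yaml; the exact fields in text_classification/config/fedml_config_mpi.yaml may differ:

```yaml
# Rough sketch; field names taken from a typical FedML config and may not
# match fedml_config_mpi.yaml exactly.
train_args:
  federated_optimizer: "FedOpt"   # optimizer assumed for this experiment

comm_args:
  backend: "MPI"                  # the value switched between MPI / sp / NCCL above
```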
I would appreciate your help. Thanks.