FedML-AI / FedML

FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI jobs on any GPU cloud or on-premise cluster. Built on this library, TensorOpera AI (https://TensorOpera.ai) is your generative AI platform at scale.
Apache License 2.0
4.2k stars, 788 forks

Same error happened when running FEDAVG under STANDALONE after just cloning the repository without any modification! #34

Closed: iuserea closed this issue 4 years ago

iuserea commented 4 years ago

When I run the commands below, which are included in readme.md:

nohup sh run_fedavg_standalone_pytorch.sh 2 10 64 cifar10 ./../../../data/cifar10 resnet56 homo 200 20 0.001 > ./fedavg_standalone.txt 2>&1
nohup sh run_fedavg_standalone_pytorch.sh 2 10 10 mnist ./../../../data/mnist lr hetero 200 20 0.03 > ./fedavg_standalone.txt 2>&1 &

The same error occurs:

Traceback (most recent call last):
  File "./main_fedavg.py", line 160, in <module>
    trainer = FedAvgTrainer(dataset, model, device, args)
  File "/home/xx/proj/Source/FedML/fedml_api/standalone/fedavg/fedavg_trainer.py", line 22, in __init__
    self.model_global.train()
AttributeError: 'NoneType' object has no attribute 'train'

My understanding of PyTorch is limited. Can anyone point out where the code goes wrong?

chaoyanghe commented 4 years ago

@iuserea Hi, I think your local code is an older version. Please update to the latest version. I just double-checked, and it works now.

chaoyanghe commented 4 years ago

@iuserea check the latest code please.

iuserea commented 4 years ago

There is still an error when running in the standalone environment with the following command:

nohup sh run_fedavg_standalone_pytorch.sh 2 10 64 cifar10 ./../../../data/cifar10 resnet56 homo 200 20 0.001 > ./fedavg_standalone.txt 2>&1 &

[image attached in the original issue]

chaoyanghe commented 4 years ago

@iuserea

We suggest using distributed computing when training large DNNs like ResNet, since the standalone version is very time-consuming. That is why we removed the model initialization in create_model() (main_fedavg.py) in the previous version.

I have now added the large DNN models back for standalone. You can choose any of them, if you can accept a very long training time.
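For readers who hit the same traceback, here is a minimal sketch of how this kind of failure can arise. It assumes create_model() picks a model through a chain of name checks with no final fallback; the actual FedML code is not shown in this thread, so TinyModel, the "toy_model" name, and both factory functions below are hypothetical stand-ins rather than the library's implementation.

```python
class TinyModel:
    """Hypothetical stand-in for a PyTorch model; only train() matters here."""

    def __init__(self, name):
        self.name = name
        self.training = False

    def train(self):
        # Mimics switching a model into training mode.
        self.training = True
        return self


def create_model_fallthrough(model_name):
    """Hypothetical factory: returns None implicitly for unknown names."""
    if model_name == "toy_model":
        return TinyModel("toy_model")
    # No branch matches "resnet56", so the function ends and returns None.


def create_model_checked(model_name):
    """Same idea, but fails early with a clear message instead of None."""
    supported = {"toy_model": TinyModel, "resnet56": TinyModel}
    if model_name not in supported:
        raise ValueError(f"unsupported model: {model_name!r}")
    return supported[model_name](model_name)


# The fall-through factory silently hands back None ...
model = create_model_fallthrough("resnet56")
try:
    model.train()  # ... and the failure only surfaces here.
except AttributeError as err:
    fallthrough_error = str(err)

# The checked factory either returns a usable model or raises immediately.
checked = create_model_checked("resnet56").train()
```

With the checked variant, an unsupported model name is reported where the model is created rather than later inside the trainer, which makes a removed or missing model branch easier to spot.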

iuserea commented 4 years ago

@chaoyanghe Thank you for the kind explanation. I'll give it at least one try.