FedML-AI / FedML

FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI jobs on any GPU cloud or on-premise cluster. Built on this library, TensorOpera AI (https://TensorOpera.ai) is your generative AI platform at scale.
Apache License 2.0

Unable to run MNIST experiments. run_fedavg_distributed_pytorch.sh #187

Open xlw686 opened 2 years ago

xlw686 commented 2 years ago

I am experimenting with the tutorial below

chaoyanghe commented 2 years ago

@xlw686 You set the worker number to 10. Could that be exhausting your local machine's memory?

xlw686 commented 2 years ago

I changed the worker number to 1 and then ran the command below:

 sh run_fedavg_distributed_pytorch.sh 1000 1 lr hetero 200 1 10 0.03 mnist "./../../../data/mnist" sgd 0

Below is an error message:

(fedml) root@VM-24-3-ubuntu:~/share/FedML/fedml_experiments/distributed/fedavg# sh run_fedavg_distributed_pytorch.sh 1000 1 lr hetero 200 1 10 0.03 mnist "./../../../data/mnist" sgd 0
2
/root/anaconda3/envs/fedml/lib/python3.7/importlib/_bootstrap.py:219: RuntimeWarning: greenlet.greenlet size changed, may indicate binary incompatibility. Expected 144 from C header, got 152 from PyObject
  return f(*args, **kwds)
/root/anaconda3/envs/fedml/lib/python3.7/importlib/_bootstrap.py:219: RuntimeWarning: greenlet.greenlet size changed, may indicate binary incompatibility. Expected 144 from C header, got 152 from PyObject
  return f(*args, **kwds)
/root/anaconda3/envs/fedml/lib/python3.7/importlib/_bootstrap.py:219: RuntimeWarning: greenlet.greenlet size changed, may indicate binary incompatibility. Expected 144 from C header, got 152 from PyObject
  return f(*args, **kwds)
/root/anaconda3/envs/fedml/lib/python3.7/importlib/_bootstrap.py:219: RuntimeWarning: greenlet.greenlet size changed, may indicate binary incompatibility. Expected 144 from C header, got 152 from PyObject
  return f(*args, **kwds)
/root/anaconda3/envs/fedml/lib/python3.7/importlib/_bootstrap.py:219: RuntimeWarning: greenlet.greenlet size changed, may indicate binary incompatibility. Expected 144 from C header, got 152 from PyObject
  return f(*args, **kwds)
/root/anaconda3/envs/fedml/lib/python3.7/importlib/_bootstrap.py:219: RuntimeWarning: greenlet.greenlet size changed, may indicate binary incompatibility. Expected 144 from C header, got 152 from PyObject
  return f(*args, **kwds)
/root/anaconda3/envs/fedml/lib/python3.7/importlib/_bootstrap.py:219: RuntimeWarning: greenlet.greenlet size changed, may indicate binary incompatibility. Expected 144 from C header, got 152 from PyObject
  return f(*args, **kwds)
/root/anaconda3/envs/fedml/lib/python3.7/importlib/_bootstrap.py:219: RuntimeWarning: greenlet.greenlet size changed, may indicate binary incompatibility. Expected 144 from C header, got 152 from PyObject
  return f(*args, **kwds)
Traceback (most recent call last):
  File "./main_fedavg.py", line 43, in <module>
    from fedml_api.distributed.fedavg.FedAvgAPI import FedML_init, FedML_FedAvg_distributed
  File "/root/share/FedML/fedml_api/distributed/fedavg/FedAvgAPI.py", line 1, in <module>
    from mpi4py import MPI
  File "/root/share/mpi4py.py", line 1, in <module>
    from mpi4py import MPI
ImportError: cannot import name 'MPI' from 'mpi4py' (/root/share/mpi4py.py)
Traceback (most recent call last):
  File "./main_fedavg.py", line 43, in <module>
    from fedml_api.distributed.fedavg.FedAvgAPI import FedML_init, FedML_FedAvg_distributed
  File "/root/share/FedML/fedml_api/distributed/fedavg/FedAvgAPI.py", line 1, in <module>
    from mpi4py import MPI
  File "/root/share/mpi4py.py", line 1, in <module>
    from mpi4py import MPI
ImportError: cannot import name 'MPI' from 'mpi4py' (/root/share/mpi4py.py)

xlw686 commented 2 years ago

The output says that MPI cannot be imported from mpi4py, but the same import works when I run it in an interactive Python session:

ImportError: cannot import name 'MPI' from 'mpi4py' (/root/share/mpi4py.py)
(fedml) root@VM-24-3-ubuntu:~/share/FedML/fedml_experiments/distributed/fedavg# python
Python 3.7.4 (default, Aug 13 2019, 20:35:49) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import mpi4py
>>> from mpi4py import MPI
>>> comm = MPI.COMM_WORLD
>>> process_id = comm.Get_rank()
>>> print(process_id)
0
>>> 

I don't know what went wrong😂
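
A note for readers hitting the same traceback: the error names /root/share/mpi4py.py as the module that was loaded, which suggests a stray file called mpi4py.py is being found ahead of the installed package when the script runs, while the interactive session resolves the real one. A minimal sketch for checking which file Python would load for a given module name is shown here. The helper name locate_module is hypothetical, and the check only reflects the script's situation if it runs with the same sys.path the script uses (for example, placed just before the failing import); how /root/share ends up on that path is not shown in this thread.

```python
import importlib.util


def locate_module(name):
    """Return the file path Python would load for a top-level module, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


if __name__ == "__main__":
    # A path inside site-packages means the installed package is found first.
    # A path such as /root/share/mpi4py.py means a stray file shadows it.
    print(locate_module("mpi4py"))
```

If the printed path points at a file outside site-packages, renaming or deleting that file is the usual remedy.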

chaoyanghe commented 2 years ago

The worker number should be at least 3.

xlw686 commented 2 years ago

I changed the worker number to 3, but the result is the same:

 sh run_fedavg_distributed_pytorch.sh 1000 3 lr hetero 200 1 10 0.03 mnist "./../../../data/mnist" sgd 0

Below is an error message:

Traceback (most recent call last):
  File "./main_fedavg.py", line 43, in <module>
    from fedml_api.distributed.fedavg.FedAvgAPI import FedML_init, FedML_FedAvg_distributed
  File "/root/share/FedML/fedml_api/distributed/fedavg/FedAvgAPI.py", line 1, in <module>
    from mpi4py import MPI
  File "/root/share/mpi4py.py", line 1, in <module>
    from mpi4py import MPI
ImportError: cannot import name 'MPI' from 'mpi4py' (/root/share/mpi4py.py)
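
The traceback stays the same regardless of the worker number, which points at the import itself and not at the launch parameters. Its last two frames show that line 1 of /root/share/mpi4py.py is `from mpi4py import MPI`, so the file appears to be importing from its own name. The sketch below reproduces that pattern in a temporary directory under that assumption. The module name shadow_demo_mod and the function reproduce_self_import are hypothetical and exist only for this illustration.

```python
import importlib
import os
import sys
import tempfile


def reproduce_self_import(module_name="shadow_demo_mod"):
    """Create a file that imports from its own name and return the ImportError text."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, module_name + ".py")
        with open(path, "w") as fh:
            # Same shape as line 1 of the stray file in the traceback.
            fh.write("from {0} import MPI\n".format(module_name))
        sys.path.insert(0, tmp)  # the temp directory now wins the module lookup
        try:
            importlib.invalidate_caches()
            importlib.import_module(module_name)
        except ImportError as exc:
            return str(exc)
        finally:
            sys.path.remove(tmp)
            sys.modules.pop(module_name, None)
    return None


if __name__ == "__main__":
    print(reproduce_self_import())
```

The exact wording of the message varies between Python versions, but it begins with "cannot import name 'MPI'" as in the log above.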


chaoyanghe commented 2 years ago

@xlw686 is this issue solved in the latest version?

fedml-dimitris commented 1 year ago

@xlw686 Can you run your example using the latest dev branch?