axolotl-ai-cloud / axolotl

ModuleNotFoundError: No module named 'mpi4py' using single GPU with deepspeed #1211

Open 7flash opened 9 months ago

7flash commented 9 months ago

Please check that this issue hasn't been reported before.

Expected Behavior

Training mixtral with axolotl

Current behaviour

Shows an error

  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 677, in mpi_discovery
    from mpi4py import MPI
ModuleNotFoundError: No module named 'mpi4py'
Traceback (most recent call last):
  File "/root/miniconda3/envs/py3.9/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/accelerate/commands/launch.py", line 1023, in launch_command
    simple_launcher(args)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/accelerate/commands/launch.py", line 643, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/root/miniconda3/envs/py3.9/bin/python3', '-m', 'axolotl.cli.train', 'axolotl/examples/mistral/mixtral.yml']' returned non-zero exit status 1.

If I try pip install mpi4py, it shows this error:

      /root/miniconda3/envs/py3.9/compiler_compat/ld: /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi.so: undefined reference to `opal_list_t_class'
      collect2: error: ld returned 1 exit status
      failure.
      removing: _configtest.c _configtest.o
      error: Cannot link MPI programs. Check your configuration!!!
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for mpi4py
Failed to build mpi4py
ERROR: Could not build wheels for mpi4py, which is required to install pyproject.toml-based projects

Steps to reproduce

  1. start a machine with the Dockerfile below

  2. run axolotl mixtral:

accelerate launch -m axolotl.cli.train axolotl/examples/mistral/mixtral.yml

Dockerfile:

FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04

WORKDIR /

RUN mkdir /workspace

SHELL ["/bin/bash", "-o", "pipefail", "-c"]
ENV DEBIAN_FRONTEND=noninteractive \
    SHELL=/bin/bash
RUN apt-get update --yes && \
    # - apt-get upgrade is run to patch known vulnerabilities in apt-get packages as
    #   the ubuntu base image is rebuilt too seldom sometimes (less than once a month)
    apt-get upgrade --yes && \
    apt install --yes --no-install-recommends \
    git \
    wget \
    curl \
    bash \
    software-properties-common \
    openssh-server
RUN add-apt-repository ppa:deadsnakes/ppa
RUN apt install python3.10 -y --no-install-recommends && \
    ln -s /usr/bin/python3.10 /usr/bin/python && \
    rm /usr/bin/python3 && \
    ln -s /usr/bin/python3.10 /usr/bin/python3 && \
    apt-get clean && rm -rf /var/lib/apt/lists/* && \
    echo "en_US.UTF-8 UTF-8" > /etc/locale.gen
RUN curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
RUN python get-pip.py
RUN pip install --no-cache-dir --pre torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --extra-index-url https://download.pytorch.org/whl/nightly/cu118
RUN pip install --no-cache-dir -U jupyterlab ipywidgets jupyter-archive
# RUN jupyter nbextension enable --py widgetsnbextension
RUN jupyter labextension disable "@jupyterlab/apputils-extension:announcements"

ADD start.sh /

RUN chmod +x /start.sh

CMD [ "/start.sh" ]

Config yaml

No response

Possible solution

No response

Which Operating Systems are you using?

Python Version

Python 3.9.16

axolotl branch-commit

main

Acknowledgements

winglian commented 9 months ago

How many GPUs are you training with when you get this error?

7flash commented 9 months ago

I only have one GPU on my RunPod machine, an A100 80GB.

winglian commented 9 months ago

Are you trying to use deepspeed? If you are using deepspeed on a single-GPU machine, it will raise this error.

7flash commented 9 months ago

I guess yes, that must be the issue, since deepspeed is present in the example configuration, and I was not even aware of what deepspeed is.
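
For anyone else hitting this on a single GPU, here is a minimal sketch of finding and disabling that deepspeed reference. It assumes the example enables deepspeed via a top-level deepspeed: key pointing at a JSON file; the exact key and path may differ on your checkout.

# show where the Mixtral example config references deepspeed
grep -n "deepspeed" axolotl/examples/mistral/mixtral.yml

# for a single-GPU run without deepspeed, comment that line out, e.g.
#   deepspeed: deepspeed_configs/zero1.json  ->  # deepspeed: deepspeed_configs/zero1.json
# then launch as before
accelerate launch -m axolotl.cli.train axolotl/examples/mistral/mixtral.yml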

JohanWork commented 9 months ago

EDIT: I solved this issue by simply running pip install mpi4py. Is there any downside to adding it to the requirements file? (Didn't read the error message properly.)

winglian commented 9 months ago

I wouldn't add mpi4py to the requirements, as it doesn't cleanly install on a lot of configurations.

d42me commented 9 months ago

Is there any solution for this?

d42me commented 9 months ago

Found a solution:

conda install gcc_linux-64 gxx_linux-64
conda install -c conda-forge mpich

And then: pip install mpi4py

Worked for me on RunPod using winglian/axolotl-cloud:main-latest.

winglian commented 9 months ago

Thanks @d42me. I'll see if I can integrate that into the Docker image, unless you have time to submit a PR.
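
If it helps, a rough sketch of what that build step might look like, assuming conda is available in the base image (not the actual Dockerfile change):

# install a compiler toolchain and an MPI implementation into the conda env,
# then build mpi4py against them
conda install -y gcc_linux-64 gxx_linux-64
conda install -y -c conda-forge mpich
pip install --no-cache-dir mpi4py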

d42me commented 9 months ago

I can do it by the weekend 👍

qiuosier commented 8 months ago

For me, adding --use_deepspeed to the accelerate command avoids this error on a single GPU.

accelerate launch --use_deepspeed -m axolotl.cli.train ...

MPI discovery is not used by default (so there is no need to install mpi4py), but deepspeed will try it when environment variables like RANK, LOCAL_RANK, WORLD_SIZE, etc. are not configured.

Looks like in the accelerate launch CLI here, if --use_deepspeed is not specified, it will use simple_launcher instead of deepspeed_launcher. The environment variables are configured only in deepspeed_launcher.
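
So on a single GPU the working invocation is just the original command plus the flag. An untested sketch, following the same reasoning: exporting the distributed environment variables yourself should also keep deepspeed from falling back to MPI discovery (the exact set of required variables may vary):

# accelerate's deepspeed launcher sets RANK/LOCAL_RANK/WORLD_SIZE etc. for you
accelerate launch --use_deepspeed -m axolotl.cli.train axolotl/examples/mistral/mixtral.yml

# untested alternative: set the variables by hand so deepspeed skips its
# mpi4py-based discovery even with the simple launcher
export RANK=0 LOCAL_RANK=0 WORLD_SIZE=1 MASTER_ADDR=127.0.0.1 MASTER_PORT=29500
accelerate launch -m axolotl.cli.train axolotl/examples/mistral/mixtral.yml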

monk1337 commented 8 months ago

@qiuosier's suggestion above worked for me. I was using --deepspeed deepspeed_configs/zero1.json; changing to --use_deepspeed worked.