axolotl-ai-cloud / axolotl

Go ahead and axolotl questions
https://axolotl-ai-cloud.github.io/axolotl/
Apache License 2.0

ModuleNotFoundError: No module named 'mpi4py' using single GPU with deepspeed #1211

Open 7flash opened 10 months ago

7flash commented 10 months ago

Please check that this issue hasn't been reported before.

Expected Behavior

Training mixtral with axolotl

Current behaviour

Training fails with the following error:

  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 677, in mpi_discovery
    from mpi4py import MPI
ModuleNotFoundError: No module named 'mpi4py'
Traceback (most recent call last):
  File "/root/miniconda3/envs/py3.9/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/accelerate/commands/launch.py", line 1023, in launch_command
    simple_launcher(args)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/accelerate/commands/launch.py", line 643, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/root/miniconda3/envs/py3.9/bin/python3', '-m', 'axolotl.cli.train', 'axolotl/examples/mistral/mixtral.yml']' returned non-zero exit status 1.

If I try pip install mpi4py, it fails with this error:

      /root/miniconda3/envs/py3.9/compiler_compat/ld: /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi.so: undefined reference to `opal_list_t_class'
      collect2: error: ld returned 1 exit status
      failure.
      removing: _configtest.c _configtest.o
      error: Cannot link MPI programs. Check your configuration!!!
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for mpi4py
Failed to build mpi4py
ERROR: Could not build wheels for mpi4py, which is required to install pyproject.toml-based projects

Steps to reproduce

  1. start a machine built from the Dockerfile below

  2. run axolotl mixtral

accelerate launch -m axolotl.cli.train axolotl/examples/mistral/mixtral.yml

Dockerfile:

FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04

WORKDIR /

RUN mkdir /workspace

SHELL ["/bin/bash", "-o", "pipefail", "-c"]
ENV DEBIAN_FRONTEND=noninteractive \
    SHELL=/bin/bash
RUN apt-get update --yes && \
    # - apt-get upgrade is run to patch known vulnerabilities in apt-get packages as
    #   the ubuntu base image is rebuilt too seldom sometimes (less than once a month)
    apt-get upgrade --yes && \
    apt install --yes --no-install-recommends \
    git \
    wget \
    curl \
    bash \
    software-properties-common \
    openssh-server
RUN add-apt-repository ppa:deadsnakes/ppa
RUN apt install python3.10 -y --no-install-recommends && \
    ln -s /usr/bin/python3.10 /usr/bin/python && \
    rm /usr/bin/python3 && \
    ln -s /usr/bin/python3.10 /usr/bin/python3 && \
    apt-get clean && rm -rf /var/lib/apt/lists/* && \
    echo "en_US.UTF-8 UTF-8" > /etc/locale.gen
RUN curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
RUN python get-pip.py
RUN pip install --no-cache-dir --pre torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --extra-index-url https://download.pytorch.org/whl/nightly/cu118
RUN pip install --no-cache-dir -U jupyterlab ipywidgets jupyter-archive
# RUN jupyter nbextension enable --py widgetsnbextension
RUN jupyter labextension disable "@jupyterlab/apputils-extension:announcements"

ADD start.sh /

RUN chmod +x /start.sh

CMD [ "/start.sh" ]

Config yaml

No response

Possible solution

No response

Which Operating Systems are you using?

Python Version

Python 3.9.16

axolotl branch-commit

main

Acknowledgements

winglian commented 10 months ago

How many GPUs are you training with when you get this error?

7flash commented 10 months ago

I only have one GPU on my RunPod machine (A100 80GB).

winglian commented 10 months ago

Are you trying to use DeepSpeed? If you are using DeepSpeed on a single-GPU machine, it will raise this error.

7flash commented 10 months ago

I guess that must be the issue, since deepspeed is present in the example configuration, and I wasn't even aware of what deepspeed is.
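For reference, the DeepSpeed hookup in the example configs is a single key; a sketch of disabling it for a single-GPU run (the `deepspeed:` key is from axolotl's config schema, but the JSON path shown here is illustrative):

```yaml
# Commenting out the deepspeed key keeps a single-GPU run from going
# through DeepSpeed's distributed init (and its MPI discovery fallback).
# deepspeed: deepspeed_configs/zero2.json
```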

JohanWork commented 10 months ago

EDIT: I solved this issue by simply running pip install mpi4py. Is there any downside to adding it to the requirements file? (Didn't read the error message properly.)

winglian commented 10 months ago

I wouldn't add mpi4py to the requirements, as it doesn't install cleanly on a lot of configurations.

d42me commented 10 months ago

Is there any solution for this?

d42me commented 10 months ago

Found a solution:

conda install gcc_linux-64 gxx_linux-64
conda install -c conda-forge mpich

And then: pip install mpi4py

Worked for me on Runpod using winglian/axolotl-cloud:main-latest

winglian commented 10 months ago

Thanks @d42me. I'll see if I can integrate that into the docker image unless you have time to submit a pr
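One possible way to fold that fix into the image (a sketch only; package names assume the Ubuntu 22.04 base from the Dockerfile above, using apt's OpenMPI toolchain rather than the conda one from the previous comment):

```dockerfile
# Install the MPI toolchain so mpi4py can compile and link, then build
# the wheel. libopenmpi-dev provides mpicc and the headers the failed
# build above was missing.
RUN apt-get update && \
    apt-get install --yes --no-install-recommends libopenmpi-dev openmpi-bin && \
    rm -rf /var/lib/apt/lists/* && \
    pip install --no-cache-dir mpi4py
```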

d42me commented 10 months ago

I can do it until the weekend 👍

qiuosier commented 9 months ago

For me, adding --use_deepspeed to the accelerate command avoids this error on single GPU.

accelerate launch --use_deepspeed -m axolotl.cli.train ...

The MPI discovery is not used by default (so no need to install mpi4py) but deepspeed will try it when the environment variables like RANK, LOCAL_RANK, WORLD_SIZE, etc. are not configured.

Looks like in the accelerate launch CLI here, if --use_deepspeed is not specified, it will use simple_launcher instead of deepspeed_launcher. The environment variables are configured only in deepspeed_launcher.
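The fallback described above can be sketched as follows (a hypothetical simplification, not DeepSpeed's actual code; the variable names mirror the ones listed in the comment):

```python
import os

# Rank-related variables that deepspeed_launcher exports but
# simple_launcher does not (hypothetical simplification of the
# behavior described above).
RANK_VARS = ("RANK", "LOCAL_RANK", "WORLD_SIZE")

def needs_mpi_discovery(env=None):
    """True when the rank variables are missing, i.e. when DeepSpeed
    would fall back to MPI discovery and try to import mpi4py."""
    env = os.environ if env is None else env
    return not all(v in env for v in RANK_VARS)

# simple_launcher: variables unset -> MPI discovery -> ModuleNotFoundError
# if mpi4py is not installed.
print(needs_mpi_discovery({}))
# deepspeed_launcher: variables exported -> no mpi4py needed.
print(needs_mpi_discovery({"RANK": "0", "LOCAL_RANK": "0", "WORLD_SIZE": "1"}))
```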

monk1337 commented 9 months ago

> For me, adding --use_deepspeed to the accelerate command avoids this error on single GPU.
>
> accelerate launch --use_deepspeed -m axolotl.cli.train ...
>
> The MPI discovery is not used by default (so no need to install mpi4py) but deepspeed will try it when the environment variables like RANK, LOCAL_RANK, WORLD_SIZE, etc. are not configured.
>
> Looks like in the accelerate launch CLI here, if --use_deepspeed is not specified, it will use simple_launcher instead of deepspeed_launcher. The environment variables are configured only in deepspeed_launcher.

Worked for me. I was using --deepspeed deepspeed_configs/zero1.json; changing to --use_deepspeed worked.