Open: 7flash opened this issue 9 months ago
How many GPUs are you training with when you get this error?
I only have one GPU in my machine on RunPod: an A100 80GB
Are you trying to use deepspeed? If you are using deepspeed on a single-GPU machine, it will raise this error.
I guess yes, that must be the issue, since deepspeed is present in the example configuration, and I was not even aware of what deepspeed is.
EDIT: I solved this issue by simply running pip install mpi4py. Is there any downside to adding it to the requirements file? I didn't read the error message properly.
I wouldn't add mpi4py to the requirements, as it doesn't install cleanly on a lot of configurations.
Is there any solution on this?
Found a solution:
conda install gcc_linux-64 gxx_linux-64
conda install -c conda-forge mpich
And then:
pip install mpi4py
Worked for me on Runpod using winglian/axolotl-cloud:main-latest
Thanks @d42me. I'll see if I can integrate that into the Docker image unless you have time to submit a PR.
I can do it by the weekend 👍
For me, adding --use_deepspeed to the accelerate command avoids this error on a single GPU:
accelerate launch --use_deepspeed -m axolotl.cli.train ...
MPI discovery is not used by default (so no need to install mpi4py), but deepspeed will try it when environment variables like RANK, LOCAL_RANK, WORLD_SIZE, etc. are not configured.
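Another option, if you prefer not to change the launch command, is to set those variables yourself before launching. A minimal sketch for a single-GPU run (the MASTER_ADDR/MASTER_PORT values are arbitrary assumptions for a local single-process run, not something axolotl prescribes):

```shell
# Single-process values so deepspeed finds the rendezvous env vars
# and never attempts MPI discovery (sketch; port is an arbitrary free port)
export RANK=0
export LOCAL_RANK=0
export WORLD_SIZE=1
export MASTER_ADDR=127.0.0.1
export MASTER_PORT=29500
```

Then launch with accelerate as usual.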
Looks like in the accelerate launch CLI here, if --use_deepspeed is not specified, it will use simple_launcher instead of deepspeed_launcher. The environment variables are configured only in deepspeed_launcher.
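In other words, the fallback described above can be sketched roughly like this (illustrative shell only, not deepspeed's actual code; the function name is made up):

```shell
# Hypothetical sketch: deepspeed only attempts MPI discovery (and thus
# needs mpi4py) when the rendezvous env vars were never configured.
needs_mpi_discovery() {
  [ -z "$RANK" ] || [ -z "$LOCAL_RANK" ] || [ -z "$WORLD_SIZE" ]
}

if needs_mpi_discovery; then
  echo "no rendezvous env vars: deepspeed would try MPI discovery (mpi4py needed)"
else
  echo "env vars present: no MPI discovery, no mpi4py needed"
fi
```

This is why --use_deepspeed helps: deepspeed_launcher sets those variables, so the MPI path is never reached.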
Worked for me. I was using --deepspeed deepspeed_configs/zero1.json; changing it to --use_deepspeed worked.
Please check that this issue hasn't been reported before.
Expected Behavior
Training mixtral with axolotl
Current behaviour
Shows an error
If I try
pip install mpi4py
it shows this error
Steps to reproduce
start a machine with Dockerfile*
run axolotl mixtral
Config yaml
No response
Possible solution
No response
Which Operating Systems are you using?
Python Version
Python 3.9.16
axolotl branch-commit
main
Acknowledgements