Closed by briandw 6 months ago
Looks like I was missing the DeepSpeed configuration. Launching it this way:
NCCL_P2P_DISABLE=1 accelerate launch -m axolotl.cli.train examples/code-llama/7b/qlora.yml --deepspeed deepspeed/zero1.json
seems to have fixed the problem.
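For reference, a minimal ZeRO stage-1 DeepSpeed config in the spirit of the repo's deepspeed/zero1.json might look like the following. This is a sketch using standard DeepSpeed config keys, not necessarily the exact file shipped with axolotl:

```json
{
  "zero_optimization": {
    "stage": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "gradient_accumulation_steps": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
```

The "auto" values let accelerate fill in batch sizes and precision from its own launch configuration rather than hard-coding them in the JSON.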
Expected Behavior
I am trying to run the code llama example from
examples/code-llama/7b/qlora.yml
using a dual-4090 setup. I launch it with NCCL_P2P_DISABLE=1 accelerate launch -m axolotl.cli.train examples/code-llama/7b/qlora.yml and expect it to complete training.
Current behaviour
I get the following error:
Full output: errorlog.txt (attached)
Steps to reproduce
Ubuntu 22.04
NVIDIA driver 545.23.08 (NVIDIA-SMI 545.23.08)
CUDA compilation tools release 12.1, V12.1.105 (build cuda_12.1.r12.1/compiler.32688072_0)
2x RTX 4090
Fetch the image: docker pull winglian/axolotl:main-py3.10-cu121-2.1.1
Run docker: sudo docker run --privileged --gpus '"all"' --shm-size 10g --rm -it --name axolotl --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --mount type=volume,src=axolotl,target=${HOME}/workspace/axolotl -v ${HOME}/workspace/axolotl:/workspace/axolotl/shared -v ${HOME}/.cache/huggingface:/root/.cache/huggingface winglian/axolotl:main-py3.10-cu121-2.1.1
NCCL_P2P_DISABLE=1 accelerate launch -m axolotl.cli.train examples/code-llama/7b/qlora.yml
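When reproducing this, NCCL's own logging can confirm which GPU transport is selected. A hedged diagnostic sketch, using standard NCCL environment variables (not axolotl-specific); RTX 4090s are commonly reported to lack working peer-to-peer transfers, which is why NCCL_P2P_DISABLE=1 is in the launch command above:

```shell
# Disable peer-to-peer GPU transfers and enable NCCL startup logging,
# so the chosen transport (SHM/NET instead of P2P) is visible in the output.
export NCCL_P2P_DISABLE=1  # force NCCL to avoid peer-to-peer GPU transfers
export NCCL_DEBUG=INFO     # print NCCL topology/transport info at startup
# then launch exactly as in the step above:
# accelerate launch -m axolotl.cli.train examples/code-llama/7b/qlora.yml
```

With NCCL_DEBUG=INFO set, each rank prints lines describing the rings/channels and transports NCCL initialized, which helps distinguish a P2P hang from a configuration error.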
Config yaml
examples/code-llama/7b/qlora.yml from the repo
Possible solution
No response
Which Operating Systems are you using?
Linux (Ubuntu 22.04)
Python Version
Python 3.10.13 from the docker image
axolotl branch-commit
main/74532dd