
ValueError When Training CodeLlama Example on 2x 4090s #1039

Closed · briandw closed this issue 6 months ago

briandw commented 6 months ago

Please check that this issue hasn't been reported before.

Expected Behavior

I am trying to run the CodeLlama example from examples/code-llama/7b/qlora.yml on a dual-4090 setup. I launch it with:

NCCL_P2P_DISABLE=1 accelerate launch -m axolotl.cli.train examples/code-llama/7b/qlora.yml

I expect training to complete.

Current behaviour

I get the following error:

ValueError: Must flatten tensors with uniform dtype but got torch.bfloat16 and torch.float32

Full output: errorlog.txt
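This error generally means the parameters being flattened for multi-GPU training mix dtypes, for example bf16 base weights alongside float32 LoRA or norm parameters. Below is a hypothetical diagnostic sketch (not part of axolotl; `dtype_summary` is made up for illustration) that reports which parameter dtypes a model holds:

```python
from collections import Counter

import torch


def dtype_summary(model: torch.nn.Module) -> Counter:
    """Count how many parameters the model holds in each dtype."""
    return Counter(str(p.dtype) for p in model.parameters())


# Toy module that simulates the mismatch: a bf16 weight next to a float32 bias.
toy = torch.nn.Linear(8, 8).to(torch.bfloat16)
toy.bias = torch.nn.Parameter(toy.bias.to(torch.float32))
print(dtype_summary(toy))  # e.g. Counter({'torch.bfloat16': 1, 'torch.float32': 1})
```

If more than one dtype shows up among the parameters that the sharding backend tries to flatten, this ValueError is the expected failure mode.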

Steps to reproduce

Ubuntu 22.04
NVIDIA-SMI 545.23.08, Driver Version: 545.23.08
CUDA compilation tools, release 12.1, V12.1.105, Build cuda_12.1.r12.1/compiler.32688072_0
2x RTX 4090

Fetch the image: docker pull winglian/axolotl:main-py3.10-cu121-2.1.1

Run docker: sudo docker run --privileged --gpus '"all"' --shm-size 10g --rm -it --name axolotl --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --mount type=volume,src=axolotl,target=${HOME}/workspace/axolotl -v ${HOME}/workspace/axolotl:/workspace/axolotl/shared -v ${HOME}/.cache/huggingface:/root/.cache/huggingface winglian/axolotl:main-py3.10-cu121-2.1.1

NCCL_P2P_DISABLE=1 accelerate launch -m axolotl.cli.train examples/code-llama/7b/qlora.yml

Config yaml

examples/code-llama/7b/qlora.yml from the repo

Possible solution

No response

Which Operating Systems are you using?

Linux (Ubuntu 22.04)

Python Version

Python 3.10.13 from the docker image

axolotl branch-commit

main/74532dd

Acknowledgements

briandw commented 6 months ago

Looks like I was missing the DeepSpeed configuration. Launching it this way seems to have fixed the problem:

NCCL_P2P_DISABLE=1 accelerate launch -m axolotl.cli.train examples/code-llama/7b/qlora.yml --deepspeed deepspeed/zero1.json
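For reference, a ZeRO stage-1 DeepSpeed config along the lines of deepspeed/zero1.json typically looks roughly like the sketch below; this is an approximation rather than the exact file from the repo, and the "auto" values are filled in from the training settings by the Hugging Face / accelerate integration:

```json
{
  "zero_optimization": {
    "stage": 1,
    "overlap_comm": true
  },
  "bf16": {
    "enabled": "auto"
  },
  "fp16": {
    "enabled": "auto"
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
```

Axolotl configs also accept a `deepspeed:` entry pointing at such a file, which, if supported in this version, should be equivalent to passing --deepspeed on the command line.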