
ValueError When Training CodeLlama Example on 2x 4090s #1039

Closed · briandw closed this issue 6 months ago

briandw commented 6 months ago

Please check that this issue hasn't been reported before.

Expected Behavior

I am trying to run the CodeLlama example from examples/code-llama/7b/qlora.yml on a dual-4090 setup. I launch it with:

NCCL_P2P_DISABLE=1 accelerate launch -m axolotl.cli.train examples/code-llama/7b/qlora.yml

I expect training to complete.

Current behaviour

I get the following error:

ValueError: Must flatten tensors with uniform dtype but got torch.bfloat16 and torch.float32

Full output: errorlog.txt
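This error generally means the parameters being flattened for multi-GPU training mix dtypes, for example bf16 base weights alongside float32 LoRA or norm parameters. Below is a hypothetical diagnostic sketch (not part of axolotl; `dtype_summary` is made up for illustration) that reports which parameter dtypes a model holds:

```python
from collections import Counter

import torch


def dtype_summary(model: torch.nn.Module) -> Counter:
    """Count how many parameters the model holds in each dtype."""
    return Counter(str(p.dtype) for p in model.parameters())


# Toy module that simulates the mismatch: a bf16 weight next to a float32 bias.
toy = torch.nn.Linear(8, 8).to(torch.bfloat16)
toy.bias = torch.nn.Parameter(toy.bias.to(torch.float32))
print(dtype_summary(toy))  # e.g. Counter({'torch.bfloat16': 1, 'torch.float32': 1})
```

If more than one dtype shows up among the parameters that the sharding backend tries to flatten, this ValueError is the expected failure mode.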

Steps to reproduce

Ubuntu 22.04
NVIDIA-SMI 545.23.08, Driver Version: 545.23.08
CUDA compilation tools, release 12.1, V12.1.105, Build cuda_12.1.r12.1/compiler.32688072_0
2x RTX 4090

Fetch the image: docker pull winglian/axolotl:main-py3.10-cu121-2.1.1

Run docker: sudo docker run --privileged --gpus '"all"' --shm-size 10g --rm -it --name axolotl --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --mount type=volume,src=axolotl,target=${HOME}/workspace/axolotl -v ${HOME}/workspace/axolotl:/workspace/axolotl/shared -v ${HOME}/.cache/huggingface:/root/.cache/huggingface winglian/axolotl:main-py3.10-cu121-2.1.1

NCCL_P2P_DISABLE=1 accelerate launch -m axolotl.cli.train examples/code-llama/7b/qlora.yml

Config yaml

examples/code-llama/7b/qlora.yml from the repo

Possible solution

No response

Which Operating Systems are you using?

Linux (Ubuntu 22.04)

Python Version

Python 3.10.13 from the docker image

axolotl branch-commit

main/74532dd

Acknowledgements

briandw commented 6 months ago

Looks like I was missing the DeepSpeed configuration. Launching it this way seems to have fixed the problem:

NCCL_P2P_DISABLE=1 accelerate launch -m axolotl.cli.train examples/code-llama/7b/qlora.yml --deepspeed deepspeed/zero1.json
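For reference, a ZeRO stage-1 DeepSpeed config along the lines of deepspeed/zero1.json typically looks roughly like the sketch below; this is an approximation rather than the exact file from the repo, and the "auto" values are filled in from the training settings by the Hugging Face / accelerate integration:

```json
{
  "zero_optimization": {
    "stage": 1,
    "overlap_comm": true
  },
  "bf16": {
    "enabled": "auto"
  },
  "fp16": {
    "enabled": "auto"
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
```

Axolotl configs also accept a `deepspeed:` entry pointing at such a file, which, if supported in this version, should be equivalent to passing --deepspeed on the command line.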