Please check that this issue hasn't been reported before.
[X] I searched previous Bug Reports and didn't find any similar reports.
Expected Behavior
Capable of DPO training for 7B models in A6000-48G memory environment
Current behaviour
I used 8× A6000 (48 GiB) GPUs for DPO training of a 7B model, and OOM occurred during training. The run follows the official dolphin-2.6-mistral-7b-dpo recipe (dolphin-dpo.yml); I have tried to reduce GPU memory usage through config parameters (see the sketch after the list below), but OOM still occurs.
1. Comparison of the YAML configs:
1) Official DPO training YAML (https://huggingface.co/cognitivecomputations/dolphin-2.6-mistral-7b-dpo/blob/main/configs/dolphin-dpo.yml)
2) The YAML I am using
2. Screenshot of GPU memory usage:
3. OOM error screenshot:
4. DeepSpeed ZeRO configuration used:
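For context, here is a minimal sketch of the kind of memory-reducing overrides that were tried. This is not the actual dpo.yml from this run; it assumes the standard axolotl config keys (rl: dpo, micro_batch_size, gradient_checkpointing, deepspeed), and the values, the base_model, and the deepspeed_configs/zero3_bf16.json path are illustrative.

```yaml
# Illustrative sketch only -- not the actual dpo.yml used in this run.
# Assumes standard axolotl config keys; values are example memory-reduction settings.
base_model: cognitivecomputations/dolphin-2.6-mistral-7b  # assumed SFT checkpoint as the DPO starting point
rl: dpo                          # enable DPO training in axolotl

sequence_len: 2048               # shorter sequences cut activation memory
micro_batch_size: 1              # smallest per-GPU batch
gradient_accumulation_steps: 8   # keep the effective batch size up
gradient_checkpointing: true     # trade compute for activation memory
bf16: true
flash_attention: true

# Shard optimizer state and parameters across the 8 GPUs; the path assumes the
# JSON configs bundled with the axolotl repo (adjust to your checkout).
deepspeed: deepspeed_configs/zero3_bf16.json
```

Note that DPO keeps both the trainable policy and a frozen reference model resident, so full-weight 7B DPO may still run out of memory on 48 GiB cards even with settings like these.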
Steps to reproduce
docker: winglian/axolotl:main-py3.10-cu118-2.0.1

Start training:
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:256
export TOKENIZERS_PARALLELISM="true"
accelerate launch -m axolotl.cli.train dpo.yml
Config yaml
No response
Possible solution
No response
Which Operating Systems are you using?
Python Version
3.10
axolotl branch-commit
main
Acknowledgements