Open casper-hansen opened 9 months ago
I have faced hang issues after about 1:30 hours of training time with FT (full fine-tuning) and ZeRO-3.
With the same config I get an OOM while training on 5 nodes with 8 H100s each.
Any config other than the example 4-bit QLoRA that I have tried results in an OOM or some other error.
[2023-12-18 00:52:30,840] [ERROR] [axolotl.load_model:453] [PID:99] [RANK:7] CUDA out of memory. Tried to allocate 112.00 MiB (GPU 7; 79.11 GiB total capacity; 78.12 GiB already allocated; 40.62 MiB free; 78.23 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Cody
I have faced hang issues after about 1:30 hours of training time with FT (full fine-tuning) and ZeRO-3.
Same question.
You can try updating NCCL to 2.19.3.
Any updates on this error? I am seeing the same thing with a Llama-v2 full fine-tune using ZeRO-3.
I think this was solved by setting bf16 to true instead of auto in your DeepSpeed config.
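A minimal sketch of what that change could look like in a ZeRO-3 DeepSpeed JSON config; the surrounding keys are illustrative and should match whatever config you already pass to axolotl:

```json
{
  "zero_optimization": {
    "stage": 3
  },
  "bf16": {
    "enabled": true
  },
  "fp16": {
    "enabled": false
  },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}
```

The point is that `bf16.enabled` is a hard `true` rather than `"auto"`, so DeepSpeed does not have to infer the dtype from the trainer arguments.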
Does anyone still have this issue after trying casper's suggestion?
Please check that this issue hasn't been reported before.
Expected Behavior
That the model can start training after the DeepSpeed fix on main.
Current behaviour
The model loads and does not OOM, but DeepSpeed raises an assertion error when checking that all tensors have the same dtype:
Traceback
Steps to reproduce
Reuse the config I have provided and load the model on 8x A100s.
Config yaml
Possible solution
No response
Which Operating Systems are you using?
Python Version
3.8
axolotl branch-commit
main
Acknowledgements