Closed vhientran closed 6 months ago
I also got the warning below many times:
[deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648
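For context, that log line comes from DeepSpeed's dynamic loss scaler: when any fp16 gradient overflows, the optimizer step is skipped and the loss scale is halved. A minimal sketch of that behavior (illustrative only, not DeepSpeed's actual API):

```python
def step_with_loss_scaling(grads_have_inf: bool, loss_scale: float) -> tuple[bool, float]:
    """Sketch of DeepSpeed-style dynamic loss scaling (hypothetical helper).

    On overflow, skip the optimizer step and halve the scale, which is what
    produces logs like 'Attempted loss scale: 4294967296, reducing to 2147483648'.
    """
    if grads_have_inf:
        return False, loss_scale / 2   # skip this step, retry with a smaller scale
    return True, loss_scale            # apply the step, keep the current scale

took_step, new_scale = step_with_loss_scaling(True, 4294967296.0)
print(took_step, new_scale)   # False 2147483648.0
```

A few skipped steps early in training are normal; the warning is only a problem when it repeats indefinitely and the scale keeps shrinking.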
It looks like the classic overflow error with the newer version of the Hugging Face library.
--bf16
pip install git+https://github.com/fe1ixxu/ALMA.git@hf-install
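For anyone curious why --bf16 helps here: bf16 keeps fp32's exponent range, so values that overflow fp16 (max ≈ 65504) stay finite and the loss-scaling OVERFLOW path is never triggered. A rough NumPy sketch, emulating bf16 by truncating fp32 mantissa bits (to_bf16 is a hypothetical helper for illustration):

```python
import numpy as np

def to_bf16(x):
    # Hypothetical helper: emulate bfloat16 by zeroing the low 16 bits of float32,
    # keeping the sign bit, all 8 exponent bits, and 7 mantissa bits.
    a = np.asarray(x, dtype=np.float32)
    bits = a.view(np.uint32) & np.uint32(0xFFFF0000)
    return bits.view(np.float32)

scale = np.float32(2.0) ** 30                     # a large dynamic loss scale
grad = np.float32(1.0)

fp16_val = np.float16(grad) * np.float16(scale)   # fp16 max is 65504 -> overflows to inf
bf16_val = to_bf16(grad * scale)                  # bf16 shares fp32's exponent range

print(bool(np.isinf(fp16_val)))  # True: this is what trips the OVERFLOW check
print(bool(np.isinf(bf16_val)))  # False: the scaled value stays representable
```

The trade-off is precision rather than range: bf16 has fewer mantissa bits than fp16, which is usually acceptable for training large models.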
Thank you very much for your quick reply! I will try it and report the results back to you.
It doesn't work, since a CUDA out-of-memory error appears. So I decreased per_device_train_batch_size to 2 and gradient_accumulation_steps to 1. With that, it runs well, but it will take a long time to complete the pretraining stage. Anyway, many thanks for your reply!
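Note that shrinking those two knobs also shrinks the effective batch size, which is part of why pretraining takes longer (more optimizer steps, each processing fewer sequences). A quick sanity check, assuming all 8 GPUs are used for data parallelism:

```python
# Effective batch size per optimizer step; only the two values from this
# thread are given, the GPU count is the 8*A100 setup described above.
num_gpus = 8
per_device_train_batch_size = 2
gradient_accumulation_steps = 1

effective_batch = num_gpus * per_device_train_batch_size * gradient_accumulation_steps
print(effective_batch)  # 16 sequences per optimizer step
```

Raising gradient_accumulation_steps instead of the per-device batch size is the usual way to recover a larger effective batch without increasing peak memory much.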
Hi @fe1ixxu , thank you for releasing the source code of your great work! I tried to reproduce your experiments, e.g., pretraining LLaMA-2 7B using the script runs/mono_ft.sh. My local machine has 8*A100 GPUs, each with 40GB of memory. However, after running for a few hours, the pretraining process breaks down due to an OVERFLOW error as below:
What should I do to solve this problem? Many thanks for your help!
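For what it's worth, a back-of-the-envelope estimate shows why full fine-tuning a 7B model is tight on 40GB cards even before activations: the optimizer state alone dwarfs the weights. The numbers below assume mixed-precision AdamW (fp16 weights, fp32 master copy, two fp32 moments), which is an assumption, not a detail from this thread:

```python
# Rough per-model memory for full fine-tuning a 7B model with mixed-precision AdamW.
params = 7e9
bytes_weights_fp16 = params * 2   # fp16 working copy of the weights
bytes_master_fp32 = params * 4    # fp32 master weights
bytes_adam_states = params * 8    # two fp32 Adam moments (m and v)

total_gb = (bytes_weights_fp16 + bytes_master_fp32 + bytes_adam_states) / 1e9
print(round(total_gb))  # ~98 GB before activations and buffers
```

This is why ZeRO-style sharding across the 8 GPUs (or a smaller per-device batch, as above) is needed to fit on 40GB devices.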