fe1ixxu / ALMA

State-of-the-art LLM-based translation models.

Got Errors when pretraining LLaMA-2 on Monolingual Dataset #33

Closed vhientran closed 6 months ago

vhientran commented 6 months ago

Hi @fe1ixxu , Thank you for releasing the source code of your great work! I tried to reproduce your experiments, such as pretraining LLaMA-2 7B using the file runs/mono_ft.sh. My local machine has 8 A100 GPUs with 40GB each. However, after running for a few hours, the pretraining process breaks down with an OVERFLOW error, as shown below:

File "/vhtran/miniconda3/envs/env_alma_2024/lib/python3.8/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 175, in update_scale
    raise Exception(
Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.
Traceback (most recent call last):
  File "run_llmmt.py", line 227, in <module>
    main()
File "/vhtran/miniconda3/envs/env_alma_2024/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
run_llmmt.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-03-12_15:25:37
  host      : recgpu121
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 1402889)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-03-12_15:25:37
  host      : recgpu121
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 1402890)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-03-12_15:25:37
  host      : recgpu121
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 1402891)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2024-03-12_15:25:37
  host      : recgpu121
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 1402892)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2024-03-12_15:25:37
  host      : recgpu121
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 1402893)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2024-03-12_15:25:37
  host      : recgpu121
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 1402894)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
  time      : 2024-03-12_15:25:37
  host      : recgpu121
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 1402895)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-12_15:25:37
  host      : recgpu121
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1402888)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

What should I do to solve this problem? Many thanks for your help!
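
For reference, the launch presumably resembled the sketch below. This is a hypothetical reconstruction, not the actual contents of runs/mono_ft.sh: the launcher, config path, model id, batch sizes, and output directory are assumptions, while the flag names are standard HuggingFace Trainer arguments consumed by run_llmmt.py.

    # Hypothetical launch sketch -- NOT the exact contents of runs/mono_ft.sh.
    # One process per A100; flag names follow the HuggingFace Trainer /
    # DeepSpeed integration; all concrete values are placeholders.
    torchrun --nproc_per_node 8 run_llmmt.py \
        --model_name_or_path meta-llama/Llama-2-7b-hf \
        --deepspeed ds_config.json \
        --do_train \
        --fp16 True \
        --per_device_train_batch_size 4 \
        --gradient_accumulation_steps 4 \
        --output_dir ./llama2-7b-mono-ft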

vhientran commented 6 months ago

I also got the warning below many times:

[deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648
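
For context, this warning comes from DeepSpeed's dynamic loss scaler: the scale starts high (2^32 in the message above) and is halved after every step that overflows, and the run aborts once the scale cannot be reduced any further, which is exactly the exception in the traceback. A minimal sketch of the fp16 section of a DeepSpeed config that controls this behaviour (the file name is hypothetical; the keys are standard DeepSpeed fp16 options):

    # Sketch only: writes a hypothetical DeepSpeed config fragment.
    # "loss_scale": 0 selects dynamic loss scaling; "initial_scale_power" and
    # "min_loss_scale" bound the scale whose overflow messages appear above.
    echo '{
      "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "initial_scale_power": 16,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
      }
    }' > ds_config_sketch.json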

fe1ixxu commented 6 months ago

This looks like the classic error caused by newer versions of the Hugging Face transformers library.

  1. Try to use --bf16 (see the sketch below).
  2. Or uninstall transformers and reinstall it with pip install git+https://github.com/fe1ixxu/ALMA.git@hf-install
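
A minimal sketch of option 1, assuming runs/mono_ft.sh forwards standard HuggingFace Trainer flags (the launcher and the other arguments are placeholders, not the actual script contents):

    # Hypothetical sketch: switch mixed precision from fp16 to bf16.
    # --bf16 is a standard HuggingFace Trainer flag; A100 GPUs support
    # bfloat16, which has the fp32 exponent range, needs no dynamic loss
    # scaling, and so avoids the "loss scale already at minimum" abort.
    torchrun --nproc_per_node 8 run_llmmt.py \
        --model_name_or_path meta-llama/Llama-2-7b-hf \
        --deepspeed ds_config.json \
        --do_train \
        --bf16 True \
        --output_dir ./llama2-7b-mono-ft

If a DeepSpeed config is passed as well, its fp16/bf16 sections may need to be switched accordingly (the HuggingFace integration handles this automatically when the config uses "auto" values).
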
vhientran commented 6 months ago

Thank you very much for your quick reply! I will try it and report the results back to you.

vhientran commented 6 months ago

It doesn't work, since a CUDA out-of-memory error appears. So I decreased per_device_train_batch_size to 2 and gradient_accumulation_steps to 1. With those settings it runs well, although it will take a long time to complete the pretraining stage. Anyway, many thanks for your reply!
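
For reference, a sketch of the adjusted launch (hypothetical command: only --per_device_train_batch_size 2 and --gradient_accumulation_steps 1 reflect what is reported in this thread; everything else is a placeholder):

    # Hypothetical sketch of the workaround described above. It assumes --bf16
    # from the earlier suggestion is kept; only the two batch-related values
    # come from this thread, the remaining arguments are placeholders.
    torchrun --nproc_per_node 8 run_llmmt.py \
        --model_name_or_path meta-llama/Llama-2-7b-hf \
        --deepspeed ds_config.json \
        --do_train \
        --bf16 True \
        --per_device_train_batch_size 2 \
        --gradient_accumulation_steps 1 \
        --output_dir ./llama2-7b-mono-ft

The smaller per-device batch lowers peak memory so the run fits in 40GB, but it also reduces GPU utilization and the effective batch size, which is the likely reason the pretraining stage now takes noticeably longer.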