
FuseAI Project
https://huggingface.co/FuseAI

Out of Memory Issue with Blending for 14B Base Model #10

Closed sigridjineth closed 5 months ago

sigridjineth commented 5 months ago

Description

I am currently attempting to blend models using "OrionStarAI/Orion-14B-Base" as the base model, with blending operations targeting "beomi/OPEN-SOLAR-KO-10.7B" and "beomi/Yi-Ko-6B". During these operations, I am encountering an Out of Memory (OOM) issue.

Environment

Hardware: 8x NVIDIA A100 80GB GPUs

It seems peculiar that I'm running into CUDA OOM errors given the hardware capacity, especially when attempting to work with the 14B model. Has anyone successfully attempted to blend with the 14B base model without encountering memory issues?

I would appreciate insights into whether there are specific configurations or optimizations, possibly involving the management of logits values or the general representation of memory values on the GPU, that could help in mitigating these memory-related challenges.

Could you check the Jupyter notebook below to help debug this?

https://drive.google.com/file/d/1ROj4F_FWsdaF6QGlEI2arMnBJ5P2xtWE/view?usp=sharing

DeepSpeed Command

import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:50"
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6,7"
# --include localhost:0,3,4,5,6,7 
# --exclude=localhost:1,2
!deepspeed --master_port=20001 ./FuseLLM/FuseLLM/src/train.py \
  --training_mode full \
  --deepspeed /home/sionic/sigrid/FuseLLM/FuseLLM/config/zero_stage2_config.json \
  --model_name_or_path "OrionStarAI/Orion-14B-Base" \
  --output_dir "/home/sionic/sigrid/fusellm-test/240313/output" \
  --model_max_length 2048 \
  --logging_steps 1 \
  --save_strategy steps \
  --save_steps 500 \
  --save_total_limit 1 \
  --logging_strategy steps \
  --do_train \
  --do_distill \
  --bf16 True \
  --tf32 False \
  --warmup_ratio 0.008 \
  --lr_scheduler_type cosine \
  --dataset_name "/home/sionic/sigrid/fusellm-test/datasets/final/240313_dataset_2" \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 1 \
  --num_train_epochs 1 \
  --optim adamw_torch \
  --adam_beta1 0.9 \
  --adam_beta2 0.95 \
  --learning_rate 1e-6 \
  --weight_decay 0.1 \
  --max_grad_norm 1.0 \
  --seed 42 \
  --gradient_checkpointing False \
  --use_flash_attn True \
  --lm_loss_weight 0.9 \
  --distill_greater_as_gt True \
  --distill_greater_as_gt_type "hard" \
  --dataloader_num_workers 1 \
  --report_to wandb \
  --remove_unused_columns False \
  --safe_serialization False

Error Stack

Even after sampling only 0.00001% of the original dataset, the same OOM still occurs.

GPU 1
 File "/home/sionic/.cache/huggingface/modules/transformers_modules/OrionStarAI/Orion-14B-Base/87d96b1852d58c4f605f86e8437d47ab7ec89e1d/modeling_orion.py", line 599, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/sionic/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/sionic/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sionic/.cache/huggingface/modules/transformers_modules/OrionStarAI/Orion-14B-Base/87d96b1852d58c4f605f86e8437d47ab7ec89e1d/modeling_orion.py", line 337, in forward
    attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 1 has a total capacty of 79.15 GiB of which 255.31 MiB is free. Including non-PyTorch memory, this process has 78.89 GiB memory in use. Of the allocated memory 77.45 GiB is allocated by PyTorch, and 21.78 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

GPU 7

File "/home/sionic/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sionic/.cache/huggingface/modules/transformers_modules/OrionStarAI/Orion-14B-Base/87d96b1852d58c4f605f86e8437d47ab7ec89e1d/modeling_orion.py", line 353, in forward
    attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
  File "/home/sionic/.venv/lib/python3.10/site-packages/torch/nn/functional.py", line 1858, in softmax
    ret = input.softmax(dim, dtype=dtype)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 640.00 MiB. GPU 7 has a total capacty of 79.15 GiB of which 399.31 MiB is free. Including non-PyTorch memory, this process has 78.75 GiB memory in use. Of the allocated memory 77.45 GiB is allocated by PyTorch, and 21.78 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Thank you for any assistance or suggestions you might provide.

18907305772 commented 5 months ago

I used Llama-2-13B as the base model on 8x 40GB A100 GPUs, and changed --deepspeed config/zero_stage2_config.json to --deepspeed config/zero_stage3_config.json. You can try using ZeRO-3 to resolve the OOM error. Here is the zero_stage3_config.json file; the corresponding change to the launch command is sketched after the config.

{
    "bf16": {
        "enabled": "auto"
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto",
            "total_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "sub_group_size": 1e9,
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": "auto"
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
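
For reference, the only change to the launch command itself is the --deepspeed path; every other argument can stay exactly as in your ZeRO-2 invocation. A minimal sketch (the config file path below is an assumption based on where your other config lives):

!deepspeed --master_port=20001 ./FuseLLM/FuseLLM/src/train.py \
  --deepspeed /home/sionic/sigrid/FuseLLM/FuseLLM/config/zero_stage3_config.json \
  ...  # all remaining arguments unchanged from the ZeRO-2 command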

If you are using ZeRO-3, you need to run zero_to_fp32.py after training. This Python script can be found in the directory where the model was saved; it consolidates the sharded ZeRO-3 checkpoint into full fp32 model weights.
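
A minimal usage sketch, assuming the checkpoint was written to the --output_dir from your command (the exact arguments can vary between DeepSpeed versions, so check python zero_to_fp32.py --help):

# consolidate the sharded ZeRO-3 checkpoint into a single fp32 state dict
!python /home/sionic/sigrid/fusellm-test/240313/output/zero_to_fp32.py \
  /home/sionic/sigrid/fusellm-test/240313/output pytorch_model.bin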

sigridjineth commented 5 months ago

@18907305772 hello, thanks for your prompt response! I am now having difficulties saving the trainer state at the last stage of training, with the error message stating that Object of type Tensor is not JSON serializable ...

{'loss': 2.1677, 'grad_norm': tensor(0.3639, device='cuda:0'), 'learning_rate': 1.8518518518518518e-08, 'epoch': 0.99}
wandb: WARNING (User provided step: 13824 is less than current step: 13825. Dropping entry: {'Train/Samples/train_loss': 2.299144983291626, '_timestamp': 1710404279.4219217}).
100%|██████████| 110/110 [1:02:04<00:00, 33.86s/it]
Traceback (most recent call last):
  File "/home/sionic/sigrid/./FuseLLM/FuseLLM/src/train.py", line 136, in <module>
    train()
  File "/home/sionic/sigrid/./FuseLLM/FuseLLM/src/train.py", line 110, in train
    trainer.save_state()
  File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/trainer_pt_utils.py", line 1045, in save_state
    self.state.save_to_json(path)
  File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/trainer_callback.py", line 113, in save_to_json
    json_string = json.dumps(dataclasses.asdict(self), indent=2, sort_keys=True) + "\n"
  File "/usr/lib/python3.10/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
  File "/usr/lib/python3.10/json/encoder.py", line 201, in encode
    chunks = list(chunks)
  File "/usr/lib/python3.10/json/encoder.py", line 431, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/usr/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/usr/lib/python3.10/json/encoder.py", line 325, in _iterencode_list
    yield from chunks
  File "/usr/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/usr/lib/python3.10/json/encoder.py", line 438, in _iterencode
    o = _default(o)
  File "/usr/lib/python3.10/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type Tensor is not JSON serializable
{'loss': 2.1581, 'grad_norm': tensor(0.3962, device='cuda:0'), 'learning_rate': 9.259259259259259e-09, 'epoch': 1.0}
{'train_runtime': 3724.1605, 'train_samples_per_second': 3.794, 'train_steps_per_second': 0.03, 'train_loss': 2.1972951780666006, 'epoch': 1.0}
***** train metrics *****
  epoch                    =        1.0
  train_loss               =     2.1973
  train_runtime            = 1:02:04.16
  train_samples_per_second =      3.794
  train_steps_per_second   =       0.03
wandb: - 0.070 MB of 0.070 MB uploaded
wandb: Run history:
wandb:               Train/Samples/lr ▁███▇▇▇▇▇▇▆▆▆▆▆▅▅▅▅▅▅▄▄▄▄▄▃▃▃▃▃▃▂▂▂▂▂▁▁▁
wandb:       Train/Samples/train_loss ▁
wandb:                    train/epoch ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
wandb:              train/global_step ▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
wandb:                train/grad_norm ██▇▇▆▆▅▄▄▃▃▃▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▂▁▁▁▁
wandb:            train/learning_rate ▁███▇▇▇▇▇▇▆▆▆▆▆▅▅▅▅▅▅▄▄▄▄▄▃▃▃▃▃▃▂▂▂▂▂▁▁▁
wandb:                     train/loss ▆█▆▆▅▆▄▄▄▅▁▂▄▄▄▂▃▁▃▄▃▅▃▃▃▄▂▃▃▂▂▂▂▃▃▂▄▂▁
wandb:               train/total_flos ▁
wandb:               train/train_loss ▁
wandb:            train/train_runtime ▁
wandb: train/train_samples_per_second ▁
wandb:   train/train_steps_per_second ▁
wandb: 
wandb: Run summary:
wandb:               Train/Samples/lr 0.0
wandb:       Train/Samples/train_loss 2.2144
wandb:                    train/epoch 1.0
wandb:              train/global_step 110
wandb:                train/grad_norm 0.39623
wandb:            train/learning_rate 0.0
wandb:                     train/loss 2.1581
wandb:               train/total_flos 2.4335449625880166e+18
wandb:               train/train_loss 2.1973
wandb:            train/train_runtime 3724.1605
wandb: train/train_samples_per_second 3.794
wandb:   train/train_steps_per_second 0.03
wandb: 
wandb: 🚀 View run boysenberry-cake-96 at: https://wandb.ai/academickhu/fusellm/runs/otkcja47
wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20240314_151440-otkcja47/logs
wandb: WARNING (User provided step: 13952 is less than current step: 13953. Dropping entry: {'Train/Samples/train_loss': 2.162116765975952, '_timestamp': 1710404313.0872948}).
[2024-03-14 16:18:52,125] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 3814010) of binary: /home/sionic/.venv/bin/python
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/sionic/.venv/lib/python3.10/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/home/sionic/.venv/lib/python3.10/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/home/sionic/.venv/lib/python3.10/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/home/sionic/.venv/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/sionic/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/sionic/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
./FuseLLM/FuseLLM/src/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-14_16:18:52
  host      : iZmj7ir0ircgij46j89st9Z
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3814010)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

It seems that train.py has an issue converting tensors to JSON when saving the trainer state - can you think of any workarounds for this?

My command was the following, which launches with torch.distributed.launch instead of the deepspeed launcher:

import os
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:50'
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6,7"  # Use only GPU 0

!python -m torch.distributed.launch --nproc_per_node=8 ./FuseLLM/FuseLLM/src/train.py \
  --training_mode full \
  --deepspeed /home/sionic/sigrid/FuseLLM/FuseLLM/config/zero_stage2_config.json \
  --model_name_or_path "OrionStarAI/Orion-14B-Base" \
  --output_dir "/home/sionic/sigrid/fusellm-test/240313/output" \
  --model_max_length 2048 \
  --logging_steps 1 \
  --save_strategy steps \
  --save_steps 500 \
  --save_total_limit 1 \
  --logging_strategy steps \
  --do_train \
  --do_distill \
  --bf16 True \
  --tf32 False \
  --warmup_ratio 0.008 \
  --lr_scheduler_type cosine \
  --dataset_name "/home/sionic/sigrid/fusellm-test/datasets/final/240313_packing_set_with_valid" \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 16 \
  --num_train_epochs 1 \
  --optim adamw_torch \
  --adam_beta1 0.9 \
  --adam_beta2 0.95 \
  --learning_rate 1e-6 \
  --weight_decay 0.1 \
  --max_grad_norm 1.0 \
  --seed 42 \
  --gradient_checkpointing True \
  --use_flash_attn True \
  --lm_loss_weight 0.9 \
  --distill_greater_as_gt True \
  --distill_greater_as_gt_type "hard" \
  --dataloader_num_workers 1 \
  --report_to wandb \
  --remove_unused_columns False \
  --safe_serialization False

18907305772 commented 5 months ago

There might be an issue with saving the grad_norm. The same issue was reported in the transformers project, and it appears to have been resolved there. You can try updating to the latest development version of transformers; alternatively, downgrading to transformers==4.36 might also fix the issue. Here's a reference link: https://github.com/huggingface/transformers/pull/29568.
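
If pinning or upgrading transformers is inconvenient, another rough workaround is to cast the offending Tensor entries to plain Python floats before the state is saved. This is only a sketch, not part of the FuseLLM code; it assumes the failure comes from Tensor-valued grad_norm entries in trainer.state.log_history and would be called in train.py right before trainer.save_state():

import torch

def sanitize_log_history(trainer):
    # Cast any Tensor values (e.g. grad_norm) in the trainer state to floats
    # so that save_state() can serialize the state to JSON.
    for entry in trainer.state.log_history:
        for key, value in entry.items():
            if torch.is_tensor(value):
                entry[key] = value.item()

# e.g. in train.py:
# sanitize_log_history(trainer)
# trainer.save_state()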