InternLM / xtuner

An efficient, flexible and full-featured toolkit for fine-tuning LLM (InternLM2, Llama3, Phi3, Qwen, Mistral, ...)
https://xtuner.readthedocs.io/zh-cn/latest/
Apache License 2.0

CUDA out of memory #88

Open RickMeow opened 1 year ago

RickMeow commented 1 year ago

Describe the bug: CUDA out of memory. I'm fine-tuning Llama-2-70B on 3 machines, each with 8×A100 (40GB), i.e. 24×A100 (40GB) in total. At first this looked like a plain out-of-memory issue, but according to my calculations there should be more than enough memory available.
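
(For clarity, the back-of-envelope arithmetic behind that statement is sketched below, with illustrative numbers only.)

    # Illustrative arithmetic only (assumed round numbers, not measurements).
    nodes, gpus_per_node, gib_per_gpu = 3, 8, 40

    aggregate_gib = nodes * gpus_per_node * gib_per_gpu  # 960 GiB across the whole cluster
    per_gpu_gib = gib_per_gpu                            # 40 GiB hard limit on each card

    # The error message is about the 40 GiB per-GPU limit; as the replies below
    # explain, under DDP the aggregate 960 GiB is never pooled into one space.
    print(f"aggregate: {aggregate_gib} GiB, per-GPU limit: {per_gpu_gib} GiB")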

To Reproduce

  1. pip install xtuner
  2. I replaced the Hugging Face model path in llama2_70b_qlora_open_platypus_e1.py with a locally downloaded copy of Llama-2-70b-hf (and likewise the dataset path):
    # model
    pretrained_model_name_or_path = '/mnt/model/Llama-2-70b-hf'
    # and also the dataset
    data_path = '/mnt/model/Open-Platypus'
  3. Master(A100*8): NPROC_PER_NODE=8 NNODES=3 NODE_RANK=0 PORT=34545 ADDR=192.168.0.6 xtuner train llama2_70b_qlora_open_platypus_e1
    (and likewise on the other two 8×A100 nodes, with NODE_RANK=1 and NODE_RANK=2)

System info

ERROR record

    model = MMDistributedDataParallel(
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/mmengine/model/wrappers/distributed.py", line 93, in __init__
    super().__init__(module=module, **kwargs)
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 688, in __init__
    self._ddp_init_helper(
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 825, in _ddp_init_helper
    self.reducer = dist.Reducer(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.11 GiB (GPU 7; 39.45 GiB total capacity; 37.14 GiB already allocated; 1.60 GiB free; 37.30 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
(The same OutOfMemoryError was raised concurrently by the ranks on GPUs 4, 5 and 6; their interleaved output is omitted.)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2120836) of binary: /mnt/anaconda/envs/xtuner/bin/python
Traceback (most recent call last):
  File "/mnt/anaconda/envs/xtuner/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/xtuner/tools/train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-09-02_01:48:53
  host      : gzyd29
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 2120837)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2023-09-02_01:48:53
  host      : gzyd29
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 2120838)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2023-09-02_01:48:53
  host      : gzyd29
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 2120839)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2023-09-02_01:48:53
  host      : gzyd29
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 2120840)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2023-09-02_01:48:53
  host      : gzyd29
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 2120841)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2023-09-02_01:48:53
  host      : gzyd29
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 2120842)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
  time      : 2023-09-02_01:48:53
  host      : gzyd29
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 2120843)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-09-02_01:48:53
  host      : gzyd29
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2120836)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
LZHgrla commented 1 year ago

Hi! @RickMeow

You are using DDP, so you need to ensure that each GPU can fully load the Llama2-70b QLoRA model, but this is challenging for a 40GB GPU.
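
For a rough sense of scale, here is a back-of-envelope sketch (assumed numbers, not measurements) of why a full QLoRA replica of a 70B model is already tight on a 40GB card:

    # Rough per-GPU footprint under DDP + QLoRA for a 70B model.
    # Back-of-envelope assumption: 4-bit (NF4) weights take ~0.5 bytes per parameter.
    params = 70e9
    frozen_weights_gib = params * 0.5 / 2**30  # ~32.6 GiB of quantized weights per replica

    # Quantization scales, LoRA adapters, gradients, optimizer states, activations
    # and the CUDA context all come on top of this per-replica figure, which is how
    # the total climbs to the ~55 GB per GPU mentioned later in this thread.
    print(f"frozen 4-bit weights alone: ~{frozen_weights_gib:.1f} GiB per GPU")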

We have added a DeepSpeed ZeRO2-offload config in https://github.com/InternLM/xtuner/pull/94; you can copy this JSON config and try it by adding --deepspeed PATH_TO_JSON_CFG to all commands.
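
For reference, below is a minimal sketch of what a ZeRO-2 + optimizer-offload DeepSpeed config typically contains, written with standard DeepSpeed schema keys. It is not necessarily identical to the JSON added in that PR (the batch-size values here are placeholders), so prefer copying the file from the PR itself:

    import json

    # Minimal ZeRO-2 + optimizer-offload sketch using the public DeepSpeed schema.
    # NOT guaranteed to match the JSON shipped in the PR; batch-size values are
    # placeholders and must be kept consistent with the training config.
    cfg = {
        "train_micro_batch_size_per_gpu": 1,
        "gradient_accumulation_steps": 16,
        "gradient_clipping": 1.0,
        "fp16": {"enabled": True},
        "zero_optimization": {
            "stage": 2,
            "overlap_comm": True,
            "contiguous_gradients": True,
            "offload_optimizer": {"device": "cpu", "pin_memory": True},
        },
    }

    with open("deepspeed_zero2_offload.json", "w") as f:
        json.dump(cfg, f, indent=2)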

Welcome your feedback!

RickMeow commented 1 year ago

Hi! @RickMeow

You are using DDP, so you need to ensure that each GPU can fully load the Llama2-70b QLoRA model, but this is challenging for a 40GB GPU.

  • If no optimization is added, it will require approximately 55GB memory per GPU.
  • If ZeRO2 is used (--deepspeed deepspeed_zero2), it will require approximately 52GB memory per GPU.
  • If ZeRO2-offload is used (--deepspeed deepspeed_zero2_offload), less GPU memory will be used since it will offload the optimizer memory from the GPU to the host CPU. With this optimization, it may meet the training requirements on a 40GB GPU. However, I currently do not have a 40GB A100 on hand, so I am unable to provide an accurate value.

We have added a DeepSpeed ZeRO2-offload config in #94; you can copy this JSON config and try it by adding --deepspeed PATH_TO_JSON_CFG to all commands.

Welcome your feedback!

Thank you for your professional response!

  1. My previous calculation only took the overall memory of all the machines into account, which does indeed seem incorrect.

  2. Thank you very much for uploading the DeepSpeed ZeRO2-offload configuration. I tested it right away but still got the OOM error; here are (a) the command used and (b) the error report.

(a) Command Used

NPROC_PER_NODE=8 NNODES=3 NODE_RANK=0 PORT=34545 ADDR=192.168.0.6 xtuner train llama2_70b_qlora_open_platypus_e1 --deepspeed /mnt/git/xtuner/xtuner/configs/deepspeed/deepspeed_zero2_offload.json

The other two nodes used the same command, with NODE_RANK=1 and NODE_RANK=2 respectively.

(b) Error Report

……
Runtime environment:
    launcher: pytorch
    randomness: {'seed': None, 'deterministic': False}
    cudnn_benchmark: False
    mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
    dist_cfg: {'backend': 'nccl'}
    seed: None
    deterministic: False
    Distributed launcher: pytorch
    Distributed training: True
    GPU number: 24
……
Map: 100%|███████████████████████| 24926/24926 [00:10<00:00, 2403.99 examples/s]
quantization_config convert to <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>
Loading checkpoint shards: 100%|████████████████| 15/15 [01:10<00:00,  4.68s/it]
Loading checkpoint shards: 100%|████████████████| 15/15 [01:10<00:00,  4.70s/it]
Loading checkpoint shards: 100%|████████████████| 15/15 [01:10<00:00,  4.69s/it]
Loading checkpoint shards: 100%|████████████████| 15/15 [01:10<00:00,  4.69s/it]
Loading checkpoint shards: 100%|████████████████| 15/15 [01:10<00:00,  4.69s/it]
Loading checkpoint shards: 100%|████████████████| 15/15 [01:10<00:00,  4.68s/it]
Loading checkpoint shards: 100%|████████████████| 15/15 [01:10<00:00,  4.71s/it]
Loading checkpoint shards: 100%|████████████████| 15/15 [01:10<00:00,  4.71s/it]
/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/xtuner/model/fast_forward/__init__.py:18: UserWarning: Due to the implementation of the PyTorch version of flash attention, even when the `output_attentions` flag is set to True, it is not possible to return the `attn_weights`.
  warnings.warn(
……
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1001, in <lambda>
    param_applied = fn(param)
    param_applied = fn(param)  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1001, in <lambda>

  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1001, in <lambda>
    param_applied = fn(param)
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1001, in <lambda>
    return self._apply(lambda t: t.half() if t.is_floating_point() else t)
    return self._apply(lambda t: t.half() if t.is_floating_point() else t)
torch.cuda    .return self._apply(lambda t: t.half() if t.is_floating_point() else t)    
return self._apply(lambda t: t.half() if t.is_floating_point() else t)OutOfMemoryErrortorch.cuda
    : .return self._apply(lambda t: t.half() if t.is_floating_point() else t)torch.cudaOutOfMemoryErrorCUDA out of memory. Tried to allocate 500.00 MiB (GPU 5; 39.45 GiB total capacity; 37.14 GiB already allocated; 231.31 MiB free; 37.30 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
.: 
OutOfMemoryError    CUDA out of memory. Tried to allocate 500.00 MiB (GPU 3; 39.45 GiB total capacity; 37.14 GiB already allocated; 231.31 MiB free; 37.30 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF: return self._apply(lambda t: t.half() if t.is_floating_point() else t)        torch.cuda

CUDA out of memory. Tried to allocate 500.00 MiB (GPU 0; 39.45 GiB total capacity; 37.14 GiB already allocated; 231.31 MiB free; 37.30 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONFreturn self._apply(lambda t: t.half() if t.is_floating_point() else t)return self._apply(lambda t: t.half() if t.is_floating_point() else t).
OutOfMemoryError

: torch.cuda.OutOfMemoryErrorCUDA out of memory. Tried to allocate 500.00 MiB (GPU 2; 39.45 GiB total capacity; 37.14 GiB already allocated; 231.31 MiB free; 37.30 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF: torch.cuda
torch.cudaCUDA out of memory. Tried to allocate 500.00 MiB (GPU 1; 39.45 GiB total capacity; 37.14 GiB already allocated; 231.31 MiB free; 37.30 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF..
OutOfMemoryErrorOutOfMemoryError: : CUDA out of memory. Tried to allocate 500.00 MiB (GPU 4; 39.45 GiB total capacity; 37.14 GiB already allocated; 231.31 MiB free; 37.30 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONFCUDA out of memory. Tried to allocate 500.00 MiB (GPU 6; 39.45 GiB total capacity; 37.14 GiB already allocated; 231.31 MiB free; 37.30 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 500.00 MiB (GPU 7; 39.45 GiB total capacity; 37.14 GiB already allocated; 231.31 MiB free; 37.30 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2718135) of binary: /mnt/anaconda/envs/xtuner/bin/python
ERROR:torch.distributed.elastic.agent.server.api:Error waiting on exit barrier. Elapsed: 307.6121425628662 seconds
Traceback (most recent call last):
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 920, in _exit_barrier
    store_util.barrier(
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 78, in barrier
    synchronize(store, data, rank, world_size, key_prefix, barrier_timeout)
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 64, in synchronize
    agent_data = get_all(store, rank, key_prefix, world_size)
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 34, in get_all
    data = store.get(f"{prefix}{idx}")
RuntimeError: Socket Timeout
Traceback (most recent call last):
  File "/mnt/anaconda/envs/xtuner/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/xtuner/tools/train.py FAILED
------------------------------------------------------------
  3. After the OOM error with ZeRO2-offload, I tested the two ZeRO3 configurations (using the same command as before, only changing the path to the DeepSpeed config file) and still got OOM, so I'm providing the test reports as feedback.

deepspeed_zero3.json:

Traceback (most recent call last):
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/xtuner/tools/train.py", line 225, in <module>
    main()
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/xtuner/tools/train.py", line 221, in main
    runner.train()
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1180, in train
    self.strategy.prepare(
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/mmengine/_strategy/deepspeed.py", line 176, in prepare
    model = self.build_model(model)
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/mmengine/_strategy/base.py", line 306, in build_model
    model = MODELS.build(model)
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 232, in build_model_from_cfg
    return build_from_cfg(cfg, registry, default_args)
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/xtuner/model/sft.py", line 35, in __init__
    self._prepare_for_lora(peft_model, use_gradient_checkpointing)
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/xtuner/model/sft.py", line 64, in _prepare_for_lora
    self.llm = get_peft_model(self.llm, self.lora)
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/peft/mapping.py", line 106, in get_peft_model
    return MODEL_TYPE_TO_PEFT_MODEL_MAPPING[peft_config.task_type](model, peft_config, adapter_name=adapter_name)
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/peft/peft_model.py", line 889, in __init__
    super().__init__(model, peft_config, adapter_name)
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/peft/peft_model.py", line 111, in __init__
    self.base_model = PEFT_TYPE_TO_MODEL_MAPPING[peft_config.peft_type](
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/peft/tuners/lora.py", line 274, in __init__
    super().__init__(model, config, adapter_name)
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 88, in __init__
    self.inject_adapter(self.model, adapter_name)
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 219, in inject_adapter
    self._create_and_replace(peft_config, adapter_name, target, target_name, parent, **optionnal_kwargs)
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/peft/tuners/lora.py", line 373, in _create_and_replace
    self._replace_module(parent, target_name, new_module, target)
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/peft/tuners/lora.py", line 390, in _replace_module
    module.to(child.weight.device)
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1145, in to
    return self._apply(convert)
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/nn/modules/module.py", line 820, in _apply
    param_applied = fn(param)
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 39.45 GiB total capacity; 36.97 GiB already allocated; 20.31 MiB free; 37.10 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

deepspeed_zero3_offload.json:

……
Traceback (most recent call last):
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/xtuner/tools/train.py", line 225, in <module>
    main()
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/xtuner/tools/train.py", line 221, in main
    runner.train()
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1180, in train
    self.strategy.prepare(
……
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/mmengine/_strategy/deepspeed.py", line 196, in _wrap_model
    engine, self.optim_wrapper.optimizer, *_ = deepspeed.initialize(
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/deepspeed/__init__.py", line 171, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 261, in __init__
    self._configure_distributed_model(model)
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1066, in _configure_distributed_model
    self.module.half()
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1001, in half
    return self._apply(lambda t: t.half() if t.is_floating_point() else t)
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/nn/modules/module.py", line 820, in _apply
    param_applied = fn(param)
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1001, in <lambda>
    return self._apply(lambda t: t.half() if t.is_floating_point() else t)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 500.00 MiB (GPU 0; 39.45 GiB total capacity; 37.14 GiB already allocated; 231.31 MiB free; 37.30 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
(The same OutOfMemoryError was raised concurrently by the ranks on GPUs 1, 3 and 4; their interleaved output is omitted.)
……
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/xtuner/tools/train.py FAILED

Thanks again, I'll follow up on this project!

LZHgrla commented 1 year ago

It seems that 40GB memory is not enough for 70B-QLoRA, even with deepspeed_zero2_offload.

You can also try to reduce the length of each sample by setting max_length to 512 in your config. This will further reduce the memory requirements.
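
Illustratively, that is a one-line change in the config. The snippet below is a hypothetical excerpt (the variable name is taken from the advice above, not from the actual file):

    # Hypothetical excerpt from llama2_70b_qlora_open_platypus_e1.py.
    # Assumption: the config exposes a module-level `max_length` used when the
    # dataset is tokenized; shorter sequences mean smaller activation tensors.
    max_length = 512  # reduced from the default to cut per-sample memory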


Additionally, it is worth noting that when using ZeRO3 with QLoRA, the frozen model weights will not be partitioned, so it brings hardly any memory savings. ZeRO2-offload is already the most aggressive configuration available.

amulil commented 1 year ago

It seems that 40GB memory is not enough for 70B-QLoRA, even with deepspeed_zero2_offload.

You can also try to reduce the length of each sample by setting max_length to 512 in your config. This will further reduce the memory requirements.


Additionally, it is worth noting that when using ZeRO3 with QLoRA, the frozen model weights will not be partitioned, so it brings hardly any memory savings. ZeRO2-offload is already the most aggressive configuration available.

It seems that only by using ZeRO3 would a single A100 40GB GPU's memory be enough for Llama2-70B.

If I use ZeRO3 alone, compared with ZeRO2-offload plus QLoRA, will the model's performance deteriorate?

LZHgrla commented 1 year ago

It seems that 40GB memory is not enough for 70B-QLoRA, even with deepspeed_zero2_offload. You can also try to reduce the length of each sample by setting max_length to 512 in your config. This will further reduce the memory requirements.


Additionally, it is worth noting that when using ZeRO3 with QLoRA, the frozen model weights will not be partitioned, so it brings hardly any memory savings. ZeRO2-offload is already the most aggressive configuration available.

It seems that only by using ZeRO3 would a single A100 40GB GPU's memory be enough for Llama2-70B.

If I use ZeRO3 alone, compared with ZeRO2-offload plus QLoRA, will the model's performance deteriorate?

A single 40GB GPU cannot perform 70B fine-tuning.

Regarding DeepSpeed, I think that adjusting the training configuration (as opposed to the training hyperparameters) will not result in a significant difference in model performance.