Open · RickMeow opened this issue 1 year ago
Hi! @RickMeow
You are using DDP, so you need to ensure that each GPU can fully load the Llama2-70B QLoRA model, which is challenging for a 40GB GPU.
- If no optimization is added, it will require approximately 55GB of memory per GPU.
- If ZeRO2 is used (--deepspeed deepspeed_zero2), it will require approximately 52GB of memory per GPU.
- If ZeRO2-offload is used (--deepspeed deepspeed_zero2_offload), less GPU memory will be used, since the optimizer states are offloaded from the GPU to the host CPU. With this optimization it may meet the training requirements of a 40GB GPU. However, I currently do not have a 40GB A100 on hand, so I am unable to provide an accurate value.
We have added a DeepSpeed config for ZeRO2-offload in https://github.com/InternLM/xtuner/pull/94; you can copy this JSON config and try it by adding --deepspeed PATH_TO_JSON_CFG to all commands.
Welcome your feedback!
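For reference, the relevant part of a ZeRO2-offload config looks roughly like the sketch below. This is an illustration only, not the exact file from the PR above (which you should prefer to copy); other fields DeepSpeed expects, such as batch sizes, are omitted for brevity. The field names follow DeepSpeed's public config schema.

# Minimal sketch of a ZeRO2-offload DeepSpeed config, written out as JSON.
# Illustration of the relevant fields only; use the config file shipped with
# xtuner (see the PR linked above) for actual training.
import json

zero2_offload = {
    "zero_optimization": {
        "stage": 2,                  # ZeRO stage 2: shard optimizer states and gradients
        "offload_optimizer": {
            "device": "cpu",         # keep optimizer states in host RAM instead of GPU memory
            "pin_memory": True
        },
        "overlap_comm": True,
        "contiguous_gradients": True
    },
    "gradient_clipping": 1.0,
    "fp16": {"enabled": True}        # mixed-precision setting; match your training dtype
}

with open("deepspeed_zero2_offload.json", "w") as f:
    json.dump(zero2_offload, f, indent=2)

# Then launch, e.g.:
#   xtuner train llama2_70b_qlora_open_platypus_e1 --deepspeed ./deepspeed_zero2_offload.json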
Thank you for your professional response!
My previous calculation only took the overall memory of all machines into account, which was indeed incorrect.
Thank you very much for uploading the DeepSpeed configuration for ZeRO2-offload. I tested it right away, but I still got an OOM error. Below are (a) the command used and (b) the error report.
(a) Command Used
NPROC_PER_NODE=8 NNODES=3 NODE_RANK=0 PORT=34545 ADDR=192.168.0.6 xtuner train llama2_70b_qlora_open_platypus_e1 --deepspeed /mnt/git/xtuner/xtuner/configs/deepspeed/deepspeed_zero2_offload.json
The other two machines use the same command with NODE_RANK=1 and NODE_RANK=2, respectively.
(b) Error Report
……
Runtime environment:
launcher: pytorch
randomness: {'seed': None, 'deterministic': False}
cudnn_benchmark: False
mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
dist_cfg: {'backend': 'nccl'}
seed: None
deterministic: False
Distributed launcher: pytorch
Distributed training: True
GPU number: 24
……
Map: 100%|███████████████████████| 24926/24926 [00:10<00:00, 2403.99 examples/s]
quantization_config convert to <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>
Loading checkpoint shards: 100%|████████████████| 15/15 [01:10<00:00, 4.68s/it]
Loading checkpoint shards: 100%|████████████████| 15/15 [01:10<00:00, 4.70s/it]
Loading checkpoint shards: 100%|████████████████| 15/15 [01:10<00:00, 4.69s/it]
Loading checkpoint shards: 100%|████████████████| 15/15 [01:10<00:00, 4.69s/it]
Loading checkpoint shards: 100%|████████████████| 15/15 [01:10<00:00, 4.69s/it]
Loading checkpoint shards: 100%|████████████████| 15/15 [01:10<00:00, 4.68s/it]
Loading checkpoint shards: 100%|████████████████| 15/15 [01:10<00:00, 4.71s/it]
Loading checkpoint shards: 100%|████████████████| 15/15 [01:10<00:00, 4.71s/it]
/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/xtuner/model/fast_forward/__init__.py:18: UserWarning: Due to the implementation of the PyTorch version of flash attention, even when the `output_attentions` flag is set to True, it is not possible to return the `attn_weights`.
warnings.warn(
……
File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1001, in <lambda>
param_applied = fn(param)
param_applied = fn(param) File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1001, in <lambda>
File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1001, in <lambda>
param_applied = fn(param)
File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1001, in <lambda>
return self._apply(lambda t: t.half() if t.is_floating_point() else t)
return self._apply(lambda t: t.half() if t.is_floating_point() else t)
torch.cuda .return self._apply(lambda t: t.half() if t.is_floating_point() else t)
return self._apply(lambda t: t.half() if t.is_floating_point() else t)OutOfMemoryErrortorch.cuda
: .return self._apply(lambda t: t.half() if t.is_floating_point() else t)torch.cudaOutOfMemoryErrorCUDA out of memory. Tried to allocate 500.00 MiB (GPU 5; 39.45 GiB total capacity; 37.14 GiB already allocated; 231.31 MiB free; 37.30 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
.:
OutOfMemoryError CUDA out of memory. Tried to allocate 500.00 MiB (GPU 3; 39.45 GiB total capacity; 37.14 GiB already allocated; 231.31 MiB free; 37.30 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF: return self._apply(lambda t: t.half() if t.is_floating_point() else t) torch.cuda
CUDA out of memory. Tried to allocate 500.00 MiB (GPU 0; 39.45 GiB total capacity; 37.14 GiB already allocated; 231.31 MiB free; 37.30 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONFreturn self._apply(lambda t: t.half() if t.is_floating_point() else t)return self._apply(lambda t: t.half() if t.is_floating_point() else t).
OutOfMemoryError
: torch.cuda.OutOfMemoryErrorCUDA out of memory. Tried to allocate 500.00 MiB (GPU 2; 39.45 GiB total capacity; 37.14 GiB already allocated; 231.31 MiB free; 37.30 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF: torch.cuda
torch.cudaCUDA out of memory. Tried to allocate 500.00 MiB (GPU 1; 39.45 GiB total capacity; 37.14 GiB already allocated; 231.31 MiB free; 37.30 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF..
OutOfMemoryErrorOutOfMemoryError: : CUDA out of memory. Tried to allocate 500.00 MiB (GPU 4; 39.45 GiB total capacity; 37.14 GiB already allocated; 231.31 MiB free; 37.30 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONFCUDA out of memory. Tried to allocate 500.00 MiB (GPU 6; 39.45 GiB total capacity; 37.14 GiB already allocated; 231.31 MiB free; 37.30 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 500.00 MiB (GPU 7; 39.45 GiB total capacity; 37.14 GiB already allocated; 231.31 MiB free; 37.30 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2718135) of binary: /mnt/anaconda/envs/xtuner/bin/python
ERROR:torch.distributed.elastic.agent.server.api:Error waiting on exit barrier. Elapsed: 307.6121425628662 seconds
Traceback (most recent call last):
File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 920, in _exit_barrier
store_util.barrier(
File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 78, in barrier
synchronize(store, data, rank, world_size, key_prefix, barrier_timeout)
File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 64, in synchronize
agent_data = get_all(store, rank, key_prefix, world_size)
File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 34, in get_all
data = store.get(f"{prefix}{idx}")
RuntimeError: Socket Timeout
Traceback (most recent call last):
File "/mnt/anaconda/envs/xtuner/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/xtuner/tools/train.py FAILED
------------------------------------------------------------
I also tested deepspeed_zero3.json, which also ran out of memory:
Traceback (most recent call last):
File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/xtuner/tools/train.py", line 225, in <module>
main()
File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/xtuner/tools/train.py", line 221, in main
runner.train()
File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1180, in train
self.strategy.prepare(
File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/mmengine/_strategy/deepspeed.py", line 176, in prepare
model = self.build_model(model)
File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/mmengine/_strategy/base.py", line 306, in build_model
model = MODELS.build(model)
File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 232, in build_model_from_cfg
return build_from_cfg(cfg, registry, default_args)
File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
obj = obj_cls(**args) # type: ignore
File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/xtuner/model/sft.py", line 35, in __init__
self._prepare_for_lora(peft_model, use_gradient_checkpointing)
File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/xtuner/model/sft.py", line 64, in _prepare_for_lora
self.llm = get_peft_model(self.llm, self.lora)
File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/peft/mapping.py", line 106, in get_peft_model
return MODEL_TYPE_TO_PEFT_MODEL_MAPPING[peft_config.task_type](model, peft_config, adapter_name=adapter_name)
File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/peft/peft_model.py", line 889, in __init__
super().__init__(model, peft_config, adapter_name)
File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/peft/peft_model.py", line 111, in __init__
self.base_model = PEFT_TYPE_TO_MODEL_MAPPING[peft_config.peft_type](
File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/peft/tuners/lora.py", line 274, in __init__
super().__init__(model, config, adapter_name)
File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 88, in __init__
self.inject_adapter(self.model, adapter_name)
File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 219, in inject_adapter
self._create_and_replace(peft_config, adapter_name, target, target_name, parent, **optionnal_kwargs)
File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/peft/tuners/lora.py", line 373, in _create_and_replace
self._replace_module(parent, target_name, new_module, target)
File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/peft/tuners/lora.py", line 390, in _replace_module
module.to(child.weight.device)
File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1145, in to
return self._apply(convert)
File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
module._apply(fn)
File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/nn/modules/module.py", line 820, in _apply
param_applied = fn(param)
File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1143, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 39.45 GiB total capacity; 36.97 GiB already allocated; 20.31 MiB free; 37.10 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
And deepspeed_zero3_offload.json ran out of memory as well:
……
Traceback (most recent call last):
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/xtuner/tools/train.py", line 225, in <module>
    main()
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/xtuner/tools/train.py", line 221, in main
    runner.train()
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1180, in train
    self.strategy.prepare(
  ……
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/mmengine/_strategy/deepspeed.py", line 196, in _wrap_model
    engine, self.optim_wrapper.optimizer, *_ = deepspeed.initialize(
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/deepspeed/__init__.py", line 171, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 261, in __init__
    self._configure_distributed_model(model)
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1066, in _configure_distributed_model
    self.module.half()
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1001, in half
    return self._apply(lambda t: t.half() if t.is_floating_point() else t)
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/nn/modules/module.py", line 820, in _apply
    param_applied = fn(param)
  File "/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1001, in <lambda>
    return self._apply(lambda t: t.half() if t.is_floating_point() else t)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 500.00 MiB (GPU 0; 39.45 GiB total capacity; 37.14 GiB already allocated; 231.31 MiB free; 37.30 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
(The same OutOfMemoryError was raised on the other local ranks as well; the interleaved per-rank tracebacks are omitted here.)
……
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/mnt/anaconda/envs/xtuner/lib/python3.10/site-packages/xtuner/tools/train.py FAILED
Thanks again, I'll follow up on this project!
It seems that 40GB of memory is not enough for 70B QLoRA, even with deepspeed_zero2_offload.
You can also try to reduce the length of each sample by setting max_length to 512 in your config. It will further reduce the memory requirements.
Additionally, it is worth noting that when using ZeRO3 with QLoRA, the frozen model will not be split, so it will hardly bring any memory optimization. ZeRO2-offload is already the most aggressive memory-saving configuration available.
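For reference, a minimal sketch of that change, assuming the standard layout of the copied xtuner config (a Python file with max_length defined near the top; the exact variable names in your copy may differ):

# In your copied config, e.g. llama2_70b_qlora_open_platypus_e1.py
# (variable names assume the usual xtuner config layout):
max_length = 512           # reduced from the default; shorter sequences mean smaller activation memory
pack_to_max_length = True  # if present, packing still fills each 512-token sample

# max_length is reused further down where the dataset is built, so changing it
# near the top is enough for this experiment.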
It seems that with ZeRO3 alone, a single A100 40GB GPU's memory is enough for Llama2-70B.
If I use only ZeRO3, compared with ZeRO2-offload plus QLoRA, will the performance of the model deteriorate?
A single 40GB GPU cannot perform 70B fine-tuning.
Regarding DeepSpeed, I think that adjusting the training configuration (as opposed to the training hyperparameters) will not result in a significant performance gap.
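As a rough sanity check (a back-of-envelope estimate, not a measured figure from this thread), the 4-bit base weights of a 70B model alone already occupy most of a 40GB card:

# Back-of-envelope estimate of the frozen Llama2-70B footprint under QLoRA.
# Approximation only: ignores quantization scales, LoRA adapters, optimizer
# states, activations, and the CUDA context, which add several more GiB.
n_params = 70e9                    # approximate parameter count of Llama2-70B
bytes_per_param = 0.5              # 4-bit (NF4) quantized weights
weights_gib = n_params * bytes_per_param / 2**30
print(f"4-bit base weights: ~{weights_gib:.1f} GiB")  # ~32.6 GiB, before any training state

That baseline is consistent with the ~37 GiB "already allocated" reported in the OOM messages above, which leaves very little headroom on a 39.45 GiB device.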
Describe
CUDA out of memory. I'm fine-tuning Llama2-70B on 3 machines with 8x A100 (40GB) each, i.e. 24x A100 (40GB) in total. At first the reported error looked like a plain out-of-memory issue, but according to my calculations enough memory should have been available.
To Reproduce
System info
ERROR record