AlibabaResearch / DAMO-ConvAI

DAMO-ConvAI: The official repository which contains the codebase for Alibaba DAMO Conversational AI.
MIT License

OOM during PRO training #135

Open Zheng-Jay opened 3 months ago

Zheng-Jay commented 3 months ago

Hi, I get an OOM error when running the PRO training code. I'm training a 13B model on 80 GB A800 GPUs, so in theory it shouldn't run out of memory. Even after setting batch size to 1 and block_size to 100 it still OOMs, and I can't figure out where the problem is. train_hh.sh:

export OMP_NUM_THREADS=16
root_dir=..

#stage 23
id=$1
data_path=$2
ranking_len=$3
mkdir -p $root_dir/logs/$id/$ranking_len
    # --main_process_port 29534 \
CUDA_VISIBLE_DEVICES=4,5,7 accelerate launch --num_processes 2 --config_file ds_config.yaml --main_process_port=29534 main.py \
    --task hh \
    --train_file_path $root_dir/data/${data_path} \
    --validation_file_path $root_dir/data/hh_dev \
    --validation_file_name sampled_dev.json \
    --output_dir $root_dir/checkpoints/index_$id/stage_$ranking_len \
    --log_path $root_dir/logs/$id/$ranking_len \
    --index $id \
    --seed 42 \
    --temperature 1 \
    --sft_weight 0.05 \
    --num_train_epochs 2 \
    --training_stage_num $ranking_len \
    --block_size 100 \
    --learning_rate 5e-6 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --model_name_or_path /mnt/data2/finLLM/models/tigerbot-13b-base \
    --do_train \
    --do_validation > $root_dir/logs/$id/$ranking_len/train_detail.log 2>&1

Log:

The following values were not passed to `accelerate launch` and had defaults used instead:
    `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/accelerate/utils/dataclasses.py:541: UserWarning: DeepSpeed Zero3 Init flag is only applicable for ZeRO Stage 3. Setting it to False.
  warnings.warn("DeepSpeed Zero3 Init flag is only applicable for ZeRO Stage 3. Setting it to False.")
/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/accelerate/utils/dataclasses.py:541: UserWarning: DeepSpeed Zero3 Init flag is only applicable for ZeRO Stage 3. Setting it to False.
  warnings.warn("DeepSpeed Zero3 Init flag is only applicable for ZeRO Stage 3. Setting it to False.")
/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/accelerate/utils/dataclasses.py:541: UserWarning: DeepSpeed Zero3 Init flag is only applicable for ZeRO Stage 3. Setting it to False.
  warnings.warn("DeepSpeed Zero3 Init flag is only applicable for ZeRO Stage 3. Setting it to False.")

task                                    : hh
do_train                                : True
do_validation                           : True
sft_weight                              : 0.05
index                                   : exp001
seed                                    : 42
temperature                             : 1.0
training_stage_num                      : 2
train_file_path                         : ../data/hh_train_len2
validation_file_path                    : ../data/hh_dev
validation_file_name                    : sampled_dev.json
model_name_or_path                      : /mnt/data2/finLLM/models/tigerbot-13b-base
per_device_train_batch_size             : 1
per_device_eval_batch_size              : 1
learning_rate                           : 5e-06
block_size                              : 100
num_train_epochs                        : 2
max_train_steps                         : None
gradient_accumulation_steps             : 8
output_dir                              : ../checkpoints/index_exp001/stage_2
checkpointing_step                      : 600
log_path                                : ../logs/exp001/2

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
Loading checkpoint shards:  33%|███▎      | 1/3 [00:19<00:39, 19.66s/it]
Loading checkpoint shards:  33%|███▎      | 1/3 [00:21<00:42, 21.21s/it]
Loading checkpoint shards:  33%|███▎      | 1/3 [00:21<00:43, 21.95s/it]
Loading checkpoint shards:  67%|██████▋   | 2/3 [00:35<00:17, 17.49s/it]
Loading checkpoint shards:  67%|██████▋   | 2/3 [00:42<00:21, 21.40s/it]
Loading checkpoint shards:  67%|██████▋   | 2/3 [00:44<00:22, 22.51s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [00:46<00:00, 14.24s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [00:46<00:00, 15.34s/it]

Loading checkpoint shards: 100%|██████████| 3/3 [00:57<00:00, 18.19s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [00:57<00:00, 19.11s/it]

Loading checkpoint shards: 100%|██████████| 3/3 [00:58<00:00, 18.65s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [00:58<00:00, 19.57s/it]
[info] args:
Namespace(task='hh', do_train=True, do_validation=True, sft_weight=0.05, index='exp001', seed=42, temperature=1.0, training_stage_num=2, train_file_path='../data/hh_train_len2', validation_file_path='../data/hh_dev', validation_file_name='sampled_dev.json', model_name_or_path='/mnt/data2/finLLM/models/tigerbot-13b-base', per_device_train_batch_size=1, per_device_eval_batch_size=1, learning_rate=5e-06, block_size=100, num_train_epochs=2, max_train_steps=None, gradient_accumulation_steps=8, output_dir='../checkpoints/index_exp001/stage_2', checkpointing_step=600, log_path='../logs/exp001/2')
[info] args:
Namespace(task='hh', do_train=True, do_validation=True, sft_weight=0.05, index='exp001', seed=42, temperature=1.0, training_stage_num=2, train_file_path='../data/hh_train_len2', validation_file_path='../data/hh_dev', validation_file_name='sampled_dev.json', model_name_or_path='/mnt/data2/finLLM/models/tigerbot-13b-base', per_device_train_batch_size=1, per_device_eval_batch_size=1, learning_rate=5e-06, block_size=100, num_train_epochs=2, max_train_steps=None, gradient_accumulation_steps=8, output_dir='../checkpoints/index_exp001/stage_2', checkpointing_step=600, log_path='../logs/exp001/2')[info] args:
Namespace(task='hh', do_train=True, do_validation=True, sft_weight=0.05, index='exp001', seed=42, temperature=1.0, training_stage_num=2, train_file_path='../data/hh_train_len2', validation_file_path='../data/hh_dev', validation_file_name='sampled_dev.json', model_name_or_path='/mnt/data2/finLLM/models/tigerbot-13b-base', per_device_train_batch_size=1, per_device_eval_batch_size=1, learning_rate=5e-06, block_size=100, num_train_epochs=2, max_train_steps=None, gradient_accumulation_steps=8, output_dir='../checkpoints/index_exp001/stage_2', checkpointing_step=600, log_path='../logs/exp001/2')

[2024-03-20 02:13:02,161] [INFO] [logging.py:75:log_dist] [Rank -1] DeepSpeed info: version=0.8.1, git-hash=unknown, git-branch=unknown
[2024-03-20 02:13:02,271] [INFO] [logging.py:75:log_dist] [Rank -1] DeepSpeed info: version=0.8.1, git-hash=unknown, git-branch=unknown
[2024-03-20 02:13:02,299] [INFO] [logging.py:75:log_dist] [Rank -1] DeepSpeed info: version=0.8.1, git-hash=unknown, git-branch=unknown
[2024-03-20 02:14:17,786] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2024-03-20 02:14:17,787] [INFO] [logging.py:75:log_dist] [Rank 0] Removing param_group that has no 'params' in the client Optimizer
[2024-03-20 02:14:17,787] [INFO] [logging.py:75:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2024-03-20 02:14:17,831] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed Basic Optimizer = AdamW
[2024-03-20 02:14:17,831] [INFO] [utils.py:53:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
[2024-03-20 02:14:17,832] [INFO] [logging.py:75:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 2 optimizer
[2024-03-20 02:14:17,832] [INFO] [stage_1_and_2.py:145:__init__] Reduce bucket size 500,000,000
[2024-03-20 02:14:17,832] [INFO] [stage_1_and_2.py:146:__init__] Allgather bucket size 500,000,000
[2024-03-20 02:14:17,832] [INFO] [stage_1_and_2.py:147:__init__] CPU Offload: False
[2024-03-20 02:14:17,832] [INFO] [stage_1_and_2.py:148:__init__] Round robin gradient partitioning: False
Using /home/zhengmingjie/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Using /home/zhengmingjie/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Using /home/zhengmingjie/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Emitting ninja build file /home/zhengmingjie/.cache/torch_extensions/py39_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.7705831527709961 seconds
Loading extension module utils...
Time to load utils op: 0.7298614978790283 seconds
Loading extension module utils...
Time to load utils op: 0.7109990119934082 seconds
Rank: 0 partition count [3] and sizes[(4435952640, False)] 
Rank: 2 partition count [3] and sizes[(4435952640, False)] 
Rank: 1 partition count [3] and sizes[(4435952640, False)] 
[2024-03-20 02:15:34,752] [INFO] [utils.py:826:see_memory_usage] Before initializing optimizer states
[2024-03-20 02:15:34,753] [INFO] [utils.py:827:see_memory_usage] MA 41.35 GB         Max_MA 49.61 GB         CA 49.62 GB         Max_CA 50 GB 
[2024-03-20 02:15:34,754] [INFO] [utils.py:835:see_memory_usage] CPU Virtual Memory:  used = 98.89 GB, percent = 9.8%
Traceback (most recent call last):
  File "/mnt/data2/finLLM/DAMO-ConvAI-main/PRO/train/main.py", line 45, in <module>
    model = process_manager.train()
  File "/mnt/data2/finLLM/DAMO-ConvAI-main/PRO/train/utils/process_manager.py", line 182, in train
    model, optimizer, dataset_length = self.init_prepare_train(
  File "/mnt/data2/finLLM/DAMO-ConvAI-main/PRO/train/utils/process_manager.py", line 165, in init_prepare_train
    model, optimizer, _ = self.accelerator.prepare(
  File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/accelerate/accelerator.py", line 1090, in prepare
    result = self._prepare_deepspeed(*args)
  File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/accelerate/accelerator.py", line 1368, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/__init__.py", line 125, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 336, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1292, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1542, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer(
  File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 524, in __init__
    self.initialize_optimizer_states()
  File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 649, in initialize_optimizer_states
    self.optimizer.step()
  File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/torch/optim/optimizer.py", line 280, in wrapper
    out = func(*args, **kwargs)
  File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/torch/optim/optimizer.py", line 33, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/torch/optim/adamw.py", line 160, in step
    self._init_group(
  File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/torch/optim/adamw.py", line 114, in _init_group
    state["exp_avg"] = torch.zeros_like(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.53 GiB (GPU 0; 79.15 GiB total capacity; 57.88 GiB already allocated; 5.21 GiB free; 57.88 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/mnt/data2/finLLM/DAMO-ConvAI-main/PRO/train/main.py", line 45, in <module>
    model = process_manager.train()
  File "/mnt/data2/finLLM/DAMO-ConvAI-main/PRO/train/utils/process_manager.py", line 182, in train
    model, optimizer, dataset_length = self.init_prepare_train(
  File "/mnt/data2/finLLM/DAMO-ConvAI-main/PRO/train/utils/process_manager.py", line 165, in init_prepare_train
    model, optimizer, _ = self.accelerator.prepare(
  File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/accelerate/accelerator.py", line 1090, in prepare
    result = self._prepare_deepspeed(*args)
  File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/accelerate/accelerator.py", line 1368, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/__init__.py", line 125, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 336, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1292, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1542, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer(
  File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 524, in __init__
    self.initialize_optimizer_states()
  File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 649, in initialize_optimizer_states
    self.optimizer.step()
  File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/torch/optim/optimizer.py", line 280, in wrapper
    out = func(*args, **kwargs)
  File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/torch/optim/optimizer.py", line 33, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/torch/optim/adamw.py", line 160, in step
    self._init_group(
  File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/torch/optim/adamw.py", line 118, in _init_group
    state["exp_avg_sq"] = torch.zeros_like(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.53 GiB (GPU 1; 79.15 GiB total capacity; 74.40 GiB already allocated; 4.13 GiB free; 74.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/mnt/data2/finLLM/DAMO-ConvAI-main/PRO/train/main.py", line 45, in <module>
    model = process_manager.train()
  File "/mnt/data2/finLLM/DAMO-ConvAI-main/PRO/train/utils/process_manager.py", line 182, in train
    model, optimizer, dataset_length = self.init_prepare_train(
  File "/mnt/data2/finLLM/DAMO-ConvAI-main/PRO/train/utils/process_manager.py", line 165, in init_prepare_train
    model, optimizer, _ = self.accelerator.prepare(
  File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/accelerate/accelerator.py", line 1090, in prepare
    result = self._prepare_deepspeed(*args)
  File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/accelerate/accelerator.py", line 1368, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/__init__.py", line 125, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 336, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1292, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1542, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer(
  File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 524, in __init__
    self.initialize_optimizer_states()
  File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 649, in initialize_optimizer_states
    self.optimizer.step()
  File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/torch/optim/optimizer.py", line 280, in wrapper
    out = func(*args, **kwargs)
  File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/torch/optim/optimizer.py", line 33, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/torch/optim/adamw.py", line 160, in step
    self._init_group(
  File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/torch/optim/adamw.py", line 118, in _init_group
    state["exp_avg_sq"] = torch.zeros_like(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.53 GiB (GPU 2; 79.15 GiB total capacity; 74.40 GiB already allocated; 4.18 GiB free; 74.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 92834 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 92835 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 92833) of binary: /mnt/data2/miniconda3/envs/pro/bin/python
Traceback (most recent call last):
  File "/mnt/data2/miniconda3/envs/pro/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/accelerate/commands/launch.py", line 900, in launch_command
    deepspeed_launcher(args)
  File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/accelerate/commands/launch.py", line 643, in deepspeed_launcher
    distrib_run.run(args)
  File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/data2/miniconda3/envs/pro/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
main.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-20_02:15:40
  host      : oem
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 92833)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
F2-Song commented 3 months ago

hello, I also replied to you in #69. Looking at this log, the OOM occurs at the accelerator's prepare step, which is where the LLM is distributed to each GPU via DeepSpeed. Since your command sets num_processes=2, i.e. 2 GPUs are used to train the LLM, the problem should not be in the training loop itself (and is therefore unrelated to batch_size and block_size). You could try the following and see whether it helps:

  1. Run a blank script that only calls accelerator.prepare to initialize your LLM checkpoint, and check whether the OOM error can be reproduced.
  2. If step 1 still OOMs, a single GPU cannot hold the 13B model (possibly because ZeRO-2 places a full copy of the model parameters on every GPU; if the model is loaded in a high precision, that can trigger OOM). You could switch to ZeRO-3 and a lower precision (in the init of process_manager.py, add a dtype directly to from_pretrained, as sketched below).
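
A rough illustration of point 2 (not the repository's actual code; the model path is just the one from this issue, and in PRO the change would go into the existing from_pretrained call in process_manager.py):

import torch
from transformers import AutoModelForCausalLM

# Loading in bf16 keeps the initial per-rank copy of a 13B model at roughly 26 GB
# instead of ~52 GB in fp32, before DeepSpeed partitions anything.
model = AutoModelForCausalLM.from_pretrained(
    "/mnt/data2/finLLM/models/tigerbot-13b-base",
    torch_dtype=torch.bfloat16,
)
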
Zheng-Jay commented 3 months ago

> (quoting F2-Song's reply above)

Hi, thanks for the reply. I tried what you suggested and used accelerator.prepare to initialize the 13B model with the code below (generated with GPT, so I'm not sure whether it's correct):

import os
import time
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM, AutoTokenizer

def main():
    # Initialize accelerator
    accelerator = Accelerator()
    path = "/mnt/data2/finLLM/models/tigerbot-13b-base"
    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(path)

    # Load model
    model = AutoModelForCausalLM.from_pretrained(path)

    # Prepare model with accelerator
    model, _, = accelerator.prepare(model, tokenizer)  # Removed the unnecessary unpacking

    # Free up CUDA memory
    torch.cuda.empty_cache()

    # Pause execution for 5 minutes
    print("Model loaded successfully. Pausing execution for 5 minutes.")
    time.sleep(300)  # 300 seconds = 5 minutes

if __name__ == "__main__":
    main()

No GPU OOM occurs; a single A800 uses about 50 GB. The model's precision is "torch_dtype": "bfloat16", which seems fairly common? So I don't think the cause is the model being too large or the precision being too high; I have previously trained this model in another open-source training project, also with ZeRO-2 and 16-bit precision, and never hit OOM there. Also, could library versions matter? While setting up this project I ran into bugs and upgraded several packages:

Package                 Version
----------------------- -----------
absl-py                 2.1.0
accelerate              0.17.1
aiohttp                 3.9.3
aiosignal               1.3.1
annotated-types         0.6.0
async-timeout           4.0.3
attrs                   23.2.0
certifi                 2024.2.2
charset-normalizer      2.0.4
click                   8.1.7
datasets                2.18.0
deepspeed               0.8.1
dill                    0.3.6
evaluate                0.4.1
filelock                3.13.1
frozenlist              1.4.1
fsspec                  2024.2.0
grpcio                  1.62.1
hjson                   3.1.0
huggingface-hub         0.21.4
idna                    3.4
importlib_metadata      7.0.2
joblib                  1.3.2
Markdown                3.6
MarkupSafe              2.1.5
mkl-fft                 1.3.8
mkl-random              1.2.4
mkl-service             2.4.0
multidict               6.0.5
multiprocess            0.70.14
ninja                   1.11.1.1
nltk                    3.8.1
numpy                   1.22.2
packaging               24.0
pandas                  2.0.3
peft                    0.3.0
pillow                  10.2.0
pip                     23.3.1
protobuf                5.26.0
psutil                  5.9.8
py-cpuinfo              9.0.0
pyarrow                 15.0.2
pyarrow-hotfix          0.6
pydantic                1.10.9
pydantic_core           2.16.3
python-dateutil         2.9.0.post0
pytz                    2024.1
PyYAML                  6.0.1
regex                   2023.12.25
requests                2.31.0
responses               0.18.0
rouge-score             0.1.2
scipy                   1.11.1
sentencepiece           0.2.0
setuptools              68.2.2
six                     1.16.0
tensorboard             2.16.2
tensorboard-data-server 0.7.2
tokenizers              0.13.3
torch                   1.13.1
torchaudio              0.13.1
torchvision             0.14.1
tqdm                    4.64.1
transformers            4.28.1
typing_extensions       4.9.0
tzdata                  2024.1
urllib3                 2.1.0
Werkzeug                3.0.1
wheel                   0.41.2
xxhash                  3.4.1
yarl                    1.9.4
zipp                    3.18.1
F2-Song commented 3 months ago

> (quoting Zheng-Jay's reply above)

Hi, thanks for the detailed information. The script you posted probably did not use DeepSpeed at runtime and only specified 1 GPU, which is effectively no different from not using accelerate at all, so it is still not quite the same environment as PRO (with DeepSpeed, prepare should require a dataloader to be passed in, which is why we set up a placeholder_dataloader in the code). You could try adding the following code in PRO's process_manager.py, right after accelerator.prepare:

if accelerator.wait_for_everyone():
    exit()

and check whether it still OOMs by that point, then plan the next debugging step based on the result.

If bf16 only takes about 50 GB of GPU memory, it really should not OOM at the prepare stage. You could also try specifying torch_dtype=torch.bfloat16 directly in AutoModelForCausalLM.from_pretrained().
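
For reference, a rough sketch of what a DeepSpeed-aware version of that test could look like. It is only illustrative (the placeholder dataloader, learning rate, and print message are assumptions, not the project's code), and it would need to be started with accelerate launch --config_file ds_config.yaml so that DeepSpeed is actually active:

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator
from transformers import AutoModelForCausalLM

def main():
    accelerator = Accelerator()
    path = "/mnt/data2/finLLM/models/tigerbot-13b-base"

    # Load directly in bf16, as suggested above.
    model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16)
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)

    # With DeepSpeed enabled, prepare also expects a dataloader; a tiny placeholder is enough.
    placeholder_dataloader = DataLoader(TensorDataset(torch.zeros(8, 1)), batch_size=1)

    model, optimizer, _ = accelerator.prepare(model, optimizer, placeholder_dataloader)
    accelerator.wait_for_everyone()
    accelerator.print("prepare finished without OOM")

if __name__ == "__main__":
    main()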

Zheng-Jay commented 3 months ago

> (quoting the exchange above)

I'm not very familiar with the training code, but thanks for the suggestions. Here are the results of my attempts:

1. Adding the code you provided directly in process_manager.py:

        model, optimizer, _ = self.accelerator.prepare(
            self.model, optimizer, placeholder_dataloader
        )
        if self.accelerator.wait_for_everyone():
            print("[info] self.accelerator.wait_for_everyone() True")
            exit()

Running on 3 GPUs, all 3 GPUs OOMed in prepare():

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.53 GiB (GPU 1; 79.15 GiB total capacity; 74.40 GiB already allocated; 4.02 GiB free; 74.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
    state['exp_avg_sq'] = torch.zeros_like(p, memory_format=torch.preserve_format)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.53 GiB (GPU 0; 79.15 GiB total capacity; 74.40 GiB already allocated; 3.99 GiB free; 74.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
    state['exp_avg_sq'] = torch.zeros_like(p, memory_format=torch.preserve_format)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.53 GiB (GPU 2; 79.15 GiB total capacity; 74.40 GiB already allocated; 4.03 GiB free; 74.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

2. Manually setting the precision: self.model = AutoModelForCausalLM.from_pretrained(self.model_path, config=self.model_config, torch_dtype=torch.bfloat16). All 3 GPUs still OOMed:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.53 GiB (GPU 2; 79.15 GiB total capacity; 74.40 GiB already allocated; 4.03 GiB free; 74.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.53 GiB (GPU 1; 79.15 GiB total capacity; 74.40 GiB already allocated; 4.02 GiB free; 74.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.53 GiB (GPU 0; 79.15 GiB total capacity; 74.40 GiB already allocated; 3.99 GiB free; 74.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

The amount of memory it tried to allocate is the same as before the change, so precision can probably be ruled out.

3. Disabling do_validation & upgrading deepspeed. I also saw #69 suggest disabling do_validation; it still OOMed. Switching DeepSpeed to ZeRO-3 complained that the version was too old, so I upgraded deepspeed:

Found existing installation: deepspeed 0.8.1
Uninstalling deepspeed-0.8.1:
Successfully uninstalled deepspeed-0.8.1
Successfully installed deepspeed-0.14.0 pynvml-11.5.0

After the upgrade, initialization no longer errors, and neither does switching back to ZeRO-2, so it looks like a deepspeed version issue. Why would that be? But now it OOMs in the train loop, and even after increasing to 8 GPUs it still OOMs:

  0%|          | 0/5026 [00:00<?, ?it/s]
Epoch 0 starts
Load training data from ../data/hh_train_len2/train.json
  0%|          | 1/5026 [01:13<102:21:28, 73.33s/it]Traceback (most recent call last):
  File "/share/home/wenqingchen/finLLM/DAMO-ConvAI-main/PRO/train/main.py", line 45, in <module>
    model = process_manager.train()
  File "/share/home/wenqingchen/finLLM/DAMO-ConvAI-main/PRO/train/utils/process_manager.py", line 261, in train
    self.compute_loss(model, batch, print_loss)
  File "/share/home/wenqingchen/finLLM/DAMO-ConvAI-main/PRO/train/utils/process_manager.py", line 111, in compute_loss
    self.accelerator.backward(total_loss)
  File "/share/home/wenqingchen/miniconda3/envs/pro/lib/python3.9/site-packages/accelerate/accelerator.py", line 1630, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/share/home/wenqingchen/miniconda3/envs/pro/lib/python3.9/site-packages/accelerate/utils/deepspeed.py", line 167, in backward
    self.engine.backward(loss, **kwargs)
  File "/share/home/wenqingchen/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/share/home/wenqingchen/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1976, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/share/home/wenqingchen/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2051, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/share/home/wenqingchen/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/share/home/wenqingchen/miniconda3/envs/pro/lib/python3.9/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/share/home/wenqingchen/miniconda3/envs/pro/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 592.00 MiB (GPU 4; 79.15 GiB total capacity; 76.09 GiB already allocated; 269.31 MiB free; 78.25 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

The ZeRO-3 config is as follows:

compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 8
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3  # changed to use ZeRO-3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
use_cpu: false

I originally had CPU offload enabled, but DeepSpeed errored out insisting on its official optimizer rather than the torch optimizer in the PRO code, and I don't know how to change that... That said, with bs=1, block_size 512 and a 13B model, this setup shouldn't need that much GPU memory. Could some other library version still be wrong?

F2-Song commented 3 months ago

> (quoting the exchange above)

Thanks for the detailed description. Replying point by point:

  1. The optimizer is defined at line 153 of process_manager.py. The versions we use are exactly the ones listed in requirements.txt in the top-level directory. If another version requires changing the optimizer, you can do it right there (see the sketch after this list), though I am not sure whether this (e.g. using a different optimizer) is supported by accelerate. (The AdamW currently used is indeed fairly memory-hungry.)
  2. The implementation does not directly rely on any advanced DeepSpeed features, so it should not be sensitive to package versions; as long as it runs successfully after the upgrade, that is fine.
  3. As for the setup you mention (bs=1, block_size 512, 13B model), I currently do not have an 8-GPU machine available, so I cannot try to reproduce it. You could set ranking_len=1 in the sh script and check whether training runs normally (this is equivalent to SFT). We used to apply a simple rule of thumb: for example, bs=1, ranking_len=2 and bs=2, ranking_len=1 should need roughly the same resources.
  4. Sorry that my description of do_validation in #69 was not clear enough. The option itself has nothing to do with GPU memory; rather, when enabled it takes the last of the visible GPUs to host the reward model for validation. So, assuming an 8-GPU machine, after disabling do_validation you also need to change --num_processes 7 in the sh script, e.g. to 8. Changing it only in ds_config.yaml has no effect, because a value given directly on the command line takes priority (alternatively, after turning off do_validation you can simply delete --num_processes 7 from the command and control the number of GPUs via ds_config.yaml).
  5. As for prepare succeeding after upgrading deepspeed, I do not know the reason either. I notice that the tigerbot-13b-base you are using is based on transformers 4.31.0, and later transformers versions did change the llama implementation. So you might try upgrading all packages to fairly recent versions; as mentioned above, PRO's implementation should not be sensitive to specific versions as long as the packages are compatible with each other.
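
On the CPU offload point: if offload_optimizer_device is set back to cpu, the error about the official optimizer is usually resolved by constructing DeepSpeed's CPU Adam at that line instead of torch.optim.AdamW. A rough sketch only (I have not verified this combination with accelerate, and the names below are illustrative rather than the exact ones in process_manager.py):

import torch
from deepspeed.ops.adam import DeepSpeedCPUAdam
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "/mnt/data2/finLLM/models/tigerbot-13b-base", torch_dtype=torch.bfloat16
)

# torch.optim.AdamW keeps its exp_avg / exp_avg_sq states on the GPU; with
# offload_optimizer_device: cpu, DeepSpeed expects DeepSpeedCPUAdam, which
# keeps those optimizer states in host RAM instead.
optimizer = DeepSpeedCPUAdam(model.parameters(), lr=5e-6)
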
Zheng-Jay commented 3 months ago

hello,在 #69 中也已回复您。 经看这段log,OOM出现在accelerator的prepare处,此处是将LLM通过DeepSpeed分发到每张卡上。因看到您在command在设置num_process=2,即2张卡用于训练LLM,所以问题应不是出在训练过程上(因此与batch_size和block_size无关)。您可以尝试以下方法是否有效:

  1. 尝试用一个空白脚本,只包括使用accelerator.prepare来初始化您的LLM checkpoint,观察是否能复现OOM的报错
  2. 如果1仍会OOM,说明单卡装不下13B的模型(可能因为zero-2会在每张卡上都放置一份完整的模型参数,比如您设置的模型精度较高,即可能出现OOM),可以尝试改用zero-3和更低精度(可在process_manager.py的init中,直接在from_pretrained里添加dtype)

您好,感谢回复。 我尝试您说的用accelerator.prepare来初始化13B的模型,代码如下(用gpt生成的,不知是否有误):

import os
import time
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM, AutoTokenizer

def main():
    # Initialize accelerator
    accelerator = Accelerator()
    path = "/mnt/data2/finLLM/models/tigerbot-13b-base"
    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(path)

    # Load model
    model = AutoModelForCausalLM.from_pretrained(path)

    # Prepare model with accelerator
    model, _, = accelerator.prepare(model, tokenizer)  # Removed the unnecessary unpacking

    # Free up CUDA memory
    torch.cuda.empty_cache()

    # Pause execution for 5 minutes
    print("Model loaded successfully. Pausing execution for 5 minutes.")
    time.sleep(300)  # 300 seconds = 5 minutes

if __name__ == "__main__":
    main()

GPU不会爆OOM,单卡A800占大概50G,另外模型的精度如下: "torch_dtype": "bfloat16", 这个精度似乎是比较常见的?我觉得不是模型过大或者精度较高的原因,之前有在别人的开源训练项目上有用这个模型去训练,也是用zero-2+16位精度,不会出现OOM的问题。 另外,不知道库的版本会不会影响? 我在部署本项目时,有遇到bug,更新了部分库的版本:

Package                 Version
----------------------- -----------
absl-py                 2.1.0
accelerate              0.17.1
aiohttp                 3.9.3
aiosignal               1.3.1
annotated-types         0.6.0
async-timeout           4.0.3
attrs                   23.2.0
certifi                 2024.2.2
charset-normalizer      2.0.4
click                   8.1.7
datasets                2.18.0
deepspeed               0.8.1
dill                    0.3.6
evaluate                0.4.1
filelock                3.13.1
frozenlist              1.4.1
fsspec                  2024.2.0
grpcio                  1.62.1
hjson                   3.1.0
huggingface-hub         0.21.4
idna                    3.4
importlib_metadata      7.0.2
joblib                  1.3.2
Markdown                3.6
MarkupSafe              2.1.5
mkl-fft                 1.3.8
mkl-random              1.2.4
mkl-service             2.4.0
multidict               6.0.5
multiprocess            0.70.14
ninja                   1.11.1.1
nltk                    3.8.1
numpy                   1.22.2
packaging               24.0
pandas                  2.0.3
peft                    0.3.0
pillow                  10.2.0
pip                     23.3.1
protobuf                5.26.0
psutil                  5.9.8
py-cpuinfo              9.0.0
pyarrow                 15.0.2
pyarrow-hotfix          0.6
pydantic                1.10.9
pydantic_core           2.16.3
python-dateutil         2.9.0.post0
pytz                    2024.1
PyYAML                  6.0.1
regex                   2023.12.25
requests                2.31.0
responses               0.18.0
rouge-score             0.1.2
scipy                   1.11.1
sentencepiece           0.2.0
setuptools              68.2.2
six                     1.16.0
tensorboard             2.16.2
tensorboard-data-server 0.7.2
tokenizers              0.13.3
torch                   1.13.1
torchaudio              0.13.1
torchvision             0.14.1
tqdm                    4.64.1
transformers            4.28.1
typing_extensions       4.9.0
tzdata                  2024.1
urllib3                 2.1.0
Werkzeug                3.0.1
wheel                   0.41.2
xxhash                  3.4.1
yarl                    1.9.4
zipp                    3.18.1

您好,感谢您提供的详细信息。您回复的这段代码在运行时应该没有使用DeepSpeed,且指定1张显卡,这样会与直接不使用accelerate在效果上没有区别,所以和PRO的运行环境还不完全一样(因为使用DeepSpeed的话,应该会要求prepare时必须传入dataloader,这也是我们在代码里设置了一个placeholder_dataloader的原因)。您可尝试在PRO的代码process_manager.py中,于accelerator.prepare后直接添加如下代码:

if accelerator. wait_for_everyone():
    exit()

并观察运行至此时是否还会OOM,据此再考虑下一步debug计划。 如果bf16的显存占用是50G左右,确实不应在prepare阶段就OOM。您或可考虑在AutoModelForCausalLM.from_pretrained()中直接指定torch_dtype=torch.bfloat16试一下?

对训练代码不太懂,感谢您的建议,尝试结果如下: 1、直接在process_manager.py中添加您提供的代码:

        model, optimizer, _ = self.accelerator.prepare(
            self.model, optimizer, placeholder_dataloader
        )
        if self.accelerator.wait_for_everyone():
            print("[info] self.accelerator.wait_for_everyone() True")
            exit()

在3个GPU上运行,结果3个GPU都在prepare()爆了

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.53 GiB (GPU 1; 79.15 GiB total capacity; 74.40 GiB already allocated; 4.02 GiB free; 74.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
    state['exp_avg_sq'] = torch.zeros_like(p, memory_format=torch.preserve_format)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.53 GiB (GPU 0; 79.15 GiB total capacity; 74.40 GiB already allocated; 3.99 GiB free; 74.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
    state['exp_avg_sq'] = torch.zeros_like(p, memory_format=torch.preserve_format)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.53 GiB (GPU 2; 79.15 GiB total capacity; 74.40 GiB already allocated; 4.03 GiB free; 74.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

2、手动设置精度: self.model = AutoModelForCausalLM.from_pretrained(self.model_path,config=self.model_config, torch_dtype=torch.bfloat16) 3张卡还是爆了:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.53 GiB (GPU 2; 79.15 GiB total capacity; 74.40 GiB already allocated; 4.03 GiB free; 74.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.53 GiB (GPU 1; 79.15 GiB total capacity; 74.40 GiB already allocated; 4.02 GiB free; 74.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.53 GiB (GPU 0; 79.15 GiB total capacity; 74.40 GiB already allocated; 3.99 GiB free; 74.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

The allocation it failed on is the same size as before the change, so precision can probably be ruled out. 3. Disabled do_validation & upgraded deepspeed. I also saw #69 suggest turning off do_validation; it still OOMed. Switching deepspeed to zero3 complained that the version was too old, so I upgraded deepspeed:

Found existing installation: deepspeed 0.8.1
Uninstalling deepspeed-0.8.1:
Successfully uninstalled deepspeed-0.8.1
Successfully installed deepspeed-0.14.0 pynvml-11.5.0

After the upgrade, initialization no longer reports an error, and neither does switching back to zero2, so it looks like a deepspeed version issue; why would that be? But it OOMs in the train loop, and it still OOMs after increasing to 8 GPUs:

  0%|          | 0/5026 [00:00<?, ?it/s]
Epoch 0 starts
Load training data from ../data/hh_train_len2/train.json
  0%|          | 1/5026 [01:13<102:21:28, 73.33s/it]Traceback (most recent call last):
  File "/share/home/wenqingchen/finLLM/DAMO-ConvAI-main/PRO/train/main.py", line 45, in <module>
    model = process_manager.train()
  File "/share/home/wenqingchen/finLLM/DAMO-ConvAI-main/PRO/train/utils/process_manager.py", line 261, in train
    self.compute_loss(model, batch, print_loss)
  File "/share/home/wenqingchen/finLLM/DAMO-ConvAI-main/PRO/train/utils/process_manager.py", line 111, in compute_loss
    self.accelerator.backward(total_loss)
  File "/share/home/wenqingchen/miniconda3/envs/pro/lib/python3.9/site-packages/accelerate/accelerator.py", line 1630, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/share/home/wenqingchen/miniconda3/envs/pro/lib/python3.9/site-packages/accelerate/utils/deepspeed.py", line 167, in backward
    self.engine.backward(loss, **kwargs)
  File "/share/home/wenqingchen/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/share/home/wenqingchen/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1976, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/share/home/wenqingchen/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2051, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/share/home/wenqingchen/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/share/home/wenqingchen/miniconda3/envs/pro/lib/python3.9/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/share/home/wenqingchen/miniconda3/envs/pro/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 592.00 MiB (GPU 4; 79.15 GiB total capacity; 76.09 GiB already allocated; 269.31 MiB free; 78.25 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

The zero3 config is as follows:

compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 8
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3  # switched to ZeRO-3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
use_cpu: false

I originally had CPU offload turned on, but deepspeed errored out and insisted on its own official optimizer rather than the torch optimizer in the PRO code, and I don't know how to make that change... That said, with bs=1, block_size 512 and a 13B model, this configuration shouldn't be that demanding on GPU memory, should it? Could some other library version still be wrong?
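For reference, a minimal sketch of what such an optimizer swap might look like, assuming the optimizer is built directly from the model parameters as in process_manager.py (this is not the repository's actual code and was not tested in this thread):

from deepspeed.ops.adam import DeepSpeedCPUAdam

# `model` stands for the already-loaded AutoModelForCausalLM. This is a hypothetical
# drop-in for the torch.optim.AdamW built in process_manager.py; ZeRO optimizer offload
# to CPU requires a DeepSpeed-native optimizer such as DeepSpeedCPUAdam.
optimizer = DeepSpeedCPUAdam(
    model.parameters(),
    lr=5e-6,               # matches --learning_rate in train_hh.sh
    weight_decay=0.0,      # assumed; the repository's actual setting may differ
)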

Thanks for the detailed write-up; replying point by point:

  1. The optimizer is defined at line 153 of process_manager.py. The versions we use are exactly the ones listed in requirements.txt at the top level. If another version requires changing the optimizer, you can change it right there, though I'm not sure whether that (e.g. using a different optimizer) is supported by accelerate. (The AdamW currently in use is indeed fairly memory-hungry; a rough per-GPU estimate is sketched after this list.)
  2. The implementation doesn't directly use any advanced deepspeed features, so it should not be sensitive to package versions; as long as it runs after the upgrade, that's fine.
  3. As for the bs=1, block_size 512, 13B-model setting you mention, I currently don't have an 8-GPU machine available, so I can't try to reproduce it. You could try setting ranking_len=1 in the sh script and see whether training runs normally (that is equivalent to SFT). We used to have a simple rule of thumb: for example, bs=1, ranking_len=2 and bs=2, ranking_len=1 should need roughly the same resources.
  4. Apologies, my description of do_validation in pro #69 wasn't clear enough. The option itself has nothing to do with GPU memory; rather, when it is enabled the last of the visible GPUs is reserved for the reward model used during validation. So, assuming you are on an 8-GPU machine, after disabling do_validation you also need to change --num_processes 7 in the sh script, e.g. to 8. Changing only ds_config.yaml has no effect, because a value given directly on the command line takes precedence (of course, once do_validation is off, you could also just delete --num_processes 7 from the command and control the number of GPUs via ds_config.yaml).
  5. As for prepare going through after upgrading deepspeed, I don't really know the reason either. I noticed that the tigerbot-13b-base you are using is based on transformers 4.31.0, and later transformers releases did change the llama implementation. So you might try upgrading all packages to fairly recent versions; as said above, the PRO implementation should not be sensitive to specific versions, as long as the packages are mutually compatible.
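For point 1, a rough back-of-the-envelope sketch (an estimate under the assumption that ZeRO-2 keeps a full bf16 weight copy on every rank while sharding the AdamW states, ignoring gradients and activations) already lands close to the ~74 GiB reported in the OOM messages above:

# Rough estimate only: per-GPU memory for a ~13B-parameter model with AdamW under ZeRO-2.
params = 13e9
num_ranks = 3                      # the 3-GPU run shown above
weights_bf16 = params * 2          # bytes; the bf16 weights are replicated on every rank
adamw_states = params * 12         # fp32 master weights + exp_avg + exp_avg_sq, sharded
per_rank_bytes = weights_bf16 + adamw_states / num_ranks
print(f"~{per_rank_bytes / 1024**3:.0f} GiB per GPU before gradients and activations")
# prints ~73 GiB, in the same ballpark as the "74.40 GiB already allocated" in the logs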

Thanks for following up and for the suggestions. 1. I later noticed what do_validation actually does; after removing that flag I set --num_processes to 8 so that the last card is used as well.

2. Reduce ranking_len

Following your suggestion, lowering it from 2 to 1 lets it run, with memory usage at 77G/80G. That doesn't seem reasonable: when I previously pre-trained the same 13B model with block_size 512, per_device_train_batch_size could be set to 64.

After upgrading torch and running the same program, memory usage drops to 66G/80G, so usage does go down, but setting it back to 2 still OOMs.

3. Upgrade libraries. Upgrading transformers didn't help, so I simply updated every package, and that didn't help either.

4. Switch to a smaller model

Switching to a small 1.3B model, it runs.

I still haven't found the cause; I'll see later whether I can request more GPUs for debugging...

F2-Song commented 3 months ago

And many thanks for the active feedback on your side~ One thing I'm curious about: I'm genuinely surprised that per_device_train_batch_size=64 can be set that high; was that under a peft setup, or with a quantized model?

Zheng-Jay commented 3 months ago

Oh, that was with LoRA.
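For context, a minimal sketch of what a LoRA setup via peft might look like (the rank, alpha and target modules below are illustrative assumptions, not the configuration actually used); with only the low-rank adapters trainable, the AdamW states cover a tiny fraction of the 13B parameters, which is why a much larger per-device batch size can fit:

import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative LoRA wrapping only; hyperparameters are assumptions, not the user's settings.
model = AutoModelForCausalLM.from_pretrained(
    "/mnt/data2/finLLM/models/tigerbot-13b-base", torch_dtype=torch.bfloat16
)
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # a common choice for llama-style models
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # only the adapter weights require gradients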

F2-Song commented 3 months ago

Got it. Still, I'm surprised 64 fits even with LoRA; perhaps it's because quite a few cards were in use, haha.

Zheng-Jay commented 3 months ago

It was trained on 6 cards at the time. When you have a moment, could you check your inbox? I have a few questions I'd like to ask and have sent you an email.