Open · Zheng-Jay opened this issue 3 months ago
Hello, I also replied to you in #69. From this log, the OOM happens at accelerator's prepare, which is where the LLM is distributed to each card via DeepSpeed. Since your command sets num_process=2, i.e. 2 cards are used to train the LLM, the problem should not lie in the training loop itself (and is therefore unrelated to batch_size and block_size). You could try the following and see whether it helps:
- Try a blank script that does nothing but initialize your LLM checkpoint with accelerator.prepare, and check whether the OOM error can be reproduced there.
- If step 1 still OOMs, a single card cannot hold the 13B model (possibly because zero-2 keeps a full copy of the model parameters on every card; if the model is loaded at a relatively high precision, OOM can occur). You can try switching to zero-3 and a lower precision (in the init of process_manager.py, pass a dtype directly to from_pretrained).
Hello, thanks for the reply. I tried initializing the 13B model with accelerator.prepare as you suggested; the code is below (generated with GPT, so it may contain mistakes):
```python
import os
import time

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM, AutoTokenizer

def main():
    # Initialize accelerator
    accelerator = Accelerator()
    path = "/mnt/data2/finLLM/models/tigerbot-13b-base"
    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(path)
    # Load model
    model = AutoModelForCausalLM.from_pretrained(path)
    # Prepare model with accelerator
    model, _ = accelerator.prepare(model, tokenizer)
    # Free up CUDA memory
    torch.cuda.empty_cache()
    # Pause execution for 5 minutes so GPU memory usage can be inspected
    print("Model loaded successfully. Pausing execution for 5 minutes.")
    time.sleep(300)  # 300 seconds = 5 minutes

if __name__ == "__main__":
    main()
```
The GPU does not OOM; a single A800 uses roughly 50 GB. The model's precision in its config is: "torch_dtype": "bfloat16", which seems fairly standard? So I don't think the cause is the model being too large or the precision being too high; I have previously trained this same model in someone else's open-source training project, also with zero-2 and 16-bit precision, without any OOM. Could library versions matter? While setting up this project I hit some bugs and upgraded several packages:
Package Version
----------------------- -----------
absl-py 2.1.0
accelerate 0.17.1
aiohttp 3.9.3
aiosignal 1.3.1
annotated-types 0.6.0
async-timeout 4.0.3
attrs 23.2.0
certifi 2024.2.2
charset-normalizer 2.0.4
click 8.1.7
datasets 2.18.0
deepspeed 0.8.1
dill 0.3.6
evaluate 0.4.1
filelock 3.13.1
frozenlist 1.4.1
fsspec 2024.2.0
grpcio 1.62.1
hjson 3.1.0
huggingface-hub 0.21.4
idna 3.4
importlib_metadata 7.0.2
joblib 1.3.2
Markdown 3.6
MarkupSafe 2.1.5
mkl-fft 1.3.8
mkl-random 1.2.4
mkl-service 2.4.0
multidict 6.0.5
multiprocess 0.70.14
ninja 1.11.1.1
nltk 3.8.1
numpy 1.22.2
packaging 24.0
pandas 2.0.3
peft 0.3.0
pillow 10.2.0
pip 23.3.1
protobuf 5.26.0
psutil 5.9.8
py-cpuinfo 9.0.0
pyarrow 15.0.2
pyarrow-hotfix 0.6
pydantic 1.10.9
pydantic_core 2.16.3
python-dateutil 2.9.0.post0
pytz 2024.1
PyYAML 6.0.1
regex 2023.12.25
requests 2.31.0
responses 0.18.0
rouge-score 0.1.2
scipy 1.11.1
sentencepiece 0.2.0
setuptools 68.2.2
six 1.16.0
tensorboard 2.16.2
tensorboard-data-server 0.7.2
tokenizers 0.13.3
torch 1.13.1
torchaudio 0.13.1
torchvision 0.14.1
tqdm 4.64.1
transformers 4.28.1
typing_extensions 4.9.0
tzdata 2024.1
urllib3 2.1.0
Werkzeug 3.0.1
wheel 0.41.2
xxhash 3.4.1
yarl 1.9.4
zipp 3.18.1
Hello, thanks for the detailed information. The code you posted most likely ran without DeepSpeed and on a single card, which is effectively no different from not using accelerate at all, so it is not quite the same environment as PRO (with DeepSpeed, prepare should require a dataloader to be passed in, which is why we set up a placeholder_dataloader in the code). You could try adding the following in PRO's process_manager.py, right after accelerator.prepare:

```python
if accelerator.wait_for_everyone():
    exit()
```

and check whether it still OOMs by the time execution reaches this point, then plan the next debugging step based on that.
If bf16 really only takes about 50 GB, it should not OOM at the prepare stage. You could also try specifying torch_dtype=torch.bfloat16 directly in AutoModelForCausalLM.from_pretrained().
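For reference, a minimal standalone sketch of such a DeepSpeed-style prepare() test, launched with something like `accelerate launch --config_file ds_config.yaml test_prepare.py` (the file name and the dummy dataset are illustrative and only stand in for PRO's placeholder_dataloader; this is not PRO's actual code):

```python
# Hypothetical test_prepare.py: reproduce only the prepare() step under accelerate + DeepSpeed.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator
from transformers import AutoModelForCausalLM

accelerator = Accelerator()

# Load weights directly in bf16, as suggested above.
model = AutoModelForCausalLM.from_pretrained(
    "/mnt/data2/finLLM/models/tigerbot-13b-base",
    torch_dtype=torch.bfloat16,
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# With a DeepSpeed config, prepare() needs a dataloader to infer the micro-batch size;
# this dummy dataloader plays the role of PRO's placeholder_dataloader.
placeholder_dataloader = DataLoader(TensorDataset(torch.zeros(2, 1)), batch_size=1)

model, optimizer, _ = accelerator.prepare(model, optimizer, placeholder_dataloader)
accelerator.wait_for_everyone()
accelerator.print("prepare() finished without OOM")
```

If even this OOMs, the problem is in parameter/optimizer-state placement rather than in the training loop.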
I don't know the training code very well, but thanks for the suggestions. Here are the results of my attempts:
1. Adding the code you provided directly in process_manager.py:
```python
model, optimizer, _ = self.accelerator.prepare(
    self.model, optimizer, placeholder_dataloader
)
if self.accelerator.wait_for_everyone():
    print("[info] self.accelerator.wait_for_everyone() True")
    exit()
```
Running on 3 GPUs, all 3 of them OOMed inside prepare():
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.53 GiB (GPU 1; 79.15 GiB total capacity; 74.40 GiB already allocated; 4.02 GiB free; 74.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
state['exp_avg_sq'] = torch.zeros_like(p, memory_format=torch.preserve_format)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.53 GiB (GPU 0; 79.15 GiB total capacity; 74.40 GiB already allocated; 3.99 GiB free; 74.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
state['exp_avg_sq'] = torch.zeros_like(p, memory_format=torch.preserve_format)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.53 GiB (GPU 2; 79.15 GiB total capacity; 74.40 GiB already allocated; 4.03 GiB free; 74.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
2. Setting the precision manually:

```python
self.model = AutoModelForCausalLM.from_pretrained(self.model_path, config=self.model_config, torch_dtype=torch.bfloat16)
```

All 3 cards still OOMed:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.53 GiB (GPU 2; 79.15 GiB total capacity; 74.40 GiB already allocated; 4.03 GiB free; 74.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.53 GiB (GPU 1; 79.15 GiB total capacity; 74.40 GiB already allocated; 4.02 GiB free; 74.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.53 GiB (GPU 0; 79.15 GiB total capacity; 74.40 GiB already allocated; 3.99 GiB free; 74.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
The size of the failed allocation is the same as before this change, so precision can probably be ruled out.
3. Turning off do_validation & upgrading deepspeed
I also saw the suggestion in #69 to turn off do_validation; it still OOMed. Switching deepspeed to zero3 complained that the version was too old, so I upgraded deepspeed:

Found existing installation: deepspeed 0.8.1
Uninstalling deepspeed-0.8.1:
Successfully uninstalled deepspeed-0.8.1
Successfully installed deepspeed-0.14.0 pynvml-11.5.0

After the upgrade, initialization no longer errors, and it also works after switching back to zero2, so it looks like a deepspeed version issue. Why would that be? But it then OOMs in the train loop, and even after increasing to 8 GPUs it still OOMs:
0%| | 0/5026 [00:00<?, ?it/s]
Epoch 0 starts
Load training data from ../data/hh_train_len2/train.json
0%| | 1/5026 [01:13<102:21:28, 73.33s/it]Traceback (most recent call last):
File "/share/home/wenqingchen/finLLM/DAMO-ConvAI-main/PRO/train/main.py", line 45, in <module>
model = process_manager.train()
File "/share/home/wenqingchen/finLLM/DAMO-ConvAI-main/PRO/train/utils/process_manager.py", line 261, in train
self.compute_loss(model, batch, print_loss)
File "/share/home/wenqingchen/finLLM/DAMO-ConvAI-main/PRO/train/utils/process_manager.py", line 111, in compute_loss
self.accelerator.backward(total_loss)
File "/share/home/wenqingchen/miniconda3/envs/pro/lib/python3.9/site-packages/accelerate/accelerator.py", line 1630, in backward
self.deepspeed_engine_wrapped.backward(loss, **kwargs)
File "/share/home/wenqingchen/miniconda3/envs/pro/lib/python3.9/site-packages/accelerate/utils/deepspeed.py", line 167, in backward
self.engine.backward(loss, **kwargs)
File "/share/home/wenqingchen/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/share/home/wenqingchen/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1976, in backward
self.optimizer.backward(loss, retain_graph=retain_graph)
File "/share/home/wenqingchen/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2051, in backward
self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
File "/share/home/wenqingchen/miniconda3/envs/pro/lib/python3.9/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
scaled_loss.backward(retain_graph=retain_graph)
File "/share/home/wenqingchen/miniconda3/envs/pro/lib/python3.9/site-packages/torch/_tensor.py", line 488, in backward
torch.autograd.backward(
File "/share/home/wenqingchen/miniconda3/envs/pro/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 592.00 MiB (GPU 4; 79.15 GiB total capacity; 76.09 GiB already allocated; 269.31 MiB free; 78.25 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
The zero3 config is as follows:
```yaml
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 8
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3  # switched to ZeRO-3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
use_cpu: false
```
Originally I had CPU offload enabled, but deepspeed errored out, insisting on its official optimizer instead of the torch optimizer in the PRO code, and I don't know how to change that... That said, with bs=1, block_size 512 and a 13B model, this setup shouldn't need that much memory, should it? Is some other library version still wrong?
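As a rough, purely illustrative sanity check (standard ZeRO accounting with bf16 parameters and gradients and fp32 AdamW states at about 12 bytes per parameter; the 13B parameter count is nominal), full-parameter AdamW training of a 13B model needs more memory than intuition suggests, even before activations:

```python
# Rough ZeRO memory estimate for full-parameter AdamW on a ~13B model
# (illustrative only; activations, buffers and fragmentation come on top).
params = 13e9
bf16_params = 2 * params   # bf16 parameter copy, replicated per GPU under ZeRO-1/2
grads = 2 * params         # bf16 gradients
adam_states = 12 * params  # fp32 master weights + exp_avg + exp_avg_sq

def per_gpu_gib(n_gpus: int, stage: int) -> float:
    if stage == 3:    # ZeRO-3: parameters, gradients and optimizer states all sharded
        total = (bf16_params + grads + adam_states) / n_gpus
    elif stage == 2:  # ZeRO-2: gradients and optimizer states sharded, parameters replicated
        total = bf16_params + (grads + adam_states) / n_gpus
    else:             # ZeRO-1: only optimizer states sharded
        total = bf16_params + grads + adam_states / n_gpus
    return total / 2**30

print(f"ZeRO-2, 3 GPUs: ~{per_gpu_gib(3, 2):.0f} GiB per GPU")  # ~81 GiB, already at the 80 GiB limit
print(f"ZeRO-2, 8 GPUs: ~{per_gpu_gib(8, 2):.0f} GiB per GPU")  # ~45 GiB before activations
print(f"ZeRO-3, 8 GPUs: ~{per_gpu_gib(8, 3):.0f} GiB per GPU")  # ~24 GiB before activations
```

The ~16.5 GiB allocation that fails in the prepare-stage errors above is roughly the size of one fp32 optimizer-state shard on 3 cards (13e9 × 4 bytes / 3), which is consistent with this picture.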
Thanks for the detailed description; replies point by point:
- The optimizer is defined at line 153 of process_manager.py. The versions we used are exactly the ones listed in the top-level requirements.txt. If using other versions requires changing the optimizer, you can change it right there, though I am not sure whether that (e.g. using another optimizer) is supported by accelerate. (The AdamW used now is indeed fairly memory-hungry; a sketch of the kind of change involved is given after this list.)
- The implementation does not directly use any advanced deepspeed features, so it should not be sensitive to package versions; as long as it runs after upgrading, that is fine.
- For the bs=1, block_size 512, 13B model setup you mention, I currently have no 8-card machine available, so I cannot try to reproduce it. You could set ranking_len=1 in the sh script and check whether training then runs normally (that is equivalent to SFT). We used to apply a simple rule of thumb: bs=1, ranking_len=2 and bs=2, ranking_len=1 should need roughly the same resources.
- Sorry that my description of do_validation in #69 was not clear enough. The option itself has nothing to do with memory; when enabled, it takes the last of the visible cards and places the reward model there for validation. So, assuming an 8-card machine, after turning off do_validation you also need to change --num_processes 7 in the sh script, e.g. to 8. Changing it only in ds_config.yaml has no effect, because a value passed directly on the command line takes precedence (alternatively, after turning off do_validation you can delete --num_processes 7 from the command and control the number of cards through ds_config.yaml instead).
- As for prepare succeeding after upgrading deepspeed, I do not know the reason either. I notice that the tigerbot-13b-base you use targets transformers 4.31.0, and later transformers versions did change the llama implementation. So you could try upgrading all packages to fairly recent versions; as said above, PRO's implementation should not be sensitive to specific versions, as long as the packages are mutually compatible.
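For illustration only (not PRO's actual code, and the argument names are made up): if the optimizer at that line is a plain torch.optim.AdamW, enabling optimizer CPU offload in the DeepSpeed config typically means constructing DeepSpeed's CPU Adam there instead, roughly like this:

```python
import torch
from deepspeed.ops.adam import DeepSpeedCPUAdam

# build_optimizer, lr, weight_decay and use_cpu_offload are illustrative names,
# not PRO's actual arguments.
def build_optimizer(model, lr=1e-5, weight_decay=0.0, use_cpu_offload=False):
    if use_cpu_offload:
        # Needed when ds_config sets offload_optimizer_device: cpu;
        # DeepSpeed rejects plain torch optimizers in that mode.
        return DeepSpeedCPUAdam(model.parameters(), lr=lr, weight_decay=weight_decay)
    # Original behaviour: a plain torch AdamW (fp32 states, memory-hungry).
    return torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
```

DeepSpeedCPUAdam keeps its optimizer states in host memory, which is what makes offload_optimizer_device: cpu work and why DeepSpeed insisted on its own optimizer when offload was enabled, as reported above.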
Thanks for the follow-up and the suggestions. My results:
1. do_validation: I later noticed what do_validation actually does; after removing the flag I set --num_processes to 8 so that the last card is used as well.
2. Reducing ranking_len: following your suggestion I changed it from 2 to 1 and it runs, with memory usage at 77 GB/80 GB, which does not seem reasonable; when I previously pretrained the same 13B model with block_size 512, per_device_train_batch_size could be set to 64.
After upgrading the torch version and running the same program, usage drops to 66 GB/80 GB, so it did go down, but setting ranking_len back to 2 still OOMs.
3. Upgrading libraries: upgrading transformers did not help, so I updated every package, which still did not help.
4. Switching to a smaller model: a 1.3B model runs fine.
I still cannot find the cause; I will see whether I can get more GPUs for debugging...
Thank you as well for the active feedback on your side!
One thing I am curious about myself: the per_device_train_batch_size=64 setting. I am genuinely surprised it can go that high; was that under a peft setup, or with a quantized model?
Oh, it was using LoRA.
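For context, LoRA freezes the base model and trains only small low-rank adapter matrices, so gradients and optimizer states exist for only a tiny fraction of the parameters; that is why a much larger per-device batch size can fit. A minimal peft sketch follows; the checkpoint path and the rank/alpha/dropout values are made-up illustrative choices, not the configuration used in the run discussed above.

import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# Hypothetical checkpoint path; any causal LM checkpoint works the same way.
model = AutoModelForCausalLM.from_pretrained("path/to/13b-base", torch_dtype=torch.bfloat16)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,              # illustrative rank
    lora_alpha=16,
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically reports well under 1% trainable parameters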
Got it. Even with LoRA, being able to use 64 is surprising; maybe it is because quite a few GPUs were used, haha.
It was trained on 6 GPUs at the time. By the way, could you check your inbox when you have a moment? I have a few questions I would like to ask and have sent you an email.
Hi, running the PRO training code gives me an OOM. I am on 80G A800s training a 13B model, so in principle it should not blow up. I set batch size to 1 and block_size to 100 and it still OOMs; I cannot tell where the problem is. train_hh.sh:
Log: