Open guoyjalihy opened 6 months ago
Adding the ds_config.json file for reference:
```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": 1.0,
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "flops_profiler": {
    "enabled": false,
    "profile_step": 1,
    "module_depth": -1,
    "top_modules": 1,
    "detailed": true,
    "output_file": null
  }
}
```
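If the OOM persists, one commonly used mitigation (an illustrative variant, not something verified in this thread) is to let ZeRO-3 offload the partitioned parameters to CPU memory, trading step speed for GPU memory. Only the `zero_optimization` section changes:

```json
{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "stage3_gather_16bit_weights_on_model_save": true,
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    }
  }
}
```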
Hi, did you manage to solve this? I'm running into the same problem.
The OOM may be because the V100 does not support bf16 mode. If nothing else works, reduce the maximum sequence length. Also, isn't the learning rate set too high? That can easily cause overfitting or a NaN loss.
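For context on the bf16 point: bf16 tensor cores arrived with Ampere (compute capability 8.x), while the V100 is sm_70 and only supports fp16 in hardware. On a real machine you would simply call `torch.cuda.is_bf16_supported()`; the helper below is a made-up illustration of the capability rule:

```python
# Illustrative only: map a GPU's compute-capability major version to bf16 support.
# On a real setup, torch.cuda.is_bf16_supported() answers this directly.

def supports_bf16(cc_major: int) -> bool:
    """bf16 tensor-core support starts with Ampere (compute capability 8.x)."""
    return cc_major >= 8

print(supports_bf16(7))  # V100 is sm_70 -> False
print(supports_bf16(8))  # A100 is sm_80 -> True
```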
Problem description:
A single NVIDIA V100 with 32 GB of VRAM fails with OOM during LoRA training of Baichuan2-7B-Base: `torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.92 GiB (GPU 0; 31.74 GiB total capacity; 30.00 GiB already allocated; 304.88 MiB free; 30.60 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF` In theory 32 GB should be enough for LoRA training, right? Does anyone know the exact cause?
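A back-of-the-envelope estimate suggests why 32 GB is tight here: with `--bf16 False` the checkpoint is materialized in fp32, and the log reports `num_elems = 7.51B`, so the weights alone need roughly 28 GiB, which matches the "30.00 GiB already allocated" in the error. LoRA shrinks the optimizer and gradient memory, not the memory of the frozen base weights:

```python
# Rough memory estimate for loading a 7.51B-parameter model in full fp32 precision.
num_params = 7.51e9          # from the DeepSpeed log: num_elems = 7.51B
bytes_per_param_fp32 = 4     # fp32 = 4 bytes per parameter
weights_gib = num_params * bytes_per_param_fp32 / 2**30
print(f"{weights_gib:.1f} GiB")  # ~28.0 GiB for the weights alone
```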
Training script:
```shell
hostfile=""
MODEL_PATH='/models/Baichuan2-13B-Chat-4bits'
MODEL_PATH='/models/Baichuan2-7B-Base'  # overrides the 13B path above
TRAIN_FILE_PATH='/models/Baichuan2'
deepspeed --hostfile=$hostfile $TRAIN_FILE_PATH/fine-tune/fine-tune.py \
    --report_to "none" \
    --data_path $TRAIN_FILE_PATH/fine-tune/data/small.json \
    --model_name_or_path $MODEL_PATH \
    --output_dir $TRAIN_FILE_PATH/output \
    --model_max_length 256 \
    --num_train_epochs 4 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --save_strategy epoch \
    --learning_rate 2e-2 \
    --lr_scheduler_type constant \
    --adam_beta1 0.9 \
    --adam_beta2 0.98 \
    --adam_epsilon 1e-8 \
    --max_grad_norm 1.0 \
    --weight_decay 1e-4 \
    --warmup_ratio 0.0 \
    --logging_steps 10 \
    --gradient_checkpointing True \
    --deepspeed $TRAIN_FILE_PATH/fine-tune/ds_config.json \
    --bf16 False \
    --tf32 False \
    --use_lora True
```
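As a side note, the allocator hint from the OOM message itself can be tried by exporting the variable before launching (whether it helps here is uncertain, since the shortfall looks like a genuine capacity gap rather than fragmentation; the value 128 is an arbitrary illustrative choice):

```shell
# Suggested by the OOM message; tune max_split_size_mb as needed.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
echo "$PYTORCH_CUDA_ALLOC_CONF"
```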
Full error log:
```
root@ef36053a7fd7:/data/mlops/models/Baichuan2# ./train.sh
[2023-12-28 16:17:26,183] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-28 16:17:33,206] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-12-28 16:17:33,207] [INFO] [runner.py:571:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None /data/mlops/models/Baichuan2/fine-tune/fine-tune.py --report_to none --data_path /data/mlops/models/Baichuan2/fine-tune/data/small.json --model_name_or_path /data/mlops/models/Baichuan2-7B-Base --output_dir /data/mlops/models/Baichuan2/output --model_max_length 256 --num_train_epochs 4 --per_device_train_batch_size 1 --gradient_accumulation_steps 1 --save_strategy epoch --learning_rate 2e-2 --lr_scheduler_type constant --adam_beta1 0.9 --adam_beta2 0.98 --adam_epsilon 1e-8 --max_grad_norm 1.0 --weight_decay 1e-4 --warmup_ratio 0.0 --logging_steps 10 --gradient_checkpointing True --deepspeed /data/mlops/models/Baichuan2/fine-tune/ds_config.json --bf16 False --tf32 False
[2023-12-28 16:17:39,235] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-28 16:17:43,933] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.17.1-1+cuda12.0
[2023-12-28 16:17:43,933] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.17.1-1
[2023-12-28 16:17:43,933] [INFO] [launch.py:138:main] 0 NCCL_VERSION=2.17.1-1
[2023-12-28 16:17:43,933] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2023-12-28 16:17:43,933] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.17.1-1+cuda12.0
[2023-12-28 16:17:43,933] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2023-12-28 16:17:43,933] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.17.1-1
[2023-12-28 16:17:43,933] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0]}
[2023-12-28 16:17:43,934] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-12-28 16:17:43,934] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-12-28 16:17:43,934] [INFO] [launch.py:163:main] dist_world_size=1
[2023-12-28 16:17:43,934] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
2023/12/28 16:17:49 WARNING mlflow.utils.git_utils: Failed to import Git (the Git executable is probably not on your PATH), so Git SHA is not available. Error: Failed to initialize: Bad git executable.
The git executable must be specified in one of the following ways:
All git commands will error until this is rectified.
This initial warning can be silenced or aggravated in the future by setting the $GIT_PYTHON_REFRESH environment variable. Use one of the following values:
Example: export GIT_PYTHON_REFRESH=quiet
[2023-12-28 16:17:50,775] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-28 16:17:53,883] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-12-28 16:17:53,883] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
model_args: ModelArguments(model_name_or_path='/data/mlops/models/Baichuan2-7B-Base')
data_args DataArguments(data_path='/data/mlops/models/Baichuan2/fine-tune/data/small.json')
training_args TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.98,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
cache_dir=None,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=/data/mlops/models/Baichuan2/fine-tune/ds_config.json,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=True,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=0.02,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=/data/mlops/models/Baichuan2/output/runs/Dec28_16-17-49_ef36053a7fd7,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=10,
logging_strategy=steps,
lr_scheduler_kwargs={},
lr_scheduler_type=constant,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
model_max_length=256,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=4.0,
optim=adamw_torch,
optim_args=None,
output_dir=/data/mlops/models/Baichuan2/output,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=1,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=[],
resume_from_checkpoint=None,
run_name=/data/mlops/models/Baichuan2/output,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=500,
save_strategy=epoch,
save_total_limit=None,
seed=42,
skip_memory_metrics=True,
split_batches=False,
tf32=False,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_lora=True,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0001,
)
Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.
[2023-12-28 16:18:04,302] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 227, num_elems = 7.51B
Loading checkpoint shards: 0%| | 0/2 [03:55<?, ?it/s]
Traceback (most recent call last):
  File "/data/mlops/models/Baichuan2/fine-tune/fine-tune.py", line 173, in <module>
    train()
  File "/data/mlops/models/Baichuan2/fine-tune/fine-tune.py", line 131, in train
    model = transformers.AutoModelForCausalLM.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 561, in from_pretrained
    return model_class.from_pretrained(
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Base/modeling_baichuan.py", line 658, in from_pretrained
    return super(BaichuanForCausalLM, cls).from_pretrained(pretrained_model_name_or_path, *model_args,
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3706, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 4134, in _load_pretrained_model
    error_msgs += _load_state_dict_into_model(model_to_load, state_dict, start_prefix)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 606, in _load_state_dict_into_model
    load(model_to_load, state_dict, prefix=start_prefix)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 604, in load
    load(child, state_dict, prefix + name + ".")
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 604, in load
    load(child, state_dict, prefix + name + ".")
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 596, in load
    with deepspeed.zero.GatheredParameters(params_to_gather, modifier_rank=0):
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 2123, in __enter__
    self.params[0].all_gather(param_list=self.params)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 1062, in all_gather
    return self._all_gather(param_list, async_op=async_op, hierarchy=hierarchy)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 1399, in _all_gather
    ret_value = self._allgather_params(all_gather_list, hierarchy=hierarchy)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 1832, in _allgather_params
    replicated_tensor = torch.empty(param.ds_shape, dtype=param.ds_tensor.dtype, device=self.local_device)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.92 GiB (GPU 0; 31.74 GiB total capacity; 30.00 GiB already allocated; 304.88 MiB free; 30.60 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2023-12-28 16:22:03,420] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1982
[2023-12-28 16:22:03,439] [ERROR] [launch.py:321:sigkill_handler] ['/usr/bin/python3', '-u', '/data/mlops/models/Baichuan2/fine-tune/fine-tune.py', '--local_rank=0', '--report_to', 'none', '--data_path', '/data/mlops/models/Baichuan2/fine-tune/data/small.json', '--model_name_or_path', '/data/mlops/models/Baichuan2-7B-Base', '--output_dir', '/data/mlops/models/Baichuan2/output', '--model_max_length', '256', '--num_train_epochs', '4', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '1', '--save_strategy', 'epoch', '--learning_rate', '2e-2', '--lr_scheduler_type', 'constant', '--adam_beta1', '0.9', '--adam_beta2', '0.98', '--adam_epsilon', '1e-8', '--max_grad_norm', '1.0', '--weight_decay', '1e-4', '--warmup_ratio', '0.0', '--logging_steps', '10', '--gradient_checkpointing', 'True', '--deepspeed', '/data/mlops/models/Baichuan2/fine-tune/ds_config.json', '--bf16', 'False', '--tf32', 'False'] exits with return code = 1
```