baichuan-inc / Baichuan2

A series of large language models developed by Baichuan Intelligent Technology
https://huggingface.co/baichuan-inc
Apache License 2.0

OOM error when training Baichuan2-7B-Base #333

Open guoyjalihy opened 6 months ago

guoyjalihy commented 6 months ago

Problem description:

LoRA fine-tuning of Baichuan2-7B-Base on a single NVIDIA V100 (32 GB) fails with an OOM error: `torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.92 GiB (GPU 0; 31.74 GiB total capacity; 30.00 GiB already allocated; 304.88 MiB free; 30.60 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF`. In theory 32 GB should be enough for LoRA training, right? Does anyone know the actual cause?

Training script:

```shell
hostfile=""

MODEL_PATH='/models/Baichuan2-13B-Chat-4bits'
MODEL_PATH='/models/Baichuan2-7B-Base'
TRAIN_FILE_PATH='/models/Baichuan2'

deepspeed --hostfile=$hostfile $TRAIN_FILE_PATH/fine-tune/fine-tune.py \
    --report_to "none" \
    --data_path $TRAIN_FILE_PATH/fine-tune/data/small.json \
    --model_name_or_path $MODEL_PATH \
    --output_dir $TRAIN_FILE_PATH/output \
    --model_max_length 256 \
    --num_train_epochs 4 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --save_strategy epoch \
    --learning_rate 2e-2 \
    --lr_scheduler_type constant \
    --adam_beta1 0.9 \
    --adam_beta2 0.98 \
    --adam_epsilon 1e-8 \
    --max_grad_norm 1.0 \
    --weight_decay 1e-4 \
    --warmup_ratio 0.0 \
    --logging_steps 10 \
    --gradient_checkpointing True \
    --deepspeed $TRAIN_FILE_PATH/fine-tune/ds_config.json \
    --bf16 False \
    --tf32 False \
    --use_lora True
```

Full error log:

```text
root@ef36053a7fd7:/data/mlops/models/Baichuan2# ./train.sh
[2023-12-28 16:17:26,183] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-28 16:17:33,206] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-12-28 16:17:33,207] [INFO] [runner.py:571:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None /data/mlops/models/Baichuan2/fine-tune/fine-tune.py --report_to none --data_path /data/mlops/models/Baichuan2/fine-tune/data/small.json --model_name_or_path /data/mlops/models/Baichuan2-7B-Base --output_dir /data/mlops/models/Baichuan2/output --model_max_length 256 --num_train_epochs 4 --per_device_train_batch_size 1 --gradient_accumulation_steps 1 --save_strategy epoch --learning_rate 2e-2 --lr_scheduler_type constant --adam_beta1 0.9 --adam_beta2 0.98 --adam_epsilon 1e-8 --max_grad_norm 1.0 --weight_decay 1e-4 --warmup_ratio 0.0 --logging_steps 10 --gradient_checkpointing True --deepspeed /data/mlops/models/Baichuan2/fine-tune/ds_config.json --bf16 False --tf32 False
[2023-12-28 16:17:39,235] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-28 16:17:43,933] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.17.1-1+cuda12.0
[2023-12-28 16:17:43,933] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.17.1-1
[2023-12-28 16:17:43,933] [INFO] [launch.py:138:main] 0 NCCL_VERSION=2.17.1-1
[2023-12-28 16:17:43,933] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2023-12-28 16:17:43,933] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.17.1-1+cuda12.0
[2023-12-28 16:17:43,933] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2023-12-28 16:17:43,933] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.17.1-1
[2023-12-28 16:17:43,933] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0]}
[2023-12-28 16:17:43,934] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-12-28 16:17:43,934] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-12-28 16:17:43,934] [INFO] [launch.py:163:main] dist_world_size=1
[2023-12-28 16:17:43,934] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
2023/12/28 16:17:49 WARNING mlflow.utils.git_utils: Failed to import Git (the Git executable is probably not on your PATH), so Git SHA is not available. Error: Failed to initialize: Bad git executable.
The git executable must be specified in one of the following ways:
All git commands will error until this is rectified.
This initial warning can be silenced or aggravated in the future by setting the $GIT_PYTHON_REFRESH environment variable. Use one of the following values:
Example: export GIT_PYTHON_REFRESH=quiet
[2023-12-28 16:17:50,775] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-28 16:17:53,883] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-12-28 16:17:53,883] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl

model_args: ModelArguments(model_name_or_path='/data/mlops/models/Baichuan2-7B-Base')
data_args: DataArguments(data_path='/data/mlops/models/Baichuan2/fine-tune/data/small.json')
training_args: TrainingArguments(
  _n_gpu=1, adafactor=False, adam_beta1=0.9, adam_beta2=0.98, adam_epsilon=1e-08,
  auto_find_batch_size=False, bf16=False, bf16_full_eval=False, cache_dir=None, data_seed=None,
  dataloader_drop_last=False, dataloader_num_workers=0, dataloader_persistent_workers=False,
  dataloader_pin_memory=True, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None,
  ddp_find_unused_parameters=None, ddp_timeout=1800, debug=[],
  deepspeed=/data/mlops/models/Baichuan2/fine-tune/ds_config.json, disable_tqdm=False,
  dispatch_batches=None, do_eval=False, do_predict=False, do_train=False,
  eval_accumulation_steps=None, eval_delay=0, eval_steps=None, evaluation_strategy=no,
  fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[],
  fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
  fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False,
  gradient_accumulation_steps=1, gradient_checkpointing=True, gradient_checkpointing_kwargs=None,
  greater_is_better=None, group_by_length=False, half_precision_backend=auto,
  hub_always_push=False, hub_model_id=None, hub_private_repo=False, hub_strategy=every_save,
  hub_token=, ignore_data_skip=False, include_inputs_for_metrics=False,
  include_num_input_tokens_seen=False, include_tokens_per_second=False, jit_mode_eval=False,
  label_names=None, label_smoothing_factor=0.0, learning_rate=0.02, length_column_name=length,
  load_best_model_at_end=False, local_rank=0, log_level=passive, log_level_replica=warning,
  log_on_each_node=True, logging_dir=/data/mlops/models/Baichuan2/output/runs/Dec28_16-17-49_ef36053a7fd7,
  logging_first_step=False, logging_nan_inf_filter=True, logging_steps=10, logging_strategy=steps,
  lr_scheduler_kwargs={}, lr_scheduler_type=constant, max_grad_norm=1.0, max_steps=-1,
  metric_for_best_model=None, model_max_length=256, mp_parameters=, neftune_noise_alpha=None,
  no_cuda=False, num_train_epochs=4.0, optim=adamw_torch, optim_args=None,
  output_dir=/data/mlops/models/Baichuan2/output, overwrite_output_dir=False, past_index=-1,
  per_device_eval_batch_size=8, per_device_train_batch_size=1, prediction_loss_only=False,
  push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=,
  ray_scope=last, remove_unused_columns=True, report_to=[], resume_from_checkpoint=None,
  run_name=/data/mlops/models/Baichuan2/output, save_on_each_node=False, save_only_model=False,
  save_safetensors=True, save_steps=500, save_strategy=epoch, save_total_limit=None, seed=42,
  skip_memory_metrics=True, split_batches=False, tf32=False, torch_compile=False,
  torch_compile_backend=None, torch_compile_mode=None, torchdynamo=None, tpu_metrics_debug=False,
  tpu_num_cores=None, use_cpu=False, use_ipex=False, use_legacy_prediction_loop=False,
  use_lora=True, use_mps_device=False, warmup_ratio=0.0, warmup_steps=0, weight_decay=0.0001,
)
Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers: pip install xformers.
[2023-12-28 16:18:04,302] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 227, num_elems = 7.51B
Loading checkpoint shards:   0%|          | 0/2 [03:55<?, ?it/s]
Traceback (most recent call last):
  File "/data/mlops/models/Baichuan2/fine-tune/fine-tune.py", line 173, in <module>
    train()
  File "/data/mlops/models/Baichuan2/fine-tune/fine-tune.py", line 131, in train
    model = transformers.AutoModelForCausalLM.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 561, in from_pretrained
    return model_class.from_pretrained(
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Base/modeling_baichuan.py", line 658, in from_pretrained
    return super(BaichuanForCausalLM, cls).from_pretrained(pretrained_model_name_or_path, *model_args,
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3706, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 4134, in _load_pretrained_model
    error_msgs += _load_state_dict_into_model(model_to_load, state_dict, start_prefix)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 606, in _load_state_dict_into_model
    load(model_to_load, state_dict, prefix=start_prefix)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 604, in load
    load(child, state_dict, prefix + name + ".")
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 604, in load
    load(child, state_dict, prefix + name + ".")
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 596, in load
    with deepspeed.zero.GatheredParameters(params_to_gather, modifier_rank=0):
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 2123, in __enter__
    self.params[0].all_gather(param_list=self.params)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 1062, in all_gather
    return self._all_gather(param_list, async_op=async_op, hierarchy=hierarchy)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 1399, in _all_gather
    ret_value = self._allgather_params(all_gather_list, hierarchy=hierarchy)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 1832, in _allgather_params
    replicated_tensor = torch.empty(param.ds_shape, dtype=param.ds_tensor.dtype, device=self.local_device)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.92 GiB (GPU 0; 31.74 GiB total capacity; 30.00 GiB already allocated; 304.88 MiB free; 30.60 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2023-12-28 16:22:03,420] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1982
[2023-12-28 16:22:03,439] [ERROR] [launch.py:321:sigkill_handler] ['/usr/bin/python3', '-u', '/data/mlops/models/Baichuan2/fine-tune/fine-tune.py', '--local_rank=0', '--report_to', 'none', '--data_path', '/data/mlops/models/Baichuan2/fine-tune/data/small.json', '--model_name_or_path', '/data/mlops/models/Baichuan2-7B-Base', '--output_dir', '/data/mlops/models/Baichuan2/output', '--model_max_length', '256', '--num_train_epochs', '4', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '1', '--save_strategy', 'epoch', '--learning_rate', '2e-2', '--lr_scheduler_type', 'constant', '--adam_beta1', '0.9', '--adam_beta2', '0.98', '--adam_epsilon', '1e-8', '--max_grad_norm', '1.0', '--weight_decay', '1e-4', '--warmup_ratio', '0.0', '--logging_steps', '10', '--gradient_checkpointing', 'True', '--deepspeed', '/data/mlops/models/Baichuan2/fine-tune/ds_config.json', '--bf16', 'False', '--tf32', 'False'] exits with return code = 1
```
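The numbers in the log are consistent with the weights being loaded in full precision: the script passes `--bf16 False` (and `fp16=False` in the dumped `TrainingArguments`), and the log reports `num_elems = 7.51B`. A rough back-of-the-envelope check (a sketch using only those two figures from the log):

```python
def param_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Rough GiB needed just to hold the model weights, ignoring everything else."""
    return num_params * bytes_per_param / 2**30

# 7.51B parameters, as reported by DeepSpeed in the log above
fp32 = param_memory_gib(7.51e9, 4)  # full precision (bf16=False, fp16=False)
fp16 = param_memory_gib(7.51e9, 2)  # half precision

print(f"fp32 weights: {fp32:.1f} GiB")  # ~28 GiB -- nearly fills a 32 GiB V100
print(f"fp16 weights: {fp16:.1f} GiB")  # ~14 GiB
```

So before any LoRA adapters, gradients, optimizer state, or activations are allocated, fp32 weights alone come within a few GiB of the card's capacity, which matches the "30.00 GiB already allocated" in the traceback during checkpoint loading.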

guoyjalihy commented 6 months ago

Adding the ds_config.json file:

```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": 1.0,
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "flops_profiler": {
    "enabled": false,
    "profile_step": 1,
    "module_depth": -1,
    "top_modules": 1,
    "detailed": true,
    "output_file": null
  }
}
```
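One commonly used workaround for this config (a sketch, not verified on this exact setup) is to let ZeRO stage 3 offload parameters and optimizer state to CPU memory, trading training speed for GPU headroom; `offload_param` and `offload_optimizer` are standard DeepSpeed ZeRO-3 options:

```json
{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "stage3_gather_16bit_weights_on_model_save": true,
    "offload_param": { "device": "cpu", "pin_memory": true },
    "offload_optimizer": { "device": "cpu", "pin_memory": true }
  }
}
```

This requires enough host RAM to hold the offloaded tensors, and throughput drops noticeably compared to keeping everything on the GPU.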

yuwanglang commented 6 months ago

Hi, did you manage to solve this? I'm running into the same problem.

JJASMINE22 commented 5 months ago

The OOM may be because the V100 does not support bf16. If nothing else works, also try shortening the maximum sequence length. And isn't the learning rate set too high? That can easily lead to overfitting or a NaN loss.
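Acting on this advice with the script's existing flags might look like the following (a hedged sketch: `--fp16` is a standard HF `TrainingArguments` option that `fine-tune.py` should accept via its argument parser, but whether these values fit in 32 GiB here is untested, and 2e-4 is just a commonly used LoRA learning rate, not a value from this thread):

```shell
deepspeed --hostfile=$hostfile $TRAIN_FILE_PATH/fine-tune/fine-tune.py \
    --fp16 True \
    --bf16 False \
    --model_max_length 128 \
    --learning_rate 2e-4 \
    --use_lora True
    # ...remaining flags as in the original script
```

The V100 has fp16 tensor cores but no bf16 support, so fp16 is the mixed-precision mode that can actually halve the weight footprint on this card; the shorter `--model_max_length` shrinks activation memory on top of that.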