THUDM / ChatGLM-6B

ChatGLM-6B: An Open Bilingual Dialogue Language Model | 开源双语对话语言模型
Apache License 2.0

[BUG/Help] ds_train_finetune.sh threw CUDA OOM ERROR with 6*A100-40G (some trials and questions) #1237

Open treya-lin opened 1 year ago

treya-lin commented 1 year ago

Is there an existing issue for this?

Current Behavior

Hi, I am trying to use ds_train_finetune.sh to fine-tune chatglm-6b with my dialogue data. I prepared the data as README.md suggests and edited the shell script to add the --history_column argument. I have six A100-40G GPUs, but it still throws a CUDA OOM error. Does anyone know how to get it to work? It is quite strange.

Also, could the developers share a successful training log from their local environment? It would help to have a more detailed document describing the setup (number of GPUs and GPU memory), the configuration used, and how long training took, so we can make a better estimate before starting.

Configuration in my ds_train_finetune_chat.sh (based on the official ds_train_finetune.sh):

deepspeed_config=deepspeed.json
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5

deepspeed --num_gpus=6 --master_port $MASTER_PORT main.py \
    --deepspeed $deepspeed_config \
    --do_train \
    --train_file $CHAT_TRAIN_DATA \
    --test_file $CHAT_VAL_DATA \
    --prompt_column prompt \
    --response_column response \
    --history_column history \
    --overwrite_cache \
    --model_name_or_path $modelpath \
    --output_dir $outdir/meishubao-chatglm-6b-ft-$LR \
    --overwrite_output_dir \
    --max_source_length 256 \
    --max_target_length 256 \
    --per_device_train_batch_size 64 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --predict_with_generate \
    --max_steps 4000 \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate $LR \
    --fp16
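
(For reference, the effective global batch size here should be 64 per device × 6 GPUs × 1 gradient accumulation step = 384, which matches the train_batch_size DeepSpeed reports in the log below.)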

The error log:

[2023-06-13 09:23:39,841] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2023-06-13 09:23:39,841] [INFO] [logging.py:93:log_dist] [Rank 0] Removing param_group that has no 'params' in the client Optimizer
[2023-06-13 09:23:39,841] [INFO] [logging.py:93:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2023-06-13 09:23:39,853] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed Basic Optimizer = AdamW
[2023-06-13 09:23:39,854] [INFO] [utils.py:55:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'transformers.optimization.AdamW'>
[2023-06-13 09:23:39,854] [WARNING] [engine.py:1214:_do_optimizer_sanity_check] **** You are using ZeRO with an untested optimizer, proceed with caution *****
[2023-06-13 09:23:39,854] [INFO] [logging.py:93:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 2 optimizer
[2023-06-13 09:23:39,854] [INFO] [stage_1_and_2.py:144:__init__] Reduce bucket size 500000000
[2023-06-13 09:23:39,854] [INFO] [stage_1_and_2.py:145:__init__] Allgather bucket size 500000000
[2023-06-13 09:23:39,854] [INFO] [stage_1_and_2.py:146:__init__] CPU Offload: False
[2023-06-13 09:23:39,854] [INFO] [stage_1_and_2.py:147:__init__] Round robin gradient partitioning: False
Using /root/.cache/torch_extensions/py310_cu116 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py310_cu116 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py310_cu116 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py310_cu116 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py310_cu116 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py310_cu116 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py310_cu116/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.0702829360961914 seconds
Loading extension module utils...
Loading extension module utils...
Time to load utils op: 0.10293245315551758 seconds
Time to load utils op: 0.10214400291442871 seconds
Loading extension module utils...
Time to load utils op: 0.10248827934265137 seconds
Loading extension module utils...
Time to load utils op: 0.10331988334655762 seconds
Loading extension module utils...
Time to load utils op: 0.10322928428649902 seconds
Rank: 2 partition count [6, 6] and sizes[(1028631212, False), (249856, False)] 
Rank: 4 partition count [6, 6] and sizes[(1028631212, False), (249856, False)] 
Rank: 3 partition count [6, 6] and sizes[(1028631212, False), (249856, False)] 
Rank: 0 partition count [6, 6] and sizes[(1028631212, False), (249856, False)] 
Rank: 1 partition count [6, 6] and sizes[(1028631212, False), (249856, False)] 
Rank: 5 partition count [6, 6] and sizes[(1028631212, False), (249856, False)] 
Using /root/.cache/torch_extensions/py310_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00032019615173339844 seconds
Using /root/.cache/torch_extensions/py310_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Using /root/.cache/torch_extensions/py310_cu116 as PyTorch extensions root...Time to load utils op: 0.0003676414489746094 seconds

No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Using /root/.cache/torch_extensions/py310_cu116 as PyTorch extensions root...
Time to load utils op: 0.0003476142883300781 seconds
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0003695487976074219 seconds
Using /root/.cache/torch_extensions/py310_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00048661231994628906 seconds
[2023-06-13 09:23:52,339] [INFO] [utils.py:829:see_memory_usage] Before initializing optimizer states
[2023-06-13 09:23:52,340] [INFO] [utils.py:830:see_memory_usage] MA 15.33 GB         Max_MA 17.25 GB         CA 15.35 GB         Max_CA 17 GB 
[2023-06-13 09:23:52,340] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory:  used = 34.17 GB, percent = 3.4%
06/13/2023 09:23:52 - WARNING - transformers_modules.chatglm-6b.modeling_chatglm - `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
06/13/2023 09:23:52 - WARNING - transformers_modules.chatglm-6b.modeling_chatglm - `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
06/13/2023 09:23:52 - WARNING - transformers_modules.chatglm-6b.modeling_chatglm - `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
06/13/2023 09:23:52 - WARNING - transformers_modules.chatglm-6b.modeling_chatglm - `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
06/13/2023 09:23:52 - WARNING - transformers_modules.chatglm-6b.modeling_chatglm - `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[2023-06-13 09:23:52,451] [INFO] [utils.py:829:see_memory_usage] After initializing optimizer states
[2023-06-13 09:23:52,452] [INFO] [utils.py:830:see_memory_usage] MA 23.0 GB         Max_MA 30.66 GB         CA 30.68 GB         Max_CA 31 GB 
[2023-06-13 09:23:52,452] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory:  used = 34.2 GB, percent = 3.4%
[2023-06-13 09:23:52,452] [INFO] [stage_1_and_2.py:520:__init__] optimizer state initialized
[2023-06-13 09:23:52,548] [INFO] [utils.py:829:see_memory_usage] After initializing ZeRO optimizer
[2023-06-13 09:23:52,548] [INFO] [utils.py:830:see_memory_usage] MA 23.0 GB         Max_MA 23.0 GB         CA 30.68 GB         Max_CA 31 GB 
[2023-06-13 09:23:52,548] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory:  used = 34.66 GB, percent = 3.4%
[2023-06-13 09:23:52,550] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed Final Optimizer = AdamW
[2023-06-13 09:23:52,550] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2023-06-13 09:23:52,550] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x7fe184173ca0>
[2023-06-13 09:23:52,551] [INFO] [logging.py:93:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0001, 0.0001], mom=[(0.9, 0.999), (0.9, 0.999)]
[2023-06-13 09:23:52,551] [INFO] [config.py:1018:print] DeepSpeedEngine configuration:
[2023-06-13 09:23:52,551] [INFO] [config.py:1022:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}
[2023-06-13 09:23:52,551] [INFO] [config.py:1022:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-06-13 09:23:52,551] [INFO] [config.py:1022:print]   amp_enabled .................. False
[2023-06-13 09:23:52,551] [INFO] [config.py:1022:print]   amp_params ................... False
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   autotuning_config ............ {
    "enabled": false, 
    "start_step": null, 
    "end_step": null, 
    "metric_path": null, 
    "arg_mappings": null, 
    "metric": "throughput", 
    "model_info": null, 
    "results_dir": "autotuning_results", 
    "exps_dir": "autotuning_exps", 
    "overwrite": true, 
    "fast": true, 
    "start_profile_step": 3, 
    "end_profile_step": 5, 
    "tuner_type": "gridsearch", 
    "tuner_early_stopping": 5, 
    "tuner_num_trials": 50, 
    "model_info_path": null, 
    "mp_size": 1, 
    "max_train_batch_size": null, 
    "min_train_batch_size": 1, 
    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
    "min_train_micro_batch_size_per_gpu": 1, 
    "num_tuning_micro_batch_sizes": 3
}
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   bfloat16_enabled ............. False
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   checkpoint_parallel_write_pipeline  False
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   checkpoint_tag_validation_enabled  True
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   checkpoint_tag_validation_fail  False
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7fe184173760>
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   communication_data_type ...... None
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   curriculum_enabled_legacy .... False
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   curriculum_params_legacy ..... False
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   data_efficiency_enabled ...... False
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   dataloader_drop_last ......... False
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   disable_allgather ............ False
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   dump_state ................... False
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1}
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   eigenvalue_enabled ........... False
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   eigenvalue_gas_boundary_resolution  1
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   eigenvalue_layer_num ......... 0
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   eigenvalue_max_iter .......... 100
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   eigenvalue_stability ......... 1e-06
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   eigenvalue_tol ............... 0.01
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   eigenvalue_verbose ........... False
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   elasticity_enabled ........... False
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   flops_profiler_config ........ {
    "enabled": false, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
}
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   fp16_auto_cast ............... False
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   fp16_enabled ................. True
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   fp16_master_weights_and_gradients  False
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   global_rank .................. 0
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   grad_accum_dtype ............. None
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   gradient_accumulation_steps .. 1
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   gradient_clipping ............ 0.0
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   gradient_predivide_factor .... 1.0
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   initial_dynamic_scale ........ 65536
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   load_universal_checkpoint .... False
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   loss_scale ................... 0
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   memory_breakdown ............. False
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   nebula_config ................ {
    "enabled": false, 
    "persistent_storage_path": null, 
    "persistent_time_interval": 100, 
    "num_of_version_in_retention": 2, 
    "enable_nebula_load": true, 
    "load_path": null
}
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   optimizer_legacy_fusion ...... False
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   optimizer_name ............... None
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   optimizer_params ............. None
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   pld_enabled .................. False
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   pld_params ................... False
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   prescale_gradients ........... False
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   scheduler_name ............... None
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   scheduler_params ............. None
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   sparse_attention ............. None
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   sparse_gradients_enabled ..... False
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   steps_per_print .............. 10
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   train_batch_size ............. 384
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   train_micro_batch_size_per_gpu  64
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   use_node_local_storage ....... False
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   wall_clock_breakdown ......... False
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   world_size ................... 6
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   zero_allow_untested_optimizer  True
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500000000 allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   zero_enabled ................. True
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   zero_force_ds_cpu_optimizer .. True
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   zero_optimization_stage ...... 2
[2023-06-13 09:23:52,553] [INFO] [config.py:1007:print_user_config]   json = {
    "train_micro_batch_size_per_gpu": 64, 
    "zero_allow_untested_optimizer": true, 
    "fp16": {
        "enabled": true, 
        "loss_scale": 0, 
        "initial_scale_power": 16, 
        "loss_scale_window": 1000, 
        "hysteresis": 2, 
        "min_loss_scale": 1
    }, 
    "zero_optimization": {
        "stage": 2, 
        "allgather_partitions": true, 
        "allgather_bucket_size": 5.000000e+08, 
        "overlap_comm": false, 
        "reduce_scatter": true, 
        "reduce_bucket_size": 5.000000e+08, 
        "contiguous_gradients": true
    }
}
Using /root/.cache/torch_extensions/py310_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0003085136413574219 seconds
  0%|                                                                                                                                                                                              | 0/4000 [00:00<?, ?it/s]06/13/2023 09:23:52 - WARNING - transformers_modules.chatglm-6b.modeling_chatglm - `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
Traceback (most recent call last):
  File "/workspace/ChatGLM-6B/ptuning/main.py", line 431, in <module>
    main()
  File "/workspace/ChatGLM-6B/ptuning/main.py", line 370, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/workspace/ChatGLM-6B/ptuning/trainer.py", line 1635, in train
    return inner_training_loop(
  File "/workspace/ChatGLM-6B/ptuning/trainer.py", line 1904, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/workspace/ChatGLM-6B/ptuning/trainer.py", line 2647, in training_step
    loss = self.compute_loss(model, inputs)
  File "/workspace/ChatGLM-6B/ptuning/trainer.py", line 2679, in compute_loss
    outputs = model(**inputs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1846, in forward
    loss = self.module(*inputs, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 1190, in forward
    transformer_outputs = self.transformer(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 985, in forward
    layer_ret = torch.utils.checkpoint.checkpoint(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 627, in forward
    attention_outputs = self.attention(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
Traceback (most recent call last):
    return forward_call(*input, **kwargs)
  File "/workspace/ChatGLM-6B/ptuning/main.py", line 431, in <module>
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 474, in forward
    context_layer, present, attention_probs = attention_fn(
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 309, in attention_fn
    attention_scores = attention_scores * query_key_layer_scaling_coeff
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 1; 39.59 GiB total capacity; 33.27 GiB already allocated; 420.19 MiB free; 36.68 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
    main()
  File "/workspace/ChatGLM-6B/ptuning/main.py", line 370, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/workspace/ChatGLM-6B/ptuning/trainer.py", line 1635, in train
    return inner_training_loop(
  File "/workspace/ChatGLM-6B/ptuning/trainer.py", line 1904, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/workspace/ChatGLM-6B/ptuning/trainer.py", line 2647, in training_step
Traceback (most recent call last):
  File "/workspace/ChatGLM-6B/ptuning/main.py", line 431, in <module>
    loss = self.compute_loss(model, inputs)
  File "/workspace/ChatGLM-6B/ptuning/trainer.py", line 2679, in compute_loss
    main()
  File "/workspace/ChatGLM-6B/ptuning/main.py", line 370, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/workspace/ChatGLM-6B/ptuning/trainer.py", line 1635, in train
    outputs = model(**inputs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return inner_training_loop(
  File "/workspace/ChatGLM-6B/ptuning/trainer.py", line 1904, in _inner_training_loop
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1846, in forward
    tr_loss_step = self.training_step(model, inputs)
  File "/workspace/ChatGLM-6B/ptuning/trainer.py", line 2647, in training_step
    loss = self.module(*inputs, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 1190, in forward
        transformer_outputs = self.transformer(loss = self.compute_loss(model, inputs)

  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
  File "/workspace/ChatGLM-6B/ptuning/trainer.py", line 2679, in compute_loss
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 985, in forward
    layer_ret = torch.utils.checkpoint.checkpoint(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = model(**inputs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    outputs = run_function(*args)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
      File "/opt/conda/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 627, in forward
    ret_val = func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1846, in forward
    attention_outputs = self.attention(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 474, in forward
    loss = self.module(*inputs, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    context_layer, present, attention_probs = attention_fn(
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 309, in attention_fn
    attention_scores = attention_scores * query_key_layer_scaling_coeff
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 2; 39.59 GiB total capacity; 33.27 GiB already allocated; 420.19 MiB free; 36.68 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 1190, in forward
    transformer_outputs = self.transformer(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 985, in forward
    layer_ret = torch.utils.checkpoint.checkpoint(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 627, in forward
    attention_outputs = self.attention(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 474, in forward
    context_layer, present, attention_probs = attention_fn(
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 309, in attention_fn
    attention_scores = attention_scores * query_key_layer_scaling_coeff
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 3; 39.59 GiB total capacity; 33.27 GiB already allocated; 420.19 MiB free; 36.68 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/workspace/ChatGLM-6B/ptuning/main.py", line 431, in <module>
    main()
  File "/workspace/ChatGLM-6B/ptuning/main.py", line 370, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/workspace/ChatGLM-6B/ptuning/trainer.py", line 1635, in train
    return inner_training_loop(
  File "/workspace/ChatGLM-6B/ptuning/trainer.py", line 1904, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/workspace/ChatGLM-6B/ptuning/trainer.py", line 2647, in training_step
    loss = self.compute_loss(model, inputs)
  File "/workspace/ChatGLM-6B/ptuning/trainer.py", line 2679, in compute_loss
    outputs = model(**inputs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1846, in forward
    loss = self.module(*inputs, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 1190, in forward
    transformer_outputs = self.transformer(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 985, in forward
    layer_ret = torch.utils.checkpoint.checkpoint(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
Traceback (most recent call last):
  File "/workspace/ChatGLM-6B/ptuning/main.py", line 431, in <module>
    return CheckpointFunction.apply(function, preserve, *args)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    main()
  File "/workspace/ChatGLM-6B/ptuning/main.py", line 370, in main
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 627, in forward
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/workspace/ChatGLM-6B/ptuning/trainer.py", line 1635, in train
    attention_outputs = self.attention(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 474, in forward
    context_layer, present, attention_probs = attention_fn(
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 309, in attention_fn
    return inner_training_loop(
  File "/workspace/ChatGLM-6B/ptuning/trainer.py", line 1904, in _inner_training_loop
    attention_scores = attention_scores * query_key_layer_scaling_coeff
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 4; 39.59 GiB total capacity; 33.27 GiB already allocated; 420.19 MiB free; 36.68 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
    tr_loss_step = self.training_step(model, inputs)
  File "/workspace/ChatGLM-6B/ptuning/trainer.py", line 2647, in training_step
    loss = self.compute_loss(model, inputs)
  File "/workspace/ChatGLM-6B/ptuning/trainer.py", line 2679, in compute_loss
    outputs = model(**inputs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1846, in forward
    loss = self.module(*inputs, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 1190, in forward
    transformer_outputs = self.transformer(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 985, in forward
    layer_ret = torch.utils.checkpoint.checkpoint(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 627, in forward
    attention_outputs = self.attention(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 474, in forward
    context_layer, present, attention_probs = attention_fn(
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 309, in attention_fn
    attention_scores = attention_scores * query_key_layer_scaling_coeff
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 5; 39.59 GiB total capacity; 33.27 GiB already allocated; 708.19 MiB free; 36.68 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/workspace/ChatGLM-6B/ptuning/main.py", line 431, in <module>
    main()
  File "/workspace/ChatGLM-6B/ptuning/main.py", line 370, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/workspace/ChatGLM-6B/ptuning/trainer.py", line 1635, in train
    return inner_training_loop(
  File "/workspace/ChatGLM-6B/ptuning/trainer.py", line 1904, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/workspace/ChatGLM-6B/ptuning/trainer.py", line 2647, in training_step
    loss = self.compute_loss(model, inputs)
  File "/workspace/ChatGLM-6B/ptuning/trainer.py", line 2679, in compute_loss
    outputs = model(**inputs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1846, in forward
    loss = self.module(*inputs, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 1190, in forward
    transformer_outputs = self.transformer(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 985, in forward
    layer_ret = torch.utils.checkpoint.checkpoint(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 627, in forward
    attention_outputs = self.attention(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 474, in forward
    context_layer, present, attention_probs = attention_fn(
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 309, in attention_fn
    attention_scores = attention_scores * query_key_layer_scaling_coeff
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 39.59 GiB total capacity; 33.27 GiB already allocated; 708.19 MiB free; 36.68 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
  0%|                                                                                                                                                                                              | 0/4000 [00:02<?, ?it/s]
[2023-06-13 09:23:56,954] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 18020
[2023-06-13 09:23:57,287] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 18021
[2023-06-13 09:23:57,287] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 18022
[2023-06-13 09:23:57,500] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 18023
[2023-06-13 09:23:57,832] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 18024
[2023-06-13 09:23:57,832] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 18025
[2023-06-13 09:23:57,833] [ERROR] [launch.py:324:sigkill_handler] ['/opt/conda/bin/python', '-u', 'main.py', '--local_rank=5', '--deepspeed', 'deepspeed.json', '--do_train', '--train_file', '/workspace/datasets/meishubao/meishubao_train.json', '--test_file', '/workspace/datasets/meishubao/meishubao_eval.json', '--prompt_column', 'prompt', '--response_column', 'response', '--history_column', 'history', '--overwrite_cache', '--model_name_or_path', '/workspace/models/THUDM/chatglm-6b', '--output_dir', '/workspace/models/output/meishubao-chatglm-6b-finetune//meishubao-chatglm-6b-ft-1e-4', '--overwrite_output_dir', '--max_source_length', '256', '--max_target_length', '256', '--per_device_train_batch_size', '64', '--per_device_eval_batch_size', '1', '--gradient_accumulation_steps', '1', '--predict_with_generate', '--max_steps', '4000', '--logging_steps', '10', '--save_steps', '1000', '--learning_rate', '1e-4', '--fp16'] exits with return code = 1
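
If I am reading the see_memory_usage lines correctly, each GPU already holds about 23 GB (fp16 weights plus its ZeRO-2 shard of the optimizer states) before the first step, so with a per-device batch of 64 at max_source_length + max_target_length = 512 tokens the activations seem to push past 40 GB almost immediately; the traceback shows the OOM inside attention on the very first iteration. The error message suggests setting max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF to reduce fragmentation, but I doubt that alone would help at this batch size.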

Expected Behavior

Make fine-tuning work.

Steps To Reproduce

  1. Use the default DeepSpeed configuration, with the conversational data formatted as instructed in https://github.com/THUDM/ChatGLM-6B/tree/35122e39444a4671106ba13af1fe41729ae86c0e/ptuning.
  2. Edit the default ds_train_finetune.sh, changing or adding the following arguments:
    --prompt_column prompt \
    --response_column response \
    --history_column history \
  3. Run bash ds_train_finetune.sh.

Environment

- OS: Linux
- Python: 3.10.8
- Transformers: 4.29.2
- PyTorch: 1.13.1
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :

Anything else?

No response

treya-lin commented 1 year ago

A temporary update:

I just got it running with 7×A100-40G and the following batch settings (a per-device batch size of 64 still throws OOM):

--per_device_train_batch_size 4 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 16
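
(With 7 GPUs, this gives an effective global batch size of 4 × 16 × 7 = 448, if I am counting right.)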

Now the GPU status looks like this. It is a bit concerning because I only have 40 GB on each card. Will usage keep going up?

|    0   N/A  N/A    190117      C   /opt/conda/bin/python           39321MiB |
|    1   N/A  N/A    190118      C   /opt/conda/bin/python           39875MiB |
|    2   N/A  N/A    190119      C   /opt/conda/bin/python           39875MiB |
|    3   N/A  N/A    190120      C   /opt/conda/bin/python           39875MiB |
|    4   N/A  N/A    190121      C   /opt/conda/bin/python           39875MiB |
|    5   N/A  N/A    190122      C   /opt/conda/bin/python           39875MiB |
|    6   N/A  N/A    190123      C   /opt/conda/bin/python           39587MiB |

I will see whether it proceeds properly. But I am still surprised that this 6B model has such steep memory requirements for full-parameter fine-tuning. Is this normal in your local environment too? I just realized it may be because no CPU offloading was used in this case.

Would it be possible to add more guidance on how to use DeepSpeed's ZeRO-3 and CPU offload techniques to reduce the required GPU memory? (Something like what this project does: https://github.com/CVI-SZU/Linly/wiki/%E5%A2%9E%E9%87%8F%E8%AE%AD%E7%BB%83)
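
In case it helps anyone else experimenting, below is a minimal, untested sketch of what a deepspeed.json enabling ZeRO-3 with CPU offload might look like, adapted from the stage-2 config printed in the log above. The batch and accumulation values are just the ones from my run, the reduce bucket size mirrors the value in the log, and I have not verified that the ptuning trainer works with stage 3, so treat this only as a starting point.

{
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 16,
    "zero_allow_untested_optimizer": true,
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "initial_scale_power": 16,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "reduce_bucket_size": 5e8,
        "stage3_gather_16bit_weights_on_model_save": true
    }
}

Offloading the optimizer states (and optionally the parameters) to CPU should shrink the per-GPU footprint considerably at the cost of slower steps, which is presumably what the Linly guide linked above relies on.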