THUDM / ChatGLM-6B

ChatGLM-6B: An Open Bilingual Dialogue Language Model | 开源双语对话语言模型
Apache License 2.0

[BUG/Help] ds_train_finetune.sh threw CUDA OOM ERROR with 6*A100-40G (some trials and questions) #1237

Open treya-lin opened 1 year ago

treya-lin commented 1 year ago

Is there an existing issue for this?

Current Behavior

Hi, I am trying to use ds_train_finetune.sh to fine-tune chatglm-6b with my dialogue data. I prepared the data as README.md suggests and edited the shell script to add the --history_column argument. I have six A100-40G GPUs, but it still throws a CUDA OOM error. Does anyone know how to get it to work? It is quite strange.

Also, could the developers share a successful training log from their local environment? It would help to have a more detailed document describing the setup (number of GPUs and GPU memory), the configuration used, and how long training took, so we can make a better estimate before starting.

Configuration in my ds_train_finetune_chat.sh (based on the official ds_train_finetune.sh):

deepspeed_config=deepspeed.json
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5

deepspeed --num_gpus=6 --master_port $MASTER_PORT main.py \
    --deepspeed $deepspeed_config \
    --do_train \
    --train_file $CHAT_TRAIN_DATA \
    --test_file $CHAT_VAL_DATA \
    --prompt_column prompt \
    --response_column response \
    --history_column history \
    --overwrite_cache \
    --model_name_or_path $modelpath \
    --output_dir $outdir/meishubao-chatglm-6b-ft-$LR \
    --overwrite_output_dir \
    --max_source_length 256 \
    --max_target_length 256 \
    --per_device_train_batch_size 64 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --predict_with_generate \
    --max_steps 4000 \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate $LR \
    --fp16
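
(For reference, the effective global batch size here should be 64 per device × 6 GPUs × 1 gradient accumulation step = 384, which matches the train_batch_size DeepSpeed reports in the log below.)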

The error log:

[2023-06-13 09:23:39,841] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2023-06-13 09:23:39,841] [INFO] [logging.py:93:log_dist] [Rank 0] Removing param_group that has no 'params' in the client Optimizer
[2023-06-13 09:23:39,841] [INFO] [logging.py:93:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2023-06-13 09:23:39,853] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed Basic Optimizer = AdamW
[2023-06-13 09:23:39,854] [INFO] [utils.py:55:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'transformers.optimization.AdamW'>
[2023-06-13 09:23:39,854] [WARNING] [engine.py:1214:_do_optimizer_sanity_check] **** You are using ZeRO with an untested optimizer, proceed with caution *****
[2023-06-13 09:23:39,854] [INFO] [logging.py:93:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 2 optimizer
[2023-06-13 09:23:39,854] [INFO] [stage_1_and_2.py:144:__init__] Reduce bucket size 500000000
[2023-06-13 09:23:39,854] [INFO] [stage_1_and_2.py:145:__init__] Allgather bucket size 500000000
[2023-06-13 09:23:39,854] [INFO] [stage_1_and_2.py:146:__init__] CPU Offload: False
[2023-06-13 09:23:39,854] [INFO] [stage_1_and_2.py:147:__init__] Round robin gradient partitioning: False
Using /root/.cache/torch_extensions/py310_cu116 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py310_cu116 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py310_cu116 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py310_cu116 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py310_cu116 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py310_cu116 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py310_cu116/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.0702829360961914 seconds
Loading extension module utils...
Loading extension module utils...
Time to load utils op: 0.10293245315551758 seconds
Time to load utils op: 0.10214400291442871 seconds
Loading extension module utils...
Time to load utils op: 0.10248827934265137 seconds
Loading extension module utils...
Time to load utils op: 0.10331988334655762 seconds
Loading extension module utils...
Time to load utils op: 0.10322928428649902 seconds
Rank: 2 partition count [6, 6] and sizes[(1028631212, False), (249856, False)] 
Rank: 4 partition count [6, 6] and sizes[(1028631212, False), (249856, False)] 
Rank: 3 partition count [6, 6] and sizes[(1028631212, False), (249856, False)] 
Rank: 0 partition count [6, 6] and sizes[(1028631212, False), (249856, False)] 
Rank: 1 partition count [6, 6] and sizes[(1028631212, False), (249856, False)] 
Rank: 5 partition count [6, 6] and sizes[(1028631212, False), (249856, False)] 
Using /root/.cache/torch_extensions/py310_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00032019615173339844 seconds
Using /root/.cache/torch_extensions/py310_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Using /root/.cache/torch_extensions/py310_cu116 as PyTorch extensions root...Time to load utils op: 0.0003676414489746094 seconds

No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Using /root/.cache/torch_extensions/py310_cu116 as PyTorch extensions root...
Time to load utils op: 0.0003476142883300781 seconds
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0003695487976074219 seconds
Using /root/.cache/torch_extensions/py310_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00048661231994628906 seconds
[2023-06-13 09:23:52,339] [INFO] [utils.py:829:see_memory_usage] Before initializing optimizer states
[2023-06-13 09:23:52,340] [INFO] [utils.py:830:see_memory_usage] MA 15.33 GB         Max_MA 17.25 GB         CA 15.35 GB         Max_CA 17 GB 
[2023-06-13 09:23:52,340] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory:  used = 34.17 GB, percent = 3.4%
06/13/2023 09:23:52 - WARNING - transformers_modules.chatglm-6b.modeling_chatglm - `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
06/13/2023 09:23:52 - WARNING - transformers_modules.chatglm-6b.modeling_chatglm - `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
06/13/2023 09:23:52 - WARNING - transformers_modules.chatglm-6b.modeling_chatglm - `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
06/13/2023 09:23:52 - WARNING - transformers_modules.chatglm-6b.modeling_chatglm - `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
06/13/2023 09:23:52 - WARNING - transformers_modules.chatglm-6b.modeling_chatglm - `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[2023-06-13 09:23:52,451] [INFO] [utils.py:829:see_memory_usage] After initializing optimizer states
[2023-06-13 09:23:52,452] [INFO] [utils.py:830:see_memory_usage] MA 23.0 GB         Max_MA 30.66 GB         CA 30.68 GB         Max_CA 31 GB 
[2023-06-13 09:23:52,452] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory:  used = 34.2 GB, percent = 3.4%
[2023-06-13 09:23:52,452] [INFO] [stage_1_and_2.py:520:__init__] optimizer state initialized
[2023-06-13 09:23:52,548] [INFO] [utils.py:829:see_memory_usage] After initializing ZeRO optimizer
[2023-06-13 09:23:52,548] [INFO] [utils.py:830:see_memory_usage] MA 23.0 GB         Max_MA 23.0 GB         CA 30.68 GB         Max_CA 31 GB 
[2023-06-13 09:23:52,548] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory:  used = 34.66 GB, percent = 3.4%
[2023-06-13 09:23:52,550] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed Final Optimizer = AdamW
[2023-06-13 09:23:52,550] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2023-06-13 09:23:52,550] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x7fe184173ca0>
[2023-06-13 09:23:52,551] [INFO] [logging.py:93:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0001, 0.0001], mom=[(0.9, 0.999), (0.9, 0.999)]
[2023-06-13 09:23:52,551] [INFO] [config.py:1018:print] DeepSpeedEngine configuration:
[2023-06-13 09:23:52,551] [INFO] [config.py:1022:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}
[2023-06-13 09:23:52,551] [INFO] [config.py:1022:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-06-13 09:23:52,551] [INFO] [config.py:1022:print]   amp_enabled .................. False
[2023-06-13 09:23:52,551] [INFO] [config.py:1022:print]   amp_params ................... False
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   autotuning_config ............ {
    "enabled": false, 
    "start_step": null, 
    "end_step": null, 
    "metric_path": null, 
    "arg_mappings": null, 
    "metric": "throughput", 
    "model_info": null, 
    "results_dir": "autotuning_results", 
    "exps_dir": "autotuning_exps", 
    "overwrite": true, 
    "fast": true, 
    "start_profile_step": 3, 
    "end_profile_step": 5, 
    "tuner_type": "gridsearch", 
    "tuner_early_stopping": 5, 
    "tuner_num_trials": 50, 
    "model_info_path": null, 
    "mp_size": 1, 
    "max_train_batch_size": null, 
    "min_train_batch_size": 1, 
    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
    "min_train_micro_batch_size_per_gpu": 1, 
    "num_tuning_micro_batch_sizes": 3
}
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   bfloat16_enabled ............. False
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   checkpoint_parallel_write_pipeline  False
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   checkpoint_tag_validation_enabled  True
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   checkpoint_tag_validation_fail  False
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7fe184173760>
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   communication_data_type ...... None
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   curriculum_enabled_legacy .... False
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   curriculum_params_legacy ..... False
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   data_efficiency_enabled ...... False
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   dataloader_drop_last ......... False
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   disable_allgather ............ False
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   dump_state ................... False
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1}
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   eigenvalue_enabled ........... False
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   eigenvalue_gas_boundary_resolution  1
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   eigenvalue_layer_num ......... 0
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   eigenvalue_max_iter .......... 100
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   eigenvalue_stability ......... 1e-06
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   eigenvalue_tol ............... 0.01
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   eigenvalue_verbose ........... False
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   elasticity_enabled ........... False
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   flops_profiler_config ........ {
    "enabled": false, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
}
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   fp16_auto_cast ............... False
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   fp16_enabled ................. True
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   fp16_master_weights_and_gradients  False
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   global_rank .................. 0
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   grad_accum_dtype ............. None
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   gradient_accumulation_steps .. 1
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   gradient_clipping ............ 0.0
[2023-06-13 09:23:52,552] [INFO] [config.py:1022:print]   gradient_predivide_factor .... 1.0
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   initial_dynamic_scale ........ 65536
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   load_universal_checkpoint .... False
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   loss_scale ................... 0
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   memory_breakdown ............. False
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   nebula_config ................ {
    "enabled": false, 
    "persistent_storage_path": null, 
    "persistent_time_interval": 100, 
    "num_of_version_in_retention": 2, 
    "enable_nebula_load": true, 
    "load_path": null
}
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   optimizer_legacy_fusion ...... False
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   optimizer_name ............... None
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   optimizer_params ............. None
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   pld_enabled .................. False
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   pld_params ................... False
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   prescale_gradients ........... False
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   scheduler_name ............... None
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   scheduler_params ............. None
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   sparse_attention ............. None
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   sparse_gradients_enabled ..... False
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   steps_per_print .............. 10
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   train_batch_size ............. 384
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   train_micro_batch_size_per_gpu  64
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   use_node_local_storage ....... False
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   wall_clock_breakdown ......... False
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   world_size ................... 6
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   zero_allow_untested_optimizer  True
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500000000 allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   zero_enabled ................. True
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   zero_force_ds_cpu_optimizer .. True
[2023-06-13 09:23:52,553] [INFO] [config.py:1022:print]   zero_optimization_stage ...... 2
[2023-06-13 09:23:52,553] [INFO] [config.py:1007:print_user_config]   json = {
    "train_micro_batch_size_per_gpu": 64, 
    "zero_allow_untested_optimizer": true, 
    "fp16": {
        "enabled": true, 
        "loss_scale": 0, 
        "initial_scale_power": 16, 
        "loss_scale_window": 1000, 
        "hysteresis": 2, 
        "min_loss_scale": 1
    }, 
    "zero_optimization": {
        "stage": 2, 
        "allgather_partitions": true, 
        "allgather_bucket_size": 5.000000e+08, 
        "overlap_comm": false, 
        "reduce_scatter": true, 
        "reduce_bucket_size": 5.000000e+08, 
        "contiguous_gradients": true
    }
}
Using /root/.cache/torch_extensions/py310_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0003085136413574219 seconds
  0%|                                                                                                                                                                                              | 0/4000 [00:00<?, ?it/s]06/13/2023 09:23:52 - WARNING - transformers_modules.chatglm-6b.modeling_chatglm - `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
Traceback (most recent call last):
  File "/workspace/ChatGLM-6B/ptuning/main.py", line 431, in <module>
    main()
  File "/workspace/ChatGLM-6B/ptuning/main.py", line 370, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/workspace/ChatGLM-6B/ptuning/trainer.py", line 1635, in train
    return inner_training_loop(
  File "/workspace/ChatGLM-6B/ptuning/trainer.py", line 1904, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/workspace/ChatGLM-6B/ptuning/trainer.py", line 2647, in training_step
    loss = self.compute_loss(model, inputs)
  File "/workspace/ChatGLM-6B/ptuning/trainer.py", line 2679, in compute_loss
    outputs = model(**inputs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1846, in forward
    loss = self.module(*inputs, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 1190, in forward
    transformer_outputs = self.transformer(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 985, in forward
    layer_ret = torch.utils.checkpoint.checkpoint(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 627, in forward
    attention_outputs = self.attention(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
Traceback (most recent call last):
    return forward_call(*input, **kwargs)
  File "/workspace/ChatGLM-6B/ptuning/main.py", line 431, in <module>
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 474, in forward
    context_layer, present, attention_probs = attention_fn(
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 309, in attention_fn
    attention_scores = attention_scores * query_key_layer_scaling_coeff
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 1; 39.59 GiB total capacity; 33.27 GiB already allocated; 420.19 MiB free; 36.68 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
    main()
  File "/workspace/ChatGLM-6B/ptuning/main.py", line 370, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/workspace/ChatGLM-6B/ptuning/trainer.py", line 1635, in train
    return inner_training_loop(
  File "/workspace/ChatGLM-6B/ptuning/trainer.py", line 1904, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/workspace/ChatGLM-6B/ptuning/trainer.py", line 2647, in training_step
Traceback (most recent call last):
  File "/workspace/ChatGLM-6B/ptuning/main.py", line 431, in <module>
    loss = self.compute_loss(model, inputs)
  File "/workspace/ChatGLM-6B/ptuning/trainer.py", line 2679, in compute_loss
    main()
  File "/workspace/ChatGLM-6B/ptuning/main.py", line 370, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/workspace/ChatGLM-6B/ptuning/trainer.py", line 1635, in train
    outputs = model(**inputs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return inner_training_loop(
  File "/workspace/ChatGLM-6B/ptuning/trainer.py", line 1904, in _inner_training_loop
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1846, in forward
    tr_loss_step = self.training_step(model, inputs)
  File "/workspace/ChatGLM-6B/ptuning/trainer.py", line 2647, in training_step
    loss = self.module(*inputs, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 1190, in forward
        transformer_outputs = self.transformer(loss = self.compute_loss(model, inputs)

  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
  File "/workspace/ChatGLM-6B/ptuning/trainer.py", line 2679, in compute_loss
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 985, in forward
    layer_ret = torch.utils.checkpoint.checkpoint(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = model(**inputs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    outputs = run_function(*args)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
      File "/opt/conda/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 627, in forward
    ret_val = func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1846, in forward
    attention_outputs = self.attention(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 474, in forward
    loss = self.module(*inputs, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    context_layer, present, attention_probs = attention_fn(
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 309, in attention_fn
    attention_scores = attention_scores * query_key_layer_scaling_coeff
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 2; 39.59 GiB total capacity; 33.27 GiB already allocated; 420.19 MiB free; 36.68 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 1190, in forward
    transformer_outputs = self.transformer(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 985, in forward
    layer_ret = torch.utils.checkpoint.checkpoint(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 627, in forward
    attention_outputs = self.attention(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 474, in forward
    context_layer, present, attention_probs = attention_fn(
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 309, in attention_fn
    attention_scores = attention_scores * query_key_layer_scaling_coeff
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 3; 39.59 GiB total capacity; 33.27 GiB already allocated; 420.19 MiB free; 36.68 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/workspace/ChatGLM-6B/ptuning/main.py", line 431, in <module>
    main()
  File "/workspace/ChatGLM-6B/ptuning/main.py", line 370, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/workspace/ChatGLM-6B/ptuning/trainer.py", line 1635, in train
    return inner_training_loop(
  File "/workspace/ChatGLM-6B/ptuning/trainer.py", line 1904, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/workspace/ChatGLM-6B/ptuning/trainer.py", line 2647, in training_step
    loss = self.compute_loss(model, inputs)
  File "/workspace/ChatGLM-6B/ptuning/trainer.py", line 2679, in compute_loss
    outputs = model(**inputs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1846, in forward
    loss = self.module(*inputs, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 1190, in forward
    transformer_outputs = self.transformer(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 985, in forward
    layer_ret = torch.utils.checkpoint.checkpoint(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
Traceback (most recent call last):
  File "/workspace/ChatGLM-6B/ptuning/main.py", line 431, in <module>
    return CheckpointFunction.apply(function, preserve, *args)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    main()
  File "/workspace/ChatGLM-6B/ptuning/main.py", line 370, in main
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 627, in forward
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/workspace/ChatGLM-6B/ptuning/trainer.py", line 1635, in train
    attention_outputs = self.attention(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 474, in forward
    context_layer, present, attention_probs = attention_fn(
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 309, in attention_fn
    return inner_training_loop(
  File "/workspace/ChatGLM-6B/ptuning/trainer.py", line 1904, in _inner_training_loop
    attention_scores = attention_scores * query_key_layer_scaling_coeff
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 4; 39.59 GiB total capacity; 33.27 GiB already allocated; 420.19 MiB free; 36.68 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
    tr_loss_step = self.training_step(model, inputs)
  File "/workspace/ChatGLM-6B/ptuning/trainer.py", line 2647, in training_step
    loss = self.compute_loss(model, inputs)
  File "/workspace/ChatGLM-6B/ptuning/trainer.py", line 2679, in compute_loss
    outputs = model(**inputs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1846, in forward
    loss = self.module(*inputs, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 1190, in forward
    transformer_outputs = self.transformer(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 985, in forward
    layer_ret = torch.utils.checkpoint.checkpoint(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 627, in forward
    attention_outputs = self.attention(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 474, in forward
    context_layer, present, attention_probs = attention_fn(
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 309, in attention_fn
    attention_scores = attention_scores * query_key_layer_scaling_coeff
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 5; 39.59 GiB total capacity; 33.27 GiB already allocated; 708.19 MiB free; 36.68 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/workspace/ChatGLM-6B/ptuning/main.py", line 431, in <module>
    main()
  File "/workspace/ChatGLM-6B/ptuning/main.py", line 370, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/workspace/ChatGLM-6B/ptuning/trainer.py", line 1635, in train
    return inner_training_loop(
  File "/workspace/ChatGLM-6B/ptuning/trainer.py", line 1904, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/workspace/ChatGLM-6B/ptuning/trainer.py", line 2647, in training_step
    loss = self.compute_loss(model, inputs)
  File "/workspace/ChatGLM-6B/ptuning/trainer.py", line 2679, in compute_loss
    outputs = model(**inputs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1846, in forward
    loss = self.module(*inputs, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 1190, in forward
    transformer_outputs = self.transformer(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 985, in forward
    layer_ret = torch.utils.checkpoint.checkpoint(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 627, in forward
    attention_outputs = self.attention(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 474, in forward
    context_layer, present, attention_probs = attention_fn(
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 309, in attention_fn
    attention_scores = attention_scores * query_key_layer_scaling_coeff
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 39.59 GiB total capacity; 33.27 GiB already allocated; 708.19 MiB free; 36.68 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
  0%|                                                                                                                                                                                              | 0/4000 [00:02<?, ?it/s]
[2023-06-13 09:23:56,954] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 18020
[2023-06-13 09:23:57,287] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 18021
[2023-06-13 09:23:57,287] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 18022
[2023-06-13 09:23:57,500] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 18023
[2023-06-13 09:23:57,832] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 18024
[2023-06-13 09:23:57,832] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 18025
[2023-06-13 09:23:57,833] [ERROR] [launch.py:324:sigkill_handler] ['/opt/conda/bin/python', '-u', 'main.py', '--local_rank=5', '--deepspeed', 'deepspeed.json', '--do_train', '--train_file', '/workspace/datasets/meishubao/meishubao_train.json', '--test_file', '/workspace/datasets/meishubao/meishubao_eval.json', '--prompt_column', 'prompt', '--response_column', 'response', '--history_column', 'history', '--overwrite_cache', '--model_name_or_path', '/workspace/models/THUDM/chatglm-6b', '--output_dir', '/workspace/models/output/meishubao-chatglm-6b-finetune//meishubao-chatglm-6b-ft-1e-4', '--overwrite_output_dir', '--max_source_length', '256', '--max_target_length', '256', '--per_device_train_batch_size', '64', '--per_device_eval_batch_size', '1', '--gradient_accumulation_steps', '1', '--predict_with_generate', '--max_steps', '4000', '--logging_steps', '10', '--save_steps', '1000', '--learning_rate', '1e-4', '--fp16'] exits with return code = 1
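
If I am reading the see_memory_usage lines correctly, each GPU already holds about 23 GB (fp16 weights plus its ZeRO-2 shard of the optimizer states) before the first step, so with a per-device batch of 64 at max_source_length + max_target_length = 512 tokens the activations seem to push past 40 GB almost immediately; the traceback shows the OOM inside attention on the very first iteration. The error message suggests setting max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF to reduce fragmentation, but I doubt that alone would help at this batch size.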

Expected Behavior

Make fine-tuning work.

Steps To Reproduce

  1. Use the default DeepSpeed configuration, with the conversational data formatted as instructed in https://github.com/THUDM/ChatGLM-6B/tree/35122e39444a4671106ba13af1fe41729ae86c0e/ptuning.
  2. Edit the default ds_train_finetune.sh, changing or adding the following arguments:
    --prompt_column prompt \
    --response_column response \
    --history_column history \
  3. Run bash ds_train_finetune.sh.

Environment

- OS: Linux
- Python: 3.10.8
- Transformers: 4.29.2
- PyTorch: 1.13.1
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :

Anything else?

No response

treya-lin commented 1 year ago

A temporary update:

I just got it running with 7×A100-40G and the following batch settings (a per-device batch size of 64 still throws OOM):

--per_device_train_batch_size 4 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 16
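
(With 7 GPUs, this gives an effective global batch size of 4 × 16 × 7 = 448, if I am counting right.)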

Now the GPU status looks like this. It is a bit concerning because I only have 40 GB on each card. Will usage keep going up?

|    0   N/A  N/A    190117      C   /opt/conda/bin/python           39321MiB |
|    1   N/A  N/A    190118      C   /opt/conda/bin/python           39875MiB |
|    2   N/A  N/A    190119      C   /opt/conda/bin/python           39875MiB |
|    3   N/A  N/A    190120      C   /opt/conda/bin/python           39875MiB |
|    4   N/A  N/A    190121      C   /opt/conda/bin/python           39875MiB |
|    5   N/A  N/A    190122      C   /opt/conda/bin/python           39875MiB |
|    6   N/A  N/A    190123      C   /opt/conda/bin/python           39587MiB |

I will see whether it proceeds properly. But I am still surprised that this 6B model has such steep memory requirements for full-parameter fine-tuning. Is this normal in your local environment too? I just realized it may be because no CPU offloading was used in this case.

Would it be possible to add more guidance on how to use DeepSpeed's ZeRO-3 and CPU offload techniques to reduce the required GPU memory? (Something like what this project does: https://github.com/CVI-SZU/Linly/wiki/%E5%A2%9E%E9%87%8F%E8%AE%AD%E7%BB%83)
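
In case it helps anyone else experimenting, below is a minimal, untested sketch of what a deepspeed.json enabling ZeRO-3 with CPU offload might look like, adapted from the stage-2 config printed in the log above. The batch and accumulation values are just the ones from my run, the reduce bucket size mirrors the value in the log, and I have not verified that the ptuning trainer works with stage 3, so treat this only as a starting point.

{
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 16,
    "zero_allow_untested_optimizer": true,
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "initial_scale_power": 16,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "reduce_bucket_size": 5e8,
        "stage3_gather_16bit_weights_on_model_save": true
    }
}

Offloading the optimizer states (and optionally the parameters) to CPU should shrink the per-GPU footprint considerably at the cost of slower steps, which is presumably what the Linly guide linked above relies on.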