THUDM / CogVLM2

GPT4V-level open-source multi-modal model based on Llama3-8B
Apache License 2.0

Single-node multi-GPU and multi-node multi-GPU training keeps running out of GPU memory #67

Closed chensongcan closed 3 months ago

chensongcan commented 3 months ago

System Info / 系統信息

80 GB H800, CUDA 11.8, Python 3.8.13

Who can help? / 谁可以帮助到您?

@zRzRzRzRzRzRzR @1049451037

Information / 问题信息

Reproduction / 复现过程

Launched with torchrun:

MODEL_Path="./cogvlm2-llama3-chat-19B/"
train_data="./cogvlm_train.json"
epochs=3
lr=8e-6
batch_size=1
output_dir="./output/"
deepspeed_config_file="./finetune_demo/ds_config.yaml"

CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" torchrun --nnodes ${tmp_nodes} --nproc_per_node 8 \
    --master_addr ${tmp_master_addr} --node_rank ${tmp_node_rank} \
    --master_port ${tmp_master_port} ./finetune_demo/train.py \
    --lr ${lr} \
    --num_epochs ${epochs} \
    --batch_size ${batch_size} \
    --max_input_len 512 \
    --max_output_len 200 \
    --save_step 200 \
    --model_path ${MODEL_Path} \
    --dataset_path ${train_data} \
    --save_path ${output_dir} \
    --ds_config ${deepspeed_config_file}
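For reference, the tmp_* placeholders for a single-node run could be filled in roughly as below; these values are illustrative, not taken from the issue:

# Hypothetical single-node values for the torchrun placeholders above
tmp_nodes=1                 # one machine
tmp_node_rank=0             # rank of this machine
tmp_master_addr=127.0.0.1   # rendezvous address; localhost is fine on one node
tmp_master_port=29500       # any free TCP port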

[2024-05-30 08:23:52,121] [INFO] [config.py:1000:print]   bfloat16_enabled ............. True

[2024-05-30 08:23:52,121] [INFO] [config.py:1000:print]   bfloat16_immediate_grad_update False
[2024-05-30 08:23:52,121] [INFO] [config.py:1000:print]   checkpoint_parallel_write_pipeline False
[2024-05-30 08:23:52,121] [INFO] [config.py:1000:print]   checkpoint_tag_validation_enabled True
[2024-05-30 08:23:52,121] [INFO] [config.py:1000:print]   checkpoint_tag_validation_fail False
[2024-05-30 08:23:52,121] [INFO] [config.py:1000:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7fe7bb3699a0>
[2024-05-30 08:23:52,121] [INFO] [config.py:1000:print]   communication_data_type ...... None
[2024-05-30 08:23:52,121] [INFO] [config.py:1000:print]   compile_config ............... enabled=False backend='inductor' kwargs={}
[2024-05-30 08:23:52,121] [INFO] [config.py:1000:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-05-30 08:23:52,121] [INFO] [config.py:1000:print]   curriculum_enabled_legacy .... False
[2024-05-30 08:23:52,121] [INFO] [config.py:1000:print]   curriculum_params_legacy ..... False
[2024-05-30 08:23:52,121] [INFO] [config.py:1000:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-05-30 08:23:52,121] [INFO] [config.py:1000:print]   data_efficiency_enabled ...... False
[2024-05-30 08:23:52,121] [INFO] [config.py:1000:print]   dataloader_drop_last ......... False
[2024-05-30 08:23:52,121] [INFO] [config.py:1000:print]   disable_allgather ............ False
[2024-05-30 08:23:52,121] [INFO] [config.py:1000:print]   dump_state ................... False
[2024-05-30 08:23:52,121] [INFO] [config.py:1000:print]   dynamic_loss_scale_args ...... None
[2024-05-30 08:23:52,121] [INFO] [config.py:1000:print]   eigenvalue_enabled ........... False
[2024-05-30 08:23:52,121] [INFO] [config.py:1000:print]   eigenvalue_gas_boundary_resolution 1
[2024-05-30 08:23:52,121] [INFO] [config.py:1000:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2024-05-30 08:23:52,121] [INFO] [config.py:1000:print]   eigenvalue_layer_num ......... 0
[2024-05-30 08:23:52,121] [INFO] [config.py:1000:print]   eigenvalue_max_iter .......... 100
[2024-05-30 08:23:52,121] [INFO] [config.py:1000:print]   eigenvalue_stability ......... 1e-06
[2024-05-30 08:23:52,121] [INFO] [config.py:1000:print]   eigenvalue_tol ............... 0.01
[2024-05-30 08:23:52,121] [INFO] [config.py:1000:print]   eigenvalue_verbose ........... False
[2024-05-30 08:23:52,121] [INFO] [config.py:1000:print]   elasticity_enabled ........... False
[2024-05-30 08:23:52,122] [INFO] [config.py:1000:print]   flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null }
[2024-05-30 08:23:52,122] [INFO] [config.py:1000:print]   fp16_auto_cast ............... None
[2024-05-30 08:23:52,122] [INFO] [config.py:1000:print]   fp16_enabled ................. False
[2024-05-30 08:23:52,122] [INFO] [config.py:1000:print]   fp16_master_weights_and_gradients False
[2024-05-30 08:23:52,122] [INFO] [config.py:1000:print]   global_rank .................. 0
[2024-05-30 08:23:52,122] [INFO] [config.py:1000:print]   grad_accum_dtype ............. None
[2024-05-30 08:23:52,122] [INFO] [config.py:1000:print]   gradient_accumulation_steps .. 1
[2024-05-30 08:23:52,122] [INFO] [config.py:1000:print]   gradient_clipping ............ 0.1
[2024-05-30 08:23:52,122] [INFO] [config.py:1000:print]   gradient_predivide_factor .... 1.0
[2024-05-30 08:23:52,122] [INFO] [config.py:1000:print]   graph_harvesting ............. False
[2024-05-30 08:23:52,122] [INFO] [config.py:1000:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-05-30 08:23:52,122] [INFO] [config.py:1000:print]   initial_dynamic_scale ........ 1
[2024-05-30 08:23:52,122] [INFO] [config.py:1000:print]   load_universal_checkpoint .... False
[2024-05-30 08:23:52,122] [INFO] [config.py:1000:print]   loss_scale ................... 1.0
[2024-05-30 08:23:52,122] [INFO] [config.py:1000:print]   memory_breakdown ............. False
[2024-05-30 08:23:52,122] [INFO] [config.py:1000:print]   mics_hierarchial_params_gather False
[2024-05-30 08:23:52,122] [INFO] [config.py:1000:print]   mics_shard_size .............. -1
[2024-05-30 08:23:52,122] [INFO] [config.py:1000:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2024-05-30 08:23:52,122] [INFO] [config.py:1000:print]   nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null }
[2024-05-30 08:23:52,122] [INFO] [config.py:1000:print]   optimizer_legacy_fusion ...... False
[2024-05-30 08:23:52,122] [INFO] [config.py:1000:print]   optimizer_name ............... None
[2024-05-30 08:23:52,122] [INFO] [config.py:1000:print]   optimizer_params ............. None
[2024-05-30 08:23:52,122] [INFO] [config.py:1000:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2024-05-30 08:23:52,122] [INFO] [config.py:1000:print]   pld_enabled .................. False
[2024-05-30 08:23:52,122] [INFO] [config.py:1000:print]   pld_params ................... False
[2024-05-30 08:23:52,122] [INFO] [config.py:1000:print]   prescale_gradients ........... False
[2024-05-30 08:23:52,122] [INFO] [config.py:1000:print]   scheduler_name ............... None
[2024-05-30 08:23:52,122] [INFO] [config.py:1000:print]   scheduler_params ............. None
[2024-05-30 08:23:52,122] [INFO] [config.py:1000:print]   seq_parallel_communication_data_type torch.float32
[2024-05-30 08:23:52,122] [INFO] [config.py:1000:print]   sparse_attention ............. None
[2024-05-30 08:23:52,122] [INFO] [config.py:1000:print]   sparse_gradients_enabled ..... False
[2024-05-30 08:23:52,122] [INFO] [config.py:1000:print]   steps_per_print .............. inf
[2024-05-30 08:23:52,122] [INFO] [config.py:1000:print]   train_batch_size ............. 8
[2024-05-30 08:23:52,122] [INFO] [config.py:1000:print]   train_micro_batch_size_per_gpu 1
[2024-05-30 08:23:52,122] [INFO] [config.py:1000:print]   use_data_before_expertparallel False
[2024-05-30 08:23:52,122] [INFO] [config.py:1000:print]   use_node_local_storage ....... False
[2024-05-30 08:23:52,122] [INFO] [config.py:1000:print]   wall_clock_breakdown ......... False
[2024-05-30 08:23:52,122] [INFO] [config.py:1000:print]   weight_quantization_config ... None
[2024-05-30 08:23:52,122] [INFO] [config.py:1000:print]   world_size ................... 8
[2024-05-30 08:23:52,123] [INFO] [config.py:1000:print]   zero_allow_untested_optimizer True
[2024-05-30 08:23:52,123] [INFO] [config.py:1000:print]   zero_config .................. stage=2 contiguous_gradients=False reduce_scatter=True reduce_bucket_size=40000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=100000000 overlap_comm=True load_from_fp32_weights=False elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2024-05-30 08:23:52,123] [INFO] [config.py:1000:print]   zero_enabled ................. True
[2024-05-30 08:23:52,123] [INFO] [config.py:1000:print]   zero_force_ds_cpu_optimizer .. True
[2024-05-30 08:23:52,123] [INFO] [config.py:1000:print]   zero_optimization_stage ...... 2
[2024-05-30 08:23:52,123] [INFO] [config.py:986:print_user_config]   json = { "train_micro_batch_size_per_gpu": 1, "gradient_accumulation_steps": 1, "steps_per_print": inf, "gradient_clipping": 0.1, "zero_optimization": { "stage": 2, "contiguous_gradients": false, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 4.000000e+07, "allgather_bucket_size": 1.000000e+08, "load_from_fp32_weights": false, "round_robin_gradients": false }, "offload_optimizer": { "device": "cpu", "pin_memory": true }, "zero_allow_untested_optimizer": true, "bf16": { "enabled": true }, "activation_checkpointing": { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false }, "wall_clock_breakdown": false, "fp16": { "enabled": false } }
INFO:main:Preparation done. Starting training...
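Worth noting in the printed config: in the user JSON the "offload_optimizer" block sits outside "zero_optimization", and the resolved zero_config above shows offload_optimizer=None, so the CPU offload does not appear to be active. If memory stays tight, one common adjustment is to move the offload inside zero_optimization and/or switch to ZeRO stage 3 with parameter offload. A minimal sketch using standard DeepSpeed keys (not the config shipped in finetune_demo; the repo's ds_config.yaml would need the same keys in YAML form) could be written out like this:

# Hypothetical, more memory-conservative DeepSpeed config (standard DeepSpeed keys)
cat > ds_zero3_offload.json <<'EOF'
{
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "gradient_clipping": 0.1,
    "bf16": { "enabled": true },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "offload_optimizer": { "device": "cpu", "pin_memory": true },
        "offload_param": { "device": "cpu", "pin_memory": true }
    }
}
EOF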

0% 0/1120 [00:00<?, ?it/s]
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either:
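The tokenizers message is only a warning about forking after tokenizer parallelism has been used; it is unrelated to the OOM and can be silenced before launching with the standard huggingface/tokenizers environment variable:

export TOKENIZERS_PARALLELISM=false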

Expected behavior / 期待表现

Fine-tuning with the default LoRA parameters. According to the author, this should fit in about 75 GB of GPU memory on 8 GPUs, but I run out of GPU memory both on a single machine and across multiple (4-6) machines. Where could the problem be?
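To see how far per-GPU memory actually climbs before the crash, usage can be polled from a second terminal while the run starts; this is plain nvidia-smi, nothing specific to this repo:

# Print per-GPU memory usage every 2 seconds
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 2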

zRzRzRzRzRzRzR commented 3 months ago

Have you updated to the latest code from Hugging Face? modeling_cogvlm needs to be updated.
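If the local model directory was fetched with huggingface-cli, refreshing it so the updated modeling_cogvlm.py is picked up might look like the sketch below (this assumes that download method; if the directory was cloned with git-lfs, a git pull inside it does the same job):

# Re-download the snapshot, overwriting stale code files in the local directory
huggingface-cli download THUDM/cogvlm2-llama3-chat-19B --local-dir ./cogvlm2-llama3-chat-19B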

Marcovaldon commented 3 months ago

OOM +1

ailun885757124 commented 3 months ago

OOM +1, on an 8x V100 machine (256 GB in total)

zRzRzRzRzRzRzR commented 3 months ago

8x V100 cannot handle it; you need 8x 80 GB A100s or H100s.
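As a rough sanity check (my own back-of-the-envelope estimate, not a figure from the thread): with ZeRO stage 2 every rank keeps a full copy of the 19B parameters, so the bf16 weights alone take about 35 GiB per GPU before gradients, the partitioned optimizer state, activations, and the vision encoder are added, which already exceeds a 32 GB V100:

# Rough weight-memory estimate for a 19B-parameter model in bf16 (2 bytes/param)
python3 -c "p = 19e9; print(f'bf16 weights per GPU: {p * 2 / 2**30:.1f} GiB')"
# -> bf16 weights per GPU: 35.4 GiB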

Jade0321 commented 3 months ago

8x A100 can't handle it either.

Marcovaldon commented 3 months ago

8x A100 can't handle it either.

With batch=1 on each GPU it works.