THUDM / VisualGLM-6B

Chinese and English multimodal conversational language model
Apache License 2.0

Error when fine-tuning the model on a 3090, please take a look #279

Open chaijunmaomao opened 11 months ago

chaijunmaomao commented 11 months ago

```
NCCL_DEBUG=info NCCL_IB_DISABLE=0 NCCL_NET_GDR_LEVEL=2 deepspeed --master_port 16666 --hostfile hostfile_single finetune_visualglm.py --experiment-name finetune-visualglm-6b --model-parallel-size 1 --mode finetune --train-iters 300 --resume-dataloader --max_source_length 64 --max_target_length 256 --lora_rank 10 --layer_range 0 14 --pre_seq_len 4 --train-data ./fewshot-data/dataset.json --valid-data ./fewshot-data/dataset.json --distributed-backend nccl --lr-decay-style cosine --warmup .02 --checkpoint-activations --save-interval 300 --eval-interval 10000 --save ./checkpoints --split 1 --eval-iters 10 --eval-batch-size 8 --zero-stage 1 --lr 0.0001 --batch-size 4 --skip-init --fp16 --use_lora

[2023-09-25 15:15:47,568] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-25 15:15:49,632] [WARNING] [runner.py:203:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected CUDA_VISIBLE_DEVICES=0: setting --include=localhost:0
[2023-09-25 15:15:49,662] [INFO] [runner.py:570:main] cmd = /home/qxgao/miniconda3/envs/python3.7/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=16666 --enable_each_rank_log=None finetune_visualglm.py --experiment-name finetune-visualglm-6b --model-parallel-size 1 --mode finetune --train-iters 300 --resume-dataloader --max_source_length 64 --max_target_length 256 --lora_rank 10 --layer_range 0 14 --pre_seq_len 4 --train-data ./fewshot-data/dataset.json --valid-data ./fewshot-data/dataset.json --distributed-backend nccl --lr-decay-style cosine --warmup .02 --checkpoint-activations --save-interval 300 --eval-interval 10000 --save ./checkpoints --split 1 --eval-iters 10 --eval-batch-size 8 --zero-stage 1 --lr 0.0001 --batch-size 4 --skip-init --fp16 --use_lora
[2023-09-25 15:15:51,330] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-25 15:15:53,132] [INFO] [launch.py:138:main] 0 NCCL_IB_DISABLE=0
[2023-09-25 15:15:53,133] [INFO] [launch.py:138:main] 0 NCCL_DEBUG=info
[2023-09-25 15:15:53,133] [INFO] [launch.py:138:main] 0 NCCL_NET_GDR_LEVEL=2
[2023-09-25 15:15:53,133] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0]}
[2023-09-25 15:15:53,133] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-09-25 15:15:53,133] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-09-25 15:15:53,133] [INFO] [launch.py:163:main] dist_world_size=1
[2023-09-25 15:15:53,133] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-09-25 15:15:54,801] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-25 15:16:01,589] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-09-25 15:16:01,590] [WARNING] [config_utils.py:70:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2023-09-25 15:16:01,590] [INFO] [checkpointing.py:1030:_configure_using_config_file] {'partition_activations': False, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
[2023-09-25 15:16:01,591] [INFO] [checkpointing.py:232:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
[2023-09-25 15:16:03,151] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 4057859
[2023-09-25 15:16:03,152] [ERROR] [launch.py:321:sigkill_handler] ['/home/qxgao/miniconda3/envs/python3.7/bin/python', '-u', 'finetune_visualglm.py', '--local_rank=0', '--experiment-name', 'finetune-visualglm-6b', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '300', '--resume-dataloader', '--max_source_length', '64', '--max_target_length', '256', '--lora_rank', '10', '--layer_range', '0', '14', '--pre_seq_len', '4', '--train-data', './fewshot-data/dataset.json', '--valid-data', './fewshot-data/dataset.json', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--save-interval', '300', '--eval-interval', '10000', '--save', './checkpoints', '--split', '1', '--eval-iters', '10', '--eval-batch-size', '8', '--zero-stage', '1', '--lr', '0.0001', '--batch-size', '4', '--skip-init', '--fp16', '--use_lora'] exits with return code = 1
```

1049451037 commented 11 months ago

It looks like loading failed because you ran out of CPU or GPU memory. You can:

  1. Try reducing batch-size
  2. Try switching to QLoRA (a sketch of both changes follows below)
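
For concreteness, here is the failing launch command with both suggestions applied: `--batch-size` cut from 4 to 1, `--eval-batch-size` from 8 to 1, and `--use_lora` swapped for `--use_qlora`. Treat this as a sketch, not a guaranteed fix: `--use_qlora` is an assumption based on the repo's QLoRA fine-tuning support, so confirm your checkout of finetune_visualglm.py actually accepts it (e.g. `python finetune_visualglm.py --help`). The `--hostfile hostfile_single` argument is dropped, since the launcher warned it could not be found anyway.

```bash
# Sketch: same run as above on a single 3090, with memory-saving changes.
# Changed vs. the failing command:
#   --batch-size      4 -> 1   (largest lever on activation/optimizer memory)
#   --eval-batch-size 8 -> 1
#   --use_lora -> --use_qlora  (assumed flag: quantizes the frozen base weights)
NCCL_DEBUG=info NCCL_IB_DISABLE=0 NCCL_NET_GDR_LEVEL=2 \
deepspeed --master_port 16666 finetune_visualglm.py \
    --experiment-name finetune-visualglm-6b \
    --model-parallel-size 1 --mode finetune \
    --train-iters 300 --resume-dataloader \
    --max_source_length 64 --max_target_length 256 \
    --lora_rank 10 --layer_range 0 14 --pre_seq_len 4 \
    --train-data ./fewshot-data/dataset.json \
    --valid-data ./fewshot-data/dataset.json \
    --distributed-backend nccl --lr-decay-style cosine --warmup .02 \
    --checkpoint-activations \
    --save-interval 300 --eval-interval 10000 --save ./checkpoints \
    --split 1 --eval-iters 10 --eval-batch-size 1 \
    --zero-stage 1 --lr 0.0001 --batch-size 1 \
    --skip-init --fp16 --use_qlora
```

If the run still exits with return code 1 and no Python traceback (as in the log above, which jumps straight from seeding to killing the subprocess), watch `nvidia-smi` during startup and check `dmesg` for the kernel OOM killer to see whether GPU or CPU memory is the one being exhausted.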