THUDM / VisualGLM-6B

Chinese and English multimodal conversational language model | 多模态中英双语对话语言模型
Apache License 2.0

Both fine-tuning methods get killed #272

Closed: kydbj closed this issue 10 months ago

kydbj commented 11 months ago

I was fine-tuning on two A100 80G cards, but both fine-tuning methods get killed.

This is the error output after running finetune_visualglm.sh:

```
NCCL_DEBUG=info NCCL_IB_DISABLE=0 NCCL_NET_GDR_LEVEL=2 deepspeed --master_port 16666 --hostfile hostfile_single finetune_visualglm.py --experiment-name finetune-visualglm-6b --model-parallel-size 1 --mode finetune --train-iters 300 --resume-dataloader --max_source_length 64 --max_target_length 256 --lora_rank 6 --layer_range 0 14 --pre_seq_len 4 --train-data ./fewshot-data/dataset.json --valid-data ./fewshot-data/dataset.json --distributed-backend nccl --lr-decay-style cosine --warmup .02 --checkpoint-activations --save-interval 100 --eval-interval 10000 --save ./checkpoints --split 1 --eval-iters 10 --eval-batch-size 8 --zero-stage 1 --lr 0.0001 --batch-size 2 --skip-init --fp16 --use_lora
[2023-09-17 10:59:01,811] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-17 10:59:02,902] [WARNING] [runner.py:203:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected CUDA_VISIBLE_DEVICES=6,7: setting --include=localhost:6,7
[2023-09-17 10:59:02,956] [INFO] [runner.py:570:main] cmd = /home/lon/anaconda3/envs/visualglm/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbNiwgN119 --master_addr=127.0.0.1 --master_port=16666 --enable_each_rank_log=None finetune_visualglm.py --experiment-name finetune-visualglm-6b --model-parallel-size 1 --mode finetune --train-iters 300 --resume-dataloader --max_source_length 64 --max_target_length 256 --lora_rank 6 --layer_range 0 14 --pre_seq_len 4 --train-data ./fewshot-data/dataset.json --valid-data ./fewshot-data/dataset.json --distributed-backend nccl --lr-decay-style cosine --warmup .02 --checkpoint-activations --save-interval 100 --eval-interval 10000 --save ./checkpoints --split 1 --eval-iters 10 --eval-batch-size 8 --zero-stage 1 --lr 0.0001 --batch-size 2 --skip-init --fp16 --use_lora
[2023-09-17 10:59:03,731] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-17 10:59:04,975] [INFO] [launch.py:138:main] 0 NCCL_IB_DISABLE=0
[2023-09-17 10:59:04,975] [INFO] [launch.py:138:main] 0 NCCL_DEBUG=info
[2023-09-17 10:59:04,975] [INFO] [launch.py:138:main] 0 NCCL_NET_GDR_LEVEL=2
[2023-09-17 10:59:04,975] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [6, 7]}
[2023-09-17 10:59:04,975] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=2, node_rank=0
[2023-09-17 10:59:04,975] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2023-09-17 10:59:04,975] [INFO] [launch.py:163:main] dist_world_size=2
[2023-09-17 10:59:04,975] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=6,7
[2023-09-17 10:59:05,835] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-17 10:59:05,854] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-17 10:59:07,880] [INFO] using world size: 2 and model-parallel size: 1
[2023-09-17 10:59:07,880] [INFO] > padded vocab (size: 100) with 28 dummy tokens (new size: 128)
[2023-09-17 10:59:08,801] [INFO] [RANK 0] > initializing model parallel with size 1
[2023-09-17 10:59:08,817] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-09-17 10:59:08,818] [WARNING] [config_utils.py:70:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2023-09-17 10:59:08,823] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-09-17 10:59:08,824] [WARNING] [config_utils.py:70:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2023-09-17 10:59:08,825] [INFO] [checkpointing.py:1030:_configure_using_config_file] {'partition_activations': False, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
[2023-09-17 10:59:08,825] [INFO] [checkpointing.py:232:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
[2023-09-17 10:59:08,827] [INFO] [RANK 0] building FineTuneVisualGLMModel model ...
/home/lon/anaconda3/envs/visualglm/lib/python3.7/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
/home/lon/anaconda3/envs/visualglm/lib/python3.7/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
[2023-09-17 10:59:18,051] [INFO] [RANK 0] replacing layer 0 attention with lora
[2023-09-17 10:59:18,456] [INFO] [RANK 0] replacing layer 14 attention with lora
[2023-09-17 10:59:18,854] [INFO] [RANK 0] > number of parameters on model parallel rank 0: 7802586624
[2023-09-17 10:59:19,592] [INFO] [RANK 0] global rank 0 is loading checkpoint /home/lon/zyx/model-weights/VisualGLM-6B/finetune-weight/visualglm-6b/1/mp_rank_00_model_states.pt
[2023-09-17 10:59:29,107] [INFO] [RANK 0] Will continue but found unexpected_keys! Check whether you are loading correct checkpoints: ['transformer.position_embeddings.weight'].
[2023-09-17 10:59:29,107] [INFO] [RANK 0] > successfully loaded /home/lon/zyx/model-weights/VisualGLM-6B/finetune-weight/visualglm-6b/1/mp_rank_00_model_states.pt
[2023-09-17 10:59:32,507] [INFO] [RANK 0] Try to load tokenizer from Huggingface transformers...
[2023-09-17 10:59:32,663] [INFO] [RANK 0] > Set tokenizer as a /home/lon/zyx/model-weights/VisualGLM-6B/VisualGLM-weight tokenizer! Now you can get_tokenizer() everywhere.
hgx-023:1910570:1910570 [0] NCCL INFO Bootstrap : Using ibs10:10.1.5.23<0>
hgx-023:1910570:1910570 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
hgx-023:1910570:1910570 [0] misc/cudawrap.cc:90 NCCL WARN Failed to find CUDA library in (null) (NCCL_CUDA_PATH=(null))
NCCL version 2.14.3+cuda11.7
hgx-023:1910571:1910571 [1] misc/cudawrap.cc:90 NCCL WARN Failed to find CUDA library in (null) (NCCL_CUDA_PATH=(null))
hgx-023:1910570:1910995 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
hgx-023:1910571:1910571 [1] NCCL INFO Bootstrap : Using ibs10:10.1.5.23<0>
hgx-023:1910571:1910571 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
hgx-023:1910571:1910996 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
hgx-023:1910570:1910995 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_6:1/IB [5]mlx5_7:1/IB [6]mlx5_8:1/IB [7]mlx5_9:1/IB [8]mlx5_bond_0:1/RoCE [RO]; OOB ibs10:10.1.5.23<0>
hgx-023:1910570:1910995 [0] NCCL INFO Using network IB
hgx-023:1910571:1910996 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_6:1/IB [5]mlx5_7:1/IB [6]mlx5_8:1/IB [7]mlx5_9:1/IB [8]mlx5_bond_0:1/RoCE [RO]; OOB ibs10:10.1.5.23<0>
hgx-023:1910571:1910996 [1] NCCL INFO Using network IB
[2023-09-17 10:59:34,023] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1910570
[2023-09-17 10:59:34,024] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1910571
[2023-09-17 10:59:34,049] [ERROR] [launch.py:321:sigkill_handler] ['/home/lon/anaconda3/envs/visualglm/bin/python', '-u', 'finetune_visualglm.py', '--local_rank=1', '--experiment-name', 'finetune-visualglm-6b', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '300', '--resume-dataloader', '--max_source_length', '64', '--max_target_length', '256', '--lora_rank', '6', '--layer_range', '0', '14', '--pre_seq_len', '4', '--train-data', './fewshot-data/dataset.json', '--valid-data', './fewshot-data/dataset.json', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--save-interval', '100', '--eval-interval', '10000', '--save', './checkpoints', '--split', '1', '--eval-iters', '10', '--eval-batch-size', '8', '--zero-stage', '1', '--lr', '0.0001', '--batch-size', '2', '--skip-init', '--fp16', '--use_lora'] exits with return code = -11
```
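The run dies right after NCCL finishes its network setup, and the launcher reports `exits with return code = -11`, i.e. the worker processes segfault. The log also shows `NCCL WARN Failed to find CUDA library in (null) (NCCL_CUDA_PATH=(null))`. A minimal sketch of the environment settings worth double-checking before re-running (not a confirmed fix; the CUDA install path below is an assumption, adjust it to the actual install):

```bash
# Sketch only, not a confirmed fix: make sure NCCL can find the CUDA runtime before launching.
# /usr/local/cuda-11.7 is an assumed install path -- replace it with the real one.
export CUDA_HOME=/usr/local/cuda-11.7
export NCCL_CUDA_PATH=$CUDA_HOME              # the path NCCL complained about in the log above
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH

# Optionally rule out the InfiniBand / P2P transport as the source of the segfault.
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1

bash finetune_visualglm.sh
```

If the warning goes away but the segfault remains, the NCCL transport can be narrowed down further with the single-GPU sketch after the second log below.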

This is the error output after running finetune_visualglm_qlora.sh:

```
NCCL_DEBUG=info NCCL_IB_DISABLE=0 NCCL_NET_GDR_LEVEL=2 deepspeed --master_port 16666 --include localhost:6,7 --hostfile hostfile_single finetune_visualglm.py --experiment-name finetune-visualglm-6b --model-parallel-size 1 --mode finetune --train-iters 300 --resume-dataloader --max_source_length 64 --max_target_length 256 --lora_rank 10 --layer_range 0 14 --pre_seq_len 4 --train-data ./fewshot-data/dataset.json --valid-data ./fewshot-data/dataset.json --distributed-backend nccl --lr-decay-style cosine --warmup .02 --checkpoint-activations --save-interval 300 --eval-interval 10000 --save ./checkpoints --split 1 --eval-iters 10 --eval-batch-size 8 --zero-stage 1 --lr 0.0001 --batch-size 1 --gradient-accumulation-steps 4 --skip-init --fp16 --use_qlora
[2023-09-17 11:11:54,930] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-17 11:11:55,990] [WARNING] [runner.py:203:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected CUDA_VISIBLE_DEVICES=6,7 but ignoring it because one or several of --include/--exclude/--num_gpus/--num_nodes cl args were used. If you want to use CUDA_VISIBLE_DEVICES don't pass any of these arguments to deepspeed.
[2023-09-17 11:11:56,044] [INFO] [runner.py:570:main] cmd = /home/lon/anaconda3/envs/visualglm/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbNiwgN119 --master_addr=127.0.0.1 --master_port=16666 --enable_each_rank_log=None finetune_visualglm.py --experiment-name finetune-visualglm-6b --model-parallel-size 1 --mode finetune --train-iters 300 --resume-dataloader --max_source_length 64 --max_target_length 256 --lora_rank 10 --layer_range 0 14 --pre_seq_len 4 --train-data ./fewshot-data/dataset.json --valid-data ./fewshot-data/dataset.json --distributed-backend nccl --lr-decay-style cosine --warmup .02 --checkpoint-activations --save-interval 300 --eval-interval 10000 --save ./checkpoints --split 1 --eval-iters 10 --eval-batch-size 8 --zero-stage 1 --lr 0.0001 --batch-size 1 --gradient-accumulation-steps 4 --skip-init --fp16 --use_qlora
[2023-09-17 11:11:56,859] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-17 11:11:58,093] [INFO] [launch.py:138:main] 0 NCCL_IB_DISABLE=0
[2023-09-17 11:11:58,093] [INFO] [launch.py:138:main] 0 NCCL_DEBUG=info
[2023-09-17 11:11:58,093] [INFO] [launch.py:138:main] 0 NCCL_NET_GDR_LEVEL=2
[2023-09-17 11:11:58,093] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [6, 7]}
[2023-09-17 11:11:58,094] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=2, node_rank=0
[2023-09-17 11:11:58,094] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2023-09-17 11:11:58,094] [INFO] [launch.py:163:main] dist_world_size=2
[2023-09-17 11:11:58,094] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=6,7
[2023-09-17 11:11:58,934] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-17 11:11:58,943] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-17 11:12:00,890] [INFO] using world size: 2 and model-parallel size: 1
[2023-09-17 11:12:00,890] [INFO] > padded vocab (size: 100) with 28 dummy tokens (new size: 128)
[2023-09-17 11:12:01,890] [INFO] [RANK 0] > initializing model parallel with size 1
[2023-09-17 11:12:01,907] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-09-17 11:12:01,908] [WARNING] [config_utils.py:70:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2023-09-17 11:12:01,912] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-09-17 11:12:01,913] [WARNING] [config_utils.py:70:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2023-09-17 11:12:01,914] [INFO] [checkpointing.py:1030:_configure_using_config_file] {'partition_activations': False, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
[2023-09-17 11:12:01,914] [INFO] [checkpointing.py:232:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
[2023-09-17 11:12:01,916] [INFO] [RANK 0] building FineTuneVisualGLMModel model ...
/home/lon/anaconda3/envs/visualglm/lib/python3.7/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
/home/lon/anaconda3/envs/visualglm/lib/python3.7/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
[2023-09-17 11:12:11,130] [INFO] [RANK 0] replacing layer 0 attention with lora
[2023-09-17 11:12:11,559] [INFO] [RANK 0] replacing layer 14 attention with lora
[2023-09-17 11:12:11,982] [INFO] [RANK 0] replacing chatglm linear layer with 4bit
[2023-09-17 11:12:45,360] [INFO] [RANK 0] > number of parameters on model parallel rank 0: 7802848768
[2023-09-17 11:12:51,363] [INFO] [RANK 0] global rank 0 is loading checkpoint /home/lon/zyx/model-weights/VisualGLM-6B/finetune-weight/visualglm-6b/1/mp_rank_00_model_states.pt
/home/lon/zyx/model-weights/VisualGLM-6B/VisualGLM-weight
[2023-09-17 11:13:03,263] [INFO] [RANK 0] Will continue but found unexpected_keys! Check whether you are loading correct checkpoints: ['transformer.position_embeddings.weight'].
[2023-09-17 11:13:03,264] [INFO] [RANK 0] > successfully loaded /home/lon/zyx/model-weights/VisualGLM-6B/finetune-weight/visualglm-6b/1/mp_rank_00_model_states.pt
/home/lon/zyx/model-weights/VisualGLM-6B/VisualGLM-weight
[2023-09-17 11:13:09,969] [INFO] [RANK 0] Try to load tokenizer from Huggingface transformers...
[2023-09-17 11:13:10,151] [INFO] [RANK 0] > Set tokenizer as a /home/lon/zyx/model-weights/VisualGLM-6B/VisualGLM-weight tokenizer! Now you can get_tokenizer() everywhere.
hgx-023:1912489:1912489 [0] NCCL INFO Bootstrap : Using ibs10:10.1.5.23<0>
hgx-023:1912489:1912489 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
hgx-023:1912489:1912489 [0] misc/cudawrap.cc:90 NCCL WARN Failed to find CUDA library in (null) (NCCL_CUDA_PATH=(null))
NCCL version 2.14.3+cuda11.7
hgx-023:1912490:1912490 [1] misc/cudawrap.cc:90 NCCL WARN Failed to find CUDA library in (null) (NCCL_CUDA_PATH=(null))
hgx-023:1912489:1912953 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
hgx-023:1912490:1912490 [1] NCCL INFO Bootstrap : Using ibs10:10.1.5.23<0>
hgx-023:1912490:1912490 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
hgx-023:1912490:1912954 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
hgx-023:1912490:1912954 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_6:1/IB [5]mlx5_7:1/IB [6]mlx5_8:1/IB [7]mlx5_9:1/IB [8]mlx5_bond_0:1/RoCE [RO]; OOB ibs10:10.1.5.23<0>
hgx-023:1912490:1912954 [1] NCCL INFO Using network IB
hgx-023:1912489:1912953 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_6:1/IB [5]mlx5_7:1/IB [6]mlx5_8:1/IB [7]mlx5_9:1/IB [8]mlx5_bond_0:1/RoCE [RO]; OOB ibs10:10.1.5.23<0>
hgx-023:1912489:1912953 [0] NCCL INFO Using network IB
[2023-09-17 11:13:11,184] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1912489
[2023-09-17 11:13:11,212] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1912490
[2023-09-17 11:13:11,213] [ERROR] [launch.py:321:sigkill_handler] ['/home/lon/anaconda3/envs/visualglm/bin/python', '-u', 'finetune_visualglm.py', '--local_rank=1', '--experiment-name', 'finetune-visualglm-6b', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '300', '--resume-dataloader', '--max_source_length', '64', '--max_target_length', '256', '--lora_rank', '10', '--layer_range', '0', '14', '--pre_seq_len', '4', '--train-data', './fewshot-data/dataset.json', '--valid-data', './fewshot-data/dataset.json', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--save-interval', '300', '--eval-interval', '10000', '--save', './checkpoints', '--split', '1', '--eval-iters', '10', '--eval-batch-size', '8', '--zero-stage', '1', '--lr', '0.0001', '--batch-size', '1', '--gradient-accumulation-steps', '4', '--skip-init', '--fp16', '--use_qlora'] exits with return code = -11
```
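The qlora run fails at exactly the same point: model and tokenizer load fine, NCCL finishes network setup, then both workers exit with return code -11. A sketch of one way to narrow it down (assuming the arguments from finetune_visualglm_qlora.sh above are kept unchanged): launch the same job on a single GPU so that no cross-GPU NCCL traffic is needed. If this gets past initialization, the crash is likely in the multi-GPU NCCL/IB path rather than in the fine-tuning code itself.

```bash
# Isolation test, not a fix: same job, one GPU only (world size 1, no inter-GPU NCCL traffic).
# The hostfile is dropped here since the launcher reported it could not be found anyway.
NCCL_DEBUG=info deepspeed --master_port 16666 --include localhost:6 \
  finetune_visualglm.py \
  --experiment-name finetune-visualglm-6b --model-parallel-size 1 --mode finetune \
  --train-iters 300 --resume-dataloader --max_source_length 64 --max_target_length 256 \
  --lora_rank 10 --layer_range 0 14 --pre_seq_len 4 \
  --train-data ./fewshot-data/dataset.json --valid-data ./fewshot-data/dataset.json \
  --distributed-backend nccl --lr-decay-style cosine --warmup .02 --checkpoint-activations \
  --save-interval 300 --eval-interval 10000 --save ./checkpoints --split 1 \
  --eval-iters 10 --eval-batch-size 8 --zero-stage 1 --lr 0.0001 \
  --batch-size 1 --gradient-accumulation-steps 4 --skip-init --fp16 --use_qlora
```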

abbhay commented 10 months ago

Hey, did you ever solve this problem?

1049451037 commented 10 months ago

refer to: https://github.com/THUDM/VisualGLM-6B/issues/290