OpenBMB / VisCPM

[ICLR'24 spotlight] Chinese and English Multimodal Large Model Series (Chat and Paint) | A series of Chinese-English bilingual multimodal large models based on the CPM foundation models

Fine-tuning process gets killed #29

Closed kydbj closed 7 months ago

kydbj commented 11 months ago

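The error below occurred during fine-tuning. I am using two A100 80G GPUs. The system CUDA version is 12.0, and the installed conda environment is cu117.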

-------final CMD is------
deepspeed ./finetune/ft_viscpm_chat/train_viscpm_chat.py --image_path /home/lon/zyx/VisCPM/finetune/ft_viscpm_chat/coco/train2017/ --text_path /home/lon/zyx/VisCPM/finetune/ft_viscpm_chat/llava_instruct_150k_zh.json --llm_path ./config/cpm-bee-10b.json --exp_name ft_viscpm_chat --model_dir /home/lon/zyx/model-weights/VisCPM/weight/ --query_num 64 --max_len 512 --batch_size 1 --save_step 500 --epochs 5 --deepspeed_config ./finetune/ft_viscpm_chat/config/deepspeed/viscpm_chat_ft.json --sft --tune_llm --tune_vision
-------final CMD end------
[2023-09-17 15:57:24,702] [WARNING] [runner.py:190:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected CUDA_VISIBLE_DEVICES=6,7: setting --include=localhost:6,7
[2023-09-17 15:57:24,756] [INFO] [runner.py:540:main] cmd = /home/lon/anaconda3/envs/viscpm/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None ./finetune/ft_viscpm_chat/train_viscpm_chat.py --image_path /home/lon/zyx/VisCPM/finetune/ft_viscpm_chat/coco/train2017/ --text_path /home/lon/zyx/VisCPM/finetune/ft_viscpm_chat/llava_instruct_150k_zh.json --llm_path ./config/cpm-bee-10b.json --exp_name ft_viscpm_chat --model_dir /home/lon/zyx/model-weights/VisCPM/weight/ --query_num 64 --max_len 512 --batch_size 1 --save_step 500 --epochs 5 --deepspeed_config ./finetune/ft_viscpm_chat/config/deepspeed/viscpm_chat_ft.json --sft --tune_llm --tune_vision
[2023-09-17 15:57:26,315] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [6, 7]}
[2023-09-17 15:57:26,316] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=2, node_rank=0
[2023-09-17 15:57:26,316] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2023-09-17 15:57:26,316] [INFO] [launch.py:247:main] dist_world_size=2
[2023-09-17 15:57:26,316] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=6,7
[2023-09-17,15:57:28][I][1943889-fin.ini-initializer.py:116]- args: {args}
[2023-09-17,15:57:28][I][1943889-fin.uti.uti-utils.py:111]- init_distributed_mode LOCAL_RANK
| distributed init (rank 1): env://, gpu 1
| distributed init (rank 0): env://, gpu 0
[2023-09-17,15:57:29][I][1943890-tor.dis.dis-distributed_c10d.py:319]- Added key: store_based_barrier_key:1 to store for rank: 1
[2023-09-17,15:57:29][I][1943889-tor.dis.dis-distributed_c10d.py:319]- Added key: store_based_barrier_key:1 to store for rank: 0
[2023-09-17,15:57:29][I][1943889-tor.dis.dis-distributed_c10d.py:353]- Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
[2023-09-17,15:57:29][I][1943890-tor.dis.dis-distributed_c10d.py:353]- Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
[2023-09-17 15:57:31,322] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 1943889
[2023-09-17 15:57:31,348] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 1943890
[2023-09-17 15:57:31,348] [ERROR] [launch.py:434:sigkill_handler] ['/home/lon/anaconda3/envs/viscpm/bin/python', '-u', './finetune/ft_viscpm_chat/train_viscpm_chat.py', '--local_rank=1', '--image_path', '/home/lon/zyx/VisCPM/finetune/ft_viscpm_chat/coco/train2017/', '--text_path', '/home/lon/zyx/VisCPM/finetune/ft_viscpm_chat/llava_instruct_150k_zh.json', '--llm_path', './config/cpm-bee-10b.json', '--exp_name', 'ft_viscpm_chat', '--model_dir', '/home/lon/zyx/model-weights/VisCPM/weight/', '--query_num', '64', '--max_len', '512', '--batch_size', '1', '--save_step', '500', '--epochs', '5', '--deepspeed_config', './finetune/ft_viscpm_chat/config/deepspeed/viscpm_chat_ft.json', '--sft', '--tune_llm', '--tune_vision'] exits with return code = -11

JamesHujy commented 7 months ago

Could you check whether this is caused by a memory size limit?
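One way to follow up on that check is sketched below. It is not from the VisCPM repository: `describe_returncode` and `parse_meminfo` are hypothetical helper names, and the sketch assumes a Linux host and that the launcher reports the code the way Python's `subprocess` does, where a negative value is the negated number of the signal that ended the process. Under that assumption, `-11` in the log corresponds to signal 11 (SIGSEGV on Linux), while a process stopped by the kernel OOM killer would normally show `-9` (SIGKILL). Reading `/proc/meminfo` before launching shows how much host memory is available.

```python
import os
import signal


def describe_returncode(returncode: int) -> str:
    """Translate a subprocess return code into a readable description."""
    if returncode < 0:
        # A negative code means the process was terminated by signal -returncode.
        try:
            return f"killed by {signal.Signals(-returncode).name}"
        except ValueError:
            return f"killed by unknown signal {-returncode}"
    return f"exited with status {returncode}"


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field name: value in kB}."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


if __name__ == "__main__":
    # Return code taken from the log in this issue.
    print(describe_returncode(-11))

    # /proc/meminfo exists only on Linux, so skip quietly elsewhere.
    if os.path.exists("/proc/meminfo"):
        with open("/proc/meminfo") as f:
            mem = parse_meminfo(f.read())
        for field in ("MemTotal", "MemAvailable"):
            if field in mem:
                print(f"{field}: {mem[field] / 1024**2:.1f} GiB")
```

If available host memory turns out to be small relative to the 10B checkpoint being loaded by two ranks at once, that would support the memory-limit explanation; the script itself only reports the numbers and does not diagnose the crash.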