X-PLUG / mPLUG-Owl

mPLUG-Owl: The Powerful Multi-modal Large Language Model Family
https://www.modelscope.cn/studios/damo/mPLUG-Owl
MIT License
2.31k stars · 176 forks

mPLUG-Owl2 fine-tuning: training hangs with no output #188

Open LianghuiGuo opened 11 months ago

LianghuiGuo commented 11 months ago

I built my own dataset following the LLaVA format (about 140k samples) and ran fine-tuning as described in the official instructions. The run hangs once it enters trainer.train(): stdout and stderr show no further output and no error is raised. What might be wrong?

Output in stdout:

[2023-12-04 15:00:43,681] [WARNING] [partition_parameters.py:823:_post_init_method] param `cls_token` in MplugOwlVisionEmbeddings not on GPU so was not broadcasted from rank 0
[2023-12-04 15:00:43,747] [WARNING] [partition_parameters.py:823:_post_init_method] param `position_embedding` in MplugOwlVisionEmbeddings not on GPU so was not broadcasted from rank 0
[2023-12-04 15:00:43,985] [WARNING] [partition_parameters.py:823:_post_init_method] param `query_embeds` in MplugOwlVisualAbstractorModel not on GPU so was not broadcasted from rank 0
[2023-12-04 15:00:44,003] [WARNING] [partition_parameters.py:823:_post_init_method] param `vit_eos` in MplugOwlVisualAbstractorModel not on GPU so was not broadcasted from rank 0
[2023-12-04 15:00:44,006] [INFO] [partition_parameters.py:454:__exit__] finished initializing model with 8.20B parameters
[2023-12-04 15:00:45,618] [INFO] [AiScheduler#28] [executor.py:248] execute_fusion  running, len(finish_tasks): [0]
loading dataset file :  DataArguments(data_path='./data/llava_v1_5_mix665k.json', lazy_preprocess=True, is_multimodal=True, image_folder='', image_aspect_ratio='pad', image_grid_pinpoints=None)
Formatting inputs...Skip in lazy mode
Time to load utils op: 11.517198085784912 seconds
Parameter Offload: Total persistent parameters: 996352 in 418 params
Time to load utils op: 0.0002448558807373047 seconds

Output in stderr:

Loading checkpoint shards: 100%|██████████| 33/33 [03:24<00:00,  6.36s/it]
Loading checkpoint shards: 100%|██████████| 33/33 [03:24<00:00,  6.21s/it]
Some weights of MPLUGOwl2LlamaForCausalLM were not initialized from the model checkpoint at /data/oss_bucket_0/mplug_owl2 and are newly initialized: ['model.visual_abstractor.encoder.layers.1.crossattention.attention.k_pos_embed', 'model.visual_abstractor.encoder.layers.4.crossattention.attention.k_pos_embed', 'model.visual_abstractor.encoder.layers.3.crossattention.attention.q_pos_embed', 'model.visual_abstractor.encoder.layers.0.crossattention.attention.k_pos_embed', 'model.visual_abstractor.encoder.layers.3.crossattention.attention.k_pos_embed', 'model.visual_abstractor.encoder.layers.4.crossattention.attention.q_pos_embed', 'model.visual_abstractor.encoder.layers.2.crossattention.attention.k_pos_embed', 'model.visual_abstractor.encoder.layers.1.crossattention.attention.q_pos_embed', 'model.visual_abstractor.encoder.layers.2.crossattention.attention.q_pos_embed', 'model.visual_abstractor.encoder.layers.0.crossattention.attention.q_pos_embed', 'model.visual_abstractor.encoder.layers.5.crossattention.attention.q_pos_embed', 'model.visual_abstractor.encoder.layers.5.crossattention.attention.k_pos_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
ic| training_args.tune_visual_abstractor: True
ic| training_args.freeze_vision_model: True
ic| len(optimizer_grouped_parameters[0]['params']): 464
    len(optimizer_grouped_parameters[1]['params']): 91
Using :/usr/local/ninja as PyTorch extensions root...
Loading extension module utils...
Using :/usr/local/ninja as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...

  0%|          | 0/4412 [00:00<?, ?it/s]

I waited a whole night and it is still at 0%. This is on 4x A10 GPUs, and both full-parameter fine-tuning and LoRA behave the same way.
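
When a run stalls like this with nothing new in stdout or stderr, the first useful step is to see where each process is actually blocked. Below is a minimal sketch of one way to do that with the standard library, assuming it is added near the top of the training entry point before trainer.train() is called (`py-spy dump --pid <PID>` is an alternative that needs no code change):

```python
# Minimal sketch: make a silent hang inspectable.
import faulthandler
import signal

# Dump tracebacks of all threads when the process receives SIGUSR1,
# so a stuck run can be probed from a shell with `kill -USR1 <pid>`.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# Also dump automatically every 30 minutes; the timer is not progress-
# aware, so cancel it with faulthandler.cancel_dump_traceback_later()
# once training is confirmed to be advancing.
faulthandler.dump_traceback_later(timeout=1800, repeat=True)
```

On a multi-GPU run, a dump from every rank shows at a glance whether all ranks are blocked in the same collective (a communication deadlock) or one rank is stuck inside data loading, which is what a bad image path tends to look like.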

waltonfuture commented 11 months ago

Quoted from the original post: "I built my own dataset following the LLaVA format (about 140k samples)... the run hangs once it enters trainer.train(), with nothing in stdout or stderr... Waited 3-5 hours, still at 0%, on 4x A10 GPUs."

For me it hung halfway through training. I'm on 4090s.
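
A hang that appears only partway through multi-GPU training often points to a collective-communication deadlock: one rank stops stepping (for instance, its dataloader worker died on a bad sample) while the remaining ranks wait forever in an all-reduce. Below is a minimal sketch of standard PyTorch/NCCL settings that make such a hang fail loudly instead; nothing here is mPLUG-Owl-specific, and under a DeepSpeed launcher the process group is normally initialized for you, so the explicit call is illustrative:

```python
# Minimal sketch: turn silent NCCL deadlocks into visible errors.
# These are standard PyTorch/NCCL knobs; set the env vars before the
# process group is created (under DeepSpeed, before deepspeed.initialize).
import datetime
import os

import torch.distributed as dist

os.environ["NCCL_DEBUG"] = "INFO"              # per-rank NCCL activity logs
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"  # abort stuck collectives

# With a finite timeout, a rank that waits too long in a collective
# raises an error with a traceback instead of blocking forever.
dist.init_process_group(backend="nccl", timeout=datetime.timedelta(minutes=10))
```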

LianghuiGuo commented 11 months ago

Quoted from the original post and waltonfuture's reply:

"I built my own dataset following the LLaVA format (about 140k samples)... the run hangs once it enters trainer.train()... Waited 3-5 hours, still at 0%, on 4x A10 GPUs."

"For me it hung halfway through training. I'm on 4090s."

Bro, does your output look the same as this?

ForeverTTE commented 10 months ago

Quoted from the original post: "I built my own dataset following the LLaVA format (about 140k samples)... the run hangs once it enters trainer.train()... Waited a whole night, still at 0%, on 4x A10 GPUs; both full-parameter fine-tuning and LoRA behave the same way."

same problem...

LianghuiGuo commented 10 months ago

Solved. In my case the cause was the image paths in the dataset. Once I fixed the paths, everything worked.
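
Given that the root cause was broken image paths, a cheap pre-flight check catches this class of problem before any GPUs are involved: with lazy_preprocess=True the dataset JSON is only materialized per item inside the dataloader, so a bad path may not surface until deep into the first pass. Below is a minimal sketch, assuming the LLaVA-style layout visible in the logs above (a JSON list of records whose optional "image" field is a path resolved against image_folder); check_image_paths is a hypothetical helper name:

```python
# Minimal sketch: verify that every image referenced by a LLaVA-style
# JSON dataset exists before launching training. Assumes a JSON list of
# records whose optional "image" field is relative to image_folder.
import json
import os

def check_image_paths(data_path: str, image_folder: str) -> list:
    with open(data_path, "r", encoding="utf-8") as f:
        records = json.load(f)
    missing = []
    for rec in records:
        rel = rec.get("image")
        if rel and not os.path.isfile(os.path.join(image_folder, rel)):
            missing.append(rel)
    return missing

if __name__ == "__main__":
    # Same values as the DataArguments shown in the logs above.
    missing = check_image_paths("./data/llava_v1_5_mix665k.json", "")
    print(f"{len(missing)} missing image files")
    for path in missing[:20]:  # print a sample, not the whole list
        print("  missing:", path)
```

Running a check like this against the data_path and image_folder from DataArguments before launching the trainer would have flagged the missing files immediately.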

ForeverTTE commented 10 months ago

Quoted from LianghuiGuo's reply: "Solved. In my case the cause was the image paths in the dataset. Once I fixed the paths, everything worked."

Thanks, I'll give it a try.