Open LianghuiGuo opened 11 months ago
I built my own dataset following the LLaVA format (about 140k samples) and launched finetuning exactly as the official instructions describe, but the run hangs as soon as it enters trainer.train(). Nothing further appears in stdout or stderr and no error is raised. What could be wrong?
Output in stdout:
[2023-12-04 15:00:43,681] [WARNING] [partition_parameters.py:823:_post_init_method] param `cls_token` in MplugOwlVisionEmbeddings not on GPU so was not broadcasted from rank 0
[2023-12-04 15:00:43,747] [WARNING] [partition_parameters.py:823:_post_init_method] param `position_embedding` in MplugOwlVisionEmbeddings not on GPU so was not broadcasted from rank 0
[2023-12-04 15:00:43,985] [WARNING] [partition_parameters.py:823:_post_init_method] param `query_embeds` in MplugOwlVisualAbstractorModel not on GPU so was not broadcasted from rank 0
[2023-12-04 15:00:44,003] [WARNING] [partition_parameters.py:823:_post_init_method] param `vit_eos` in MplugOwlVisualAbstractorModel not on GPU so was not broadcasted from rank 0
[2023-12-04 15:00:44,006] [INFO] [partition_parameters.py:454:__exit__] finished initializing model with 8.20B parameters
[2023-12-04 15:00:45,618] [INFO] [AiScheduler#28] [executor.py:248] execute_fusion running, len(finish_tasks): [0]
loading dataset file : DataArguments(data_path='./data/llava_v1_5_mix665k.json', lazy_preprocess=True, is_multimodal=True, image_folder='', image_aspect_ratio='pad', image_grid_pinpoints=None)
Formatting inputs...Skip in lazy mode
Time to load utils op: 11.517198085784912 seconds
Parameter Offload: Total persistent parameters: 996352 in 418 params
Time to load utils op: 0.0002448558807373047 seconds
Output in stderr:
Loading checkpoint shards: 100%|██████████| 33/33 [03:24<00:00, 6.36s/it]
Loading checkpoint shards: 100%|██████████| 33/33 [03:24<00:00, 6.21s/it]
Some weights of MPLUGOwl2LlamaForCausalLM were not initialized from the model checkpoint at /data/oss_bucket_0/mplug_owl2 and are newly initialized: ['model.visual_abstractor.encoder.layers.1.crossattention.attention.k_pos_embed', 'model.visual_abstractor.encoder.layers.4.crossattention.attention.k_pos_embed', 'model.visual_abstractor.encoder.layers.3.crossattention.attention.q_pos_embed', 'model.visual_abstractor.encoder.layers.0.crossattention.attention.k_pos_embed', 'model.visual_abstractor.encoder.layers.3.crossattention.attention.k_pos_embed', 'model.visual_abstractor.encoder.layers.4.crossattention.attention.q_pos_embed', 'model.visual_abstractor.encoder.layers.2.crossattention.attention.k_pos_embed', 'model.visual_abstractor.encoder.layers.1.crossattention.attention.q_pos_embed', 'model.visual_abstractor.encoder.layers.2.crossattention.attention.q_pos_embed', 'model.visual_abstractor.encoder.layers.0.crossattention.attention.q_pos_embed', 'model.visual_abstractor.encoder.layers.5.crossattention.attention.q_pos_embed', 'model.visual_abstractor.encoder.layers.5.crossattention.attention.k_pos_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
ic| training_args.tune_visual_abstractor: True
ic| training_args.freeze_vision_model: True
ic| len(optimizer_grouped_parameters[0]['params']): 464
    len(optimizer_grouped_parameters[1]['params']): 91
Using :/usr/local/ninja as PyTorch extensions root...
Loading extension module utils...
Using :/usr/local/ninja as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
0%| | 0/4412 [00:00<?, ?it/s]
I waited 3-5 hours and it was still at 0%. Running on 4x A10 GPUs.
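In case it helps with debugging a hang like this: none of the following is from the original scripts, but one way to see where each stuck rank is spinning is `py-spy dump --pid <rank PID>`, or registering Python's built-in faulthandler early in the training entry point so a hung run can be asked for stack traces on demand. A minimal sketch:

```python
# Hypothetical debugging snippet (not part of the original training code):
# place near the top of the entry point, before trainer.train() is called.
import faulthandler
import signal

# Dump the traceback of every thread to stderr when the process receives
# SIGUSR1, e.g. via `kill -USR1 <pid>` on a hung rank.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# Also dump automatically whenever 30 minutes pass without progress;
# repeat=True re-arms the timer so long hangs keep producing dumps.
faulthandler.dump_traceback_later(timeout=1800, repeat=True)
```

The dumped stacks usually make it obvious whether the ranks are blocked in a collective op, in the dataloader, or in dataset preprocessing.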
Mine got stuck halfway through training. I'm on a 4090.
> Mine got stuck halfway through training. I'm on a 4090.
Bro, does your output look the same as this?
I waited overnight and it was still at 0%, on 4x A10. Both full-parameter finetuning and LoRA hang the same way.
same problem...
Solved it. In my case the cause was the image paths in the dataset; once the paths were fixed it ran fine.
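For anyone else hitting this, a minimal sanity check before launching training, sketched under the assumption of the LLaVA-style JSON layout (a list of records whose optional "image" field is resolved relative to --image_folder); the file names here are illustrative:

```python
# check_image_paths.py -- illustrative sketch, not from the repo: verify that
# every image referenced in a LLaVA-style dataset JSON actually exists, since
# a missing file can stall the dataloader without raising a visible error.
import json
import os

DATA_PATH = "./data/llava_v1_5_mix665k.json"  # your finetune JSON
IMAGE_FOLDER = ""                             # same value as --image_folder

with open(DATA_PATH, "r", encoding="utf-8") as f:
    records = json.load(f)

missing = []
for rec in records:
    if "image" in rec:  # text-only records carry no image field
        path = os.path.join(IMAGE_FOLDER, rec["image"])
        if not os.path.exists(path):
            missing.append(path)

print(f"checked {len(records)} records, {len(missing)} missing images")
for path in missing[:20]:  # print a sample of the offenders
    print("missing:", path)
```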
> Solved it. In my case the cause was the image paths in the dataset; once the paths were fixed it ran fine.
Thanks, I'll give it a try.