Coobiw / MPP-LLaVA

Personal Project: MPP-Qwen14B & MPP-Qwen-Next (Multimodal Pipeline Parallel based on Qwen-LM). Supports [video/image/multi-image] {sft/conversations}. Don't let poverty limit your imagination! Train your own 8B/14B LLaVA-style MLLM on an RTX 3090/4090 with 24GB.

DeepSpeed's pipeline parallelism requires a uniform seq-length (pad during collation) and a uniform batch size (set the dataloader's `drop_last` to True) #25

Closed Youngluc closed 2 months ago

Youngluc commented 2 months ago

There is surprisingly little material online about DeepSpeed pipeline parallelism... I've hit a problem and would really appreciate your help analyzing it when you have time. Following your code, I wrote training code for another VLM, and during training I run into a strange issue. The setting is: num_stages=4, ngpus_per_node=8, so pp=4 and dp=2. Rank 0 and rank 1 then each get their own batch, B1 and B2, whose sequence lengths are N1 and N2 respectively. Autograd then fails with a shape-mismatch error: the grad has shape N1 while the output has shape N2, as if autograd used rank 1's batch B1 to update rank 2's batch B2.
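For context, the setting described above corresponds roughly to the following kind of setup (a hedged sketch with toy layers and a made-up config, not the actual training code; it would have to be launched with the deepspeed launcher on 8 GPUs):

```python
# On 8 GPUs with num_stages=4, DeepSpeed builds dp = 8 / 4 = 2 data-parallel
# replicas of the 4-stage pipeline, so two ranks each pull their own batches
# (the B1 / B2 mentioned above).
import torch.nn as nn
import deepspeed
from deepspeed.pipe import PipelineModule, LayerSpec

deepspeed.init_distributed()

layers = [LayerSpec(nn.Linear, 512, 512) for _ in range(8)]  # toy stand-ins for the VLM's PipeLayers
model = PipelineModule(
    layers=layers,
    num_stages=4,                   # pp = 4
    partition_method="parameters",  # same method as in the partition log later in this thread
    loss_fn=nn.MSELoss(),
)

ds_config = {
    "train_batch_size": 16,
    "train_micro_batch_size_per_gpu": 4,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}
engine, _, _, _ = deepspeed.initialize(model=model, model_parameters=model.parameters(), config=ds_config)
```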

I really have no idea how to tackle this bug, and I haven't been able to find any related material.

P.S.: I added prints inside the LLM's BlockPipeLayer and found that B1's data is forwarded through all the layers, while B2's data only makes it through the first 20-odd layers and never finishes propagating through the rest. Is some synchronization wrong somewhere?

Youngluc commented 2 months ago

Looking at it more closely, it isn't crossing ranks. Within rank 0, when the tensor is transferred from GPU1's pipeline stage (layers 0-20) to GPU2's stage (layers 21-34), the sequence length changes during the transfer (2369 -> 2262). Why would that happen?

Coobiw commented 2 months ago

Hi, two questions first: 1. Is your sequence preprocessing done the same way as in this repo? My preprocess never produces sequences of unequal length. 2. Could you share the error log and a screenshot of your output?

Youngluc commented 2 months ago

Hi, thanks for the reply.

  1. I use InternVL's sequence preprocessing; the sequence lengths a batch ends up with are guaranteed to be identical, and this is already handled in the preprocess and the collator.
  2. The final error is (one of them; there is another identical one): RuntimeError: Mismatch in shape: grad_output[0] has a shape of torch.Size([4, 2361, 6144]) and output[0] has a shape of torch.Size([4, 2481, 6144]) (a minimal illustration of this error follows below)
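A minimal, self-contained illustration of what this error means, independent of the training code: `torch.autograd.backward` raises exactly this mismatch when the gradient supplied for an output has a different sequence length than the output tensor itself.

```python
import torch

# Hidden size shrunk to 8 to keep the toy cheap; the real tensors are (4, seq_len, 6144).
x = torch.randn(4, 2481, 8, requires_grad=True)
out = x * 2  # any differentiable op, standing in for a pipeline stage's output

# A gradient whose sequence length comes from a *different* micro-batch (2361 vs. 2481):
wrong_grad = torch.ones(4, 2361, 8)

# Raises: RuntimeError: Mismatch in shape: grad_output[0] has a shape of
# torch.Size([4, 2361, 8]) and output[0] has a shape of torch.Size([4, 2481, 8])
torch.autograd.backward(tensors=[out], grad_tensors=[wrong_grad])
```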

比较完整的错误报告(我打印了each layer的input_embeds.shape,还有我在collator中传入的一个tag【内容为input_ids.sum()与random.randint(100,2000)的一个拼接tensor】) `dynamic ViT batch size: 29, images per sample: 7.25, dynamic token length: 2262 [2024-06-21 16:40:50,826] [INFO] [checkpointing.py:539:forward] Activation Checkpointing Information [2024-06-21 16:40:50,826] [INFO] [checkpointing.py:540:forward] ----Partition Activations False, CPU CHECKPOINTING False [2024-06-21 16:40:50,826] [INFO] [checkpointing.py:541:forward] ----contiguous Memory Checkpointing False with None total layers [2024-06-21 16:40:50,826] [INFO] [checkpointing.py:543:forward] ----Synchronization False [2024-06-21 16:40:50,826] [INFO] [checkpointing.py:544:forward] ----Profiling time in checkpointing False torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.0 cuda:0 tensor([712564646, 1668], device='cuda:0') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.1 cuda:0 tensor([712564646, 1668], device='cuda:0') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.2 cuda:0 tensor([712564646, 1668], device='cuda:0') dynamic ViT batch size: 31, images per sample: 7.75, dynamic token length: 2361 torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.3 cuda:0 tensor([712564646, 1668], device='cuda:0') torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.0 cuda:1 torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.4 cuda:0 tensor([757955384, 1668], device='cuda:1') torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.1 cuda:1 tensor([712564646, 1668], device='cuda:0') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.5 cuda:0 tensor([757955384, 1668], device='cuda:1') torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.2 cuda:1 tensor([712564646, 1668], device='cuda:0') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.6 cuda:0 tensor([757955384, 1668], device='cuda:1') torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.3 cuda:1 tensor([712564646, 1668], device='cuda:0') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.7 cuda:0 tensor([757955384, 1668], device='cuda:1') torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.4 cuda:1 tensor([712564646, 1668], device='cuda:0') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.8 cuda:0 tensor([757955384, 1668], device='cuda:1') torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.5 cuda:1 tensor([712564646, 1668], device='cuda:0') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.9 cuda:0 tensor([757955384, 1668], device='cuda:1') torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.6 cuda:1 tensor([712564646, 1668], device='cuda:0') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.10 cuda:0 tensor([757955384, 1668], device='cuda:1') torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.7 cuda:1 tensor([712564646, 1668], device='cuda:0') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.11 cuda:0 tensor([757955384, 1668], device='cuda:1') torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.8 cuda:1 tensor([712564646, 1668], device='cuda:0') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.12 cuda:0 tensor([757955384, 1668], device='cuda:1') torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.9 cuda:1 tensor([712564646, 1668], device='cuda:0') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.13 cuda:0 tensor([757955384, 1668], device='cuda:1') torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.10 cuda:1 tensor([712564646, 1668], device='cuda:0') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.14 cuda:0 tensor([757955384, 1668], device='cuda:1') torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.11 cuda:1 tensor([712564646, 
1668], device='cuda:0') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.15 cuda:0 tensor([757955384, 1668], device='cuda:1') torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.12 cuda:1 tensor([712564646, 1668], device='cuda:0') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.16 cuda:0 tensor([757955384, 1668], device='cuda:1') torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.13 cuda:1 tensor([712564646, 1668], device='cuda:0') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.17 cuda:0 tensor([757955384, 1668], device='cuda:1') torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.14 cuda:1 tensor([712564646, 1668], device='cuda:0') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.18 cuda:0 tensor([757955384, 1668], device='cuda:1') torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.15 cuda:1 tensor([712564646, 1668], device='cuda:0') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.19 cuda:0 tensor([757955384, 1668], device='cuda:1') torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.16 cuda:1 tensor([712564646, 1668], device='cuda:0') [W ProcessGroupNCCL.cpp:1856] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) tensor([757955384, 1668], device='cuda:1') torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.17 cuda:1 tensor([757955384, 1668], device='cuda:1') torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.18 cuda:1 tensor([757955384, 1668], device='cuda:1') torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.19 cuda:1 tensor([757955384, 1668], device='cuda:1') [W ProcessGroupNCCL.cpp:1856] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.20 cuda:2 tensor([712564646, 1668], device='cuda:2') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.21 cuda:2 tensor([712564646, 1668], device='cuda:2') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.22 cuda:2 tensor([712564646, 1668], device='cuda:2') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.23 cuda:2 tensor([712564646, 1668], device='cuda:2') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.24 cuda:2 torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.20 cuda:3 tensor([757955384, 1668], device='cuda:3') tensor([712564646, 1668], device='cuda:2') torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.21 cuda:3 torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.25 cuda:2 tensor([757955384, 1668], device='cuda:3') tensor([712564646, 1668], device='cuda:2') torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.22 cuda:3 torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.26 cuda:2 tensor([757955384, 1668], device='cuda:3') tensor([712564646, 1668], device='cuda:2') torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.23 cuda:3 torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.27 cuda:2 tensor([712564646, 1668], device='cuda:2') tensor([757955384, 1668], device='cuda:3') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.28 cuda:2 torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.24 cuda:3 tensor([712564646, 1668], device='cuda:2') tensor([757955384, 1668], device='cuda:3') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.29 cuda:2 torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.25 cuda:3 tensor([712564646, 1668], device='cuda:2') tensor([757955384, 1668], device='cuda:3') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.30 cuda:2 torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.26 cuda:3 tensor([712564646, 1668], device='cuda:2') tensor([757955384, 1668], 
device='cuda:3') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.31 cuda:2 torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.27 cuda:3 tensor([712564646, 1668], device='cuda:2') tensor([757955384, 1668], device='cuda:3') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.32 cuda:2 torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.28 cuda:3 tensor([712564646, 1668], device='cuda:2') tensor([757955384, 1668], device='cuda:3') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.33 cuda:2 torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.29 cuda:3 tensor([712564646, 1668], device='cuda:2') tensor([757955384, 1668], device='cuda:3') torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.30 cuda:3 tensor([757955384, 1668], device='cuda:3') torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.31 cuda:3 tensor([757955384, 1668], device='cuda:3') torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.32 cuda:3 tensor([757955384, 1668], device='cuda:3') torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.33 cuda:3 tensor([757955384, 1668], device='cuda:3') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.34 cuda:4 tensor([712564646, 1668], device='cuda:4') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.35 cuda:4 tensor([712564646, 1668], device='cuda:4') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.36 cuda:4 tensor([712564646, 1668], device='cuda:4') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.37 cuda:4 tensor([712564646, 1668], device='cuda:4') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.38 cuda:4 tensor([712564646, 1668], device='cuda:4') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.39 cuda:4 torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.34 cuda:5 tensor([757955384, 1668], device='cuda:5') tensor([712564646, 1668], device='cuda:4') torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.35 cuda:5 torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.40 cuda:4 tensor([757955384, 1668], device='cuda:5') tensor([712564646, 1668], device='cuda:4') torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.36 cuda:5 torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.41 cuda:4 tensor([757955384, 1668], device='cuda:5') tensor([712564646, 1668], device='cuda:4') torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.37 cuda:5 torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.42 cuda:4 tensor([757955384, 1668], device='cuda:5') tensor([712564646, 1668], device='cuda:4') torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.38 cuda:5 torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.43 cuda:4 tensor([757955384, 1668], device='cuda:5') tensor([712564646, 1668], device='cuda:4') torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.39 cuda:5 torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.44 cuda:4 tensor([757955384, 1668], device='cuda:5') tensor([712564646, 1668], device='cuda:4') torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.40 cuda:5 torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.45 cuda:4 tensor([757955384, 1668], device='cuda:5') tensor([712564646, 1668], device='cuda:4') torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.41 cuda:5 torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.46 cuda:4 tensor([757955384, 1668], device='cuda:5') tensor([712564646, 1668], device='cuda:4') torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.42 cuda:5 torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.47 cuda:4 tensor([757955384, 1668], device='cuda:5') tensor([712564646, 1668], device='cuda:4') torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.43 cuda:5 tensor([757955384, 1668], device='cuda:5') 
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.44 cuda:5 tensor([757955384, 1668], device='cuda:5') torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.45 cuda:5 tensor([757955384, 1668], device='cuda:5') torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.46 cuda:5 tensor([757955384, 1668], device='cuda:5') torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.47 cuda:5 tensor([757955384, 1668], device='cuda:5') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.48 cuda:6 tensor([712564646, 1668], device='cuda:6') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.49 cuda:6 tensor([712564646, 1668], device='cuda:6') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.50 cuda:6 tensor([712564646, 1668], device='cuda:6') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.51 cuda:6 tensor([712564646, 1668], device='cuda:6') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.52 cuda:6 tensor([712564646, 1668], device='cuda:6') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.53 cuda:6 tensor([712564646, 1668], device='cuda:6') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.54 cuda:6 tensor([712564646, 1668], device='cuda:6') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.55 cuda:6 tensor([712564646, 1668], device='cuda:6') torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.48 cuda:7 tensor([757955384, 1668], device='cuda:7') torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.49 cuda:7 tensor([757955384, 1668], device='cuda:7') torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.50 cuda:7 tensor([757955384, 1668], device='cuda:7') torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.51 cuda:7 tensor([757955384, 1668], device='cuda:7') torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.52 cuda:7 tensor([757955384, 1668], device='cuda:7') torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.53 cuda:7 tensor([757955384, 1668], device='cuda:7') torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.54 cuda:7 tensor([757955384, 1668], device='cuda:7') torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.55 cuda:7 tensor([757955384, 1668], device='cuda:7') [W ProcessGroupNCCL.cpp:1856] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [W ProcessGroupNCCL.cpp:1856] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [2024-06-21 16:41:08,780] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 10.02 | optimizer_gradients: 32.98 | optimizer_step: 89.77 0 batch end 0 batch end 0 batch end 0 batch end 0 batch end 0 batch end 0 batch end 0 batch end 06/21/2024 16:41:11 - INFO - main - {'loss': 2.441850185394287, 'learning_rate': 0.0, 'epoch': 0.0}

Epoch 1: 0%| | 0/15698 [00:26<?, ?it/s, loss=2.44, learning_rate=0, epoch=0]

Epoch 1: 0%| | 1/15698 [00:26<115:52:37, 26.58s/it, loss=2.44, learning_rate=0, epoch=0]dynamic ViT batch size: 31, images per sample: 7.75, dynamic token length: 2481 torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.0 cuda:1 tensor([766560813, 1374], device='cuda:1') torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.1 cuda:1 tensor([766560813, 1374], device='cuda:1') torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.2 cuda:1 tensor([766560813, 1374], device='cuda:1') torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.3 cuda:1 tensor([766560813, 1374], device='cuda:1') torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.4 cuda:1 tensor([766560813, 1374], device='cuda:1') torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.5 cuda:1 tensor([766560813, 1374], device='cuda:1') torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.6 cuda:1 tensor([766560813, 1374], device='cuda:1') torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.7 cuda:1 tensor([766560813, 1374], device='cuda:1') torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.8 cuda:1 tensor([766560813, 1374], device='cuda:1') torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.9 cuda:1 tensor([766560813, 1374], device='cuda:1') torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.10 cuda:1 tensor([766560813, 1374], device='cuda:1') torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.11 cuda:1 tensor([766560813, 1374], device='cuda:1') torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.12 cuda:1 tensor([766560813, 1374], device='cuda:1') torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.13 cuda:1 tensor([766560813, 1374], device='cuda:1') torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.14 cuda:1 tensor([766560813, 1374], device='cuda:1') torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.15 cuda:1 tensor([766560813, 1374], device='cuda:1') torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.16 cuda:1 tensor([766560813, 1374], device='cuda:1') torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.17 cuda:1 tensor([766560813, 1374], device='cuda:1') torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.18 cuda:1 tensor([766560813, 1374], device='cuda:1') torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.19 cuda:1 tensor([766560813, 1374], device='cuda:1') Traceback (most recent call last): File "/mnt/workspace/workgroup/chenghao/video_analysis/internvl_chat_interleaved/internvl/train/intern_vl_chat_finetune_block_pp.py", line 850, in if name == 'main': File "/mnt/workspace/workgroup/chenghao/video_analysis/internvl_chat_interleaved/internvl/train/intern_vl_chat_finetune_block_pp.py", line 830, in main with torch.cuda.amp.autocast(dtype=torch.bfloat16, cache_enabled=False): File "/mnt/workspace/workgroup/miniconda/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 373, in train_batch self._exec_schedule(sched) File "/mnt/workspace/workgroup/miniconda/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 1373, in _exec_schedule self._exec_instr(**cmd.kwargs) File "/mnt/workspace/workgroup/miniconda/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 789, in _exec_backward_pass torch.autograd.backward(tensors=out_tensors, grad_tensors=grad_tensors) File "/mnt/workspace/workgroup/miniconda/envs/internvl/lib/python3.9/site-packages/torch/autograd/init.py", line 244, in backward gradtensors = _make_grads(tensors, gradtensors, is_grads_batched=False) File "/mnt/workspace/workgroup/miniconda/envs/internvl/lib/python3.9/site-packages/torch/autograd/init.py", line 88, 
in _make_grads raise RuntimeError( RuntimeError: Mismatch in shape: grad_output[0] has a shape of torch.Size([4, 2361, 6144]) and output[0] has a shape of torch.Size([4, 2481, 6144]). dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2369 torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.0 cuda:0 tensor([790315515, 1374], device='cuda:0') torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.1 cuda:0 tensor([790315515, 1374], device='cuda:0') torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.2 cuda:0 tensor([790315515, 1374], device='cuda:0') torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.3 cuda:0 tensor([790315515, 1374], device='cuda:0') torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.4 cuda:0 tensor([790315515, 1374], device='cuda:0') torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.5 cuda:0 tensor([790315515, 1374], device='cuda:0') torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.6 cuda:0 tensor([790315515, 1374], device='cuda:0') torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.7 cuda:0 tensor([790315515, 1374], device='cuda:0') torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.8 cuda:0 tensor([790315515, 1374], device='cuda:0') torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.9 cuda:0 tensor([790315515, 1374], device='cuda:0') torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.10 cuda:0 tensor([790315515, 1374], device='cuda:0') torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.11 cuda:0 tensor([790315515, 1374], device='cuda:0') torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.12 cuda:0 tensor([790315515, 1374], device='cuda:0') torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.13 cuda:0 tensor([790315515, 1374], device='cuda:0') torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.14 cuda:0 tensor([790315515, 1374], device='cuda:0') torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.15 cuda:0 tensor([790315515, 1374], device='cuda:0') torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.16 cuda:0 tensor([790315515, 1374], device='cuda:0') torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.17 cuda:0 tensor([790315515, 1374], device='cuda:0') torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.18 cuda:0 tensor([790315515, 1374], device='cuda:0') torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.19 cuda:0 tensor([790315515, 1374], device='cuda:0')

Epoch 1: 0%| | 1/15698 [00:33<146:05:44, 33.51s/it, loss=2.44, learning_rate=0, epoch=0] Traceback (most recent call last): File "/mnt/workspace/workgroup/chenghao/video_analysis/internvl_chat_interleaved/internvl/train/intern_vl_chat_finetune_block_pp.py", line 850, in if name == 'main': File "/mnt/workspace/workgroup/chenghao/video_analysis/internvl_chat_interleaved/internvl/train/intern_vl_chat_finetune_block_pp.py", line 830, in main with torch.cuda.amp.autocast(dtype=torch.bfloat16, cache_enabled=False): File "/mnt/workspace/workgroup/miniconda/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 373, in train_batch self._exec_schedule(sched) File "/mnt/workspace/workgroup/miniconda/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 1373, in _exec_schedule self._exec_instr(**cmd.kwargs) File "/mnt/workspace/workgroup/miniconda/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 789, in _exec_backward_pass torch.autograd.backward(tensors=out_tensors, grad_tensors=grad_tensors) File "/mnt/workspace/workgroup/miniconda/envs/internvl/lib/python3.9/site-packages/torch/autograd/init.py", line 244, in backward gradtensors = _make_grads(tensors, gradtensors, is_grads_batched=False) File "/mnt/workspace/workgroup/miniconda/envs/internvl/lib/python3.9/site-packages/torch/autograd/init.py", line 88, in _make_grads raise RuntimeError( RuntimeError: Mismatch in shape: grad_output[0] has a shape of torch.Size([4, 2262, 6144]) and output[0] has a shape of torch.Size([4, 2369, 6144]). torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.20 cuda:2 tensor([790315515, 1374], device='cuda:2') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.21 cuda:2 tensor([790315515, 1374], device='cuda:2') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.22 cuda:2 tensor([790315515, 1374], device='cuda:2') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.23 cuda:2 tensor([790315515, 1374], device='cuda:2') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.24 cuda:2 tensor([790315515, 1374], device='cuda:2') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.25 cuda:2 tensor([790315515, 1374], device='cuda:2') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.26 cuda:2 tensor([790315515, 1374], device='cuda:2') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.27 cuda:2 tensor([790315515, 1374], device='cuda:2') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.28 cuda:2 tensor([790315515, 1374], device='cuda:2') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.29 cuda:2 tensor([790315515, 1374], device='cuda:2') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.30 cuda:2 tensor([790315515, 1374], device='cuda:2') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.31 cuda:2 tensor([790315515, 1374], device='cuda:2') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.32 cuda:2 tensor([790315515, 1374], device='cuda:2')

Epoch: 0it [02:15, ?it/s] torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.33 cuda:2 tensor([790315515, 1374], device='cuda:2') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.34 cuda:4 tensor([790315515, 1374], device='cuda:4') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.35 cuda:4 tensor([790315515, 1374], device='cuda:4') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.36 cuda:4 tensor([790315515, 1374], device='cuda:4') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.37 cuda:4 tensor([790315515, 1374], device='cuda:4') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.38 cuda:4 tensor([790315515, 1374], device='cuda:4') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.39 cuda:4 tensor([790315515, 1374], device='cuda:4') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.40 cuda:4 tensor([790315515, 1374], device='cuda:4') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.41 cuda:4 tensor([790315515, 1374], device='cuda:4') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.42 cuda:4 tensor([790315515, 1374], device='cuda:4') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.43 cuda:4 tensor([790315515, 1374], device='cuda:4') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.44 cuda:4 tensor([790315515, 1374], device='cuda:4') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.45 cuda:4 tensor([790315515, 1374], device='cuda:4') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.46 cuda:4 tensor([790315515, 1374], device='cuda:4') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.47 cuda:4 tensor([790315515, 1374], device='cuda:4') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.48 cuda:6 tensor([790315515, 1374], device='cuda:6') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.49 cuda:6 tensor([790315515, 1374], device='cuda:6') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.50 cuda:6 tensor([790315515, 1374], device='cuda:6') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.51 cuda:6 tensor([790315515, 1374], device='cuda:6') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.52 cuda:6 tensor([790315515, 1374], device='cuda:6') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.53 cuda:6 tensor([790315515, 1374], device='cuda:6') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.54 cuda:6 tensor([790315515, 1374], device='cuda:6') torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.55 cuda:6 tensor([790315515, 1374], device='cuda:6') [2024-06-21 16:41:30,578] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 225308 [2024-06-21 16:41:30,578] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 225309 [2024-06-21 16:41:31,133] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 225310`

Youngluc commented 2 months ago

One of the model's pipeline partitions:

```
[2024-06-21 16:38:26,030] [INFO] [module.py:375:_partition_layers] Partitioning pipeline stages with method parameters
stage=0 layers=21
     0: TokenizerPipeLayer
     1: InternLMBlockPipeLayer
     2: InternLMBlockPipeLayer
     3: InternLMBlockPipeLayer
     4: InternLMBlockPipeLayer
     5: InternLMBlockPipeLayer
     6: InternLMBlockPipeLayer
     7: InternLMBlockPipeLayer
     8: InternLMBlockPipeLayer
     9: InternLMBlockPipeLayer
    10: InternLMBlockPipeLayer
    11: InternLMBlockPipeLayer
    12: InternLMBlockPipeLayer
    13: InternLMBlockPipeLayer
    14: InternLMBlockPipeLayer
    15: InternLMBlockPipeLayer
    16: InternLMBlockPipeLayer
    17: InternLMBlockPipeLayer
    18: InternLMBlockPipeLayer
    19: InternLMBlockPipeLayer
    20: InternLMBlockPipeLayer
stage=1 layers=14
    21: InternLMBlockPipeLayer
    22: InternLMBlockPipeLayer
    23: InternLMBlockPipeLayer
    24: InternLMBlockPipeLayer
    25: InternLMBlockPipeLayer
    26: InternLMBlockPipeLayer
    27: InternLMBlockPipeLayer
    28: InternLMBlockPipeLayer
    29: InternLMBlockPipeLayer
    30: InternLMBlockPipeLayer
    31: InternLMBlockPipeLayer
    32: InternLMBlockPipeLayer
    33: InternLMBlockPipeLayer
    34: InternLMBlockPipeLayer
stage=2 layers=14
    35: InternLMBlockPipeLayer
    36: InternLMBlockPipeLayer
    37: InternLMBlockPipeLayer
    38: InternLMBlockPipeLayer
    39: InternLMBlockPipeLayer
    40: InternLMBlockPipeLayer
    41: InternLMBlockPipeLayer
    42: InternLMBlockPipeLayer
    43: InternLMBlockPipeLayer
    44: InternLMBlockPipeLayer
    45: InternLMBlockPipeLayer
    46: InternLMBlockPipeLayer
    47: InternLMBlockPipeLayer
    48: InternLMBlockPipeLayer
stage=3 layers=11
    49: InternLMBlockPipeLayer
    50: InternLMBlockPipeLayer
    51: InternLMBlockPipeLayer
    52: InternLMBlockPipeLayer
    53: InternLMBlockPipeLayer
    54: InternLMBlockPipeLayer
    55: InternLMBlockPipeLayer
    56: InternLMBlockPipeLayer
    57: FLNPipeLayer
    58: LMPipeLayer
    59: LossPipeLayer
```

You can see that the two sequences with lengths 2369 and 2481 seem to get blocked once they pass stage 0 (layer 19), and the correspondence between grad_tensor and output also looks rather scrambled...

Coobiw commented 2 months ago

The log also contains output that tells you which rank each stage is placed on; could you share that too?

Coobiw commented 2 months ago

My guess is that your split is 1357 / 0246. The out tensor is generally the longer one (which is just the batch's max_length), while the grad has the correct length. I'd suggest checking your collator and every block's input and output: make sure they follow DeepSpeed's pipeline module protocol and transmit everything as tensors wherever possible. Also note that the pipeline model takes a labels item as part of its input (in tuple form); check that too. For example, the collator's return format should look roughly like this function:

def collate_fn_minigpt4qwen(batch,preprocess_func):
    image_list, conversation_list = [], []

    for sample in batch:
        image_list.append(sample["image"])
        conversation_list.append(sample["conversations"])

    new_batch = \
        {
            "image": torch.stack(image_list, dim=0),
            "conversations": conversation_list,
        }
    data_dict = preprocess_func(new_batch['conversations'])

    return ((new_batch['image'], data_dict['input_ids'],data_dict['labels'],data_dict['attention_mask']),
                data_dict['labels']
        ) # the return format here is Tuple[Tuple[Tensor], Tensor]

You can also look at this blog post of mine for some of the pitfalls I ran into: https://zhuanlan.zhihu.com/p/684462477
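As a rough illustration of that protocol (a toy sketch with made-up tensor sizes, not this repo's actual preprocess), each element the dataloader yields should look like:

```python
import torch

# Toy stand-ins for the real preprocess outputs (shapes are made up).
image = torch.randn(2, 3, 224, 224)
input_ids = torch.randint(0, 32000, (2, 128))
labels = input_ids.clone()
attention_mask = torch.ones_like(input_ids)

batch = ((image, input_ids, labels, attention_mask), labels)  # Tuple[Tuple[Tensor, ...], Tensor]

inputs, label_item = batch
# `inputs` is what the first pipeline stage receives; `label_item` is routed to the loss layer.
# Keeping everything as plain tensors lets DeepSpeed ship activations between stages via p2p.
assert all(isinstance(t, torch.Tensor) for t in inputs)
```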

Youngluc commented 2 months ago

Right, the stage split is 1357 / 0246, and the collator returns Tuple[Tuple[torch.Tensor], Any] exactly as described in your blog post. I dug into it carefully. What I see in testing is that for the batches that fail, the activation goes wrong while being transferred from GPU0 (or GPU1) to GPU2 (or GPU3): it takes on the sequence shape of that rank's batch from the previous step. For example, on rank 1 (GPUs 0/2/4/6), the first batch's input_embeds has shape (4, 2262, 6144) and the second batch's has shape (4, 2361, 6144). The first batch's forward and backward are fine. The second batch is also fine on GPU0, where every layer's hidden_states has shape (4, 2361, 6144), but once it reaches the layers on GPU2 the shapes all become (4, 2262, 6144), and then it errors out with: RuntimeError: Mismatch in shape: grad_output[0] has a shape of torch.Size([4, 2262, 6144]) and output[0] has a shape of torch.Size([4, 2361, 6144]). That's what actually happens, and I'm puzzled as to why. I changed the pipeline degree (pp=2/4/8) and it's the same story in every case: the first step is fine, and things go wrong from the second step on. Could you help analyze why? Thanks a lot!!! (If it takes up too much of your time, no worries.)
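For what it's worth, one mechanism that would be consistent with this symptom (a hedged pseudo-sketch of the idea only, not DeepSpeed's actual code) is a receiving stage that allocates its activation buffer from the first micro-batch's shape metadata and then keeps reusing it:

```python
import torch

class StageReceiver:
    """Illustrative only: a receiver that caches its buffer from the first shape it sees."""

    def __init__(self):
        self.recv_buf = None

    def recv(self, shape_from_sender):
        # If shape metadata is only exchanged for the first micro-batch, every later
        # receive reuses the old buffer, so the next stage keeps "seeing" the first
        # micro-batch's sequence length.
        if self.recv_buf is None:
            self.recv_buf = torch.empty(shape_from_sender)
        return self.recv_buf

# Hidden size shrunk from 6144 to 8 just to keep the toy example cheap.
r = StageReceiver()
print(r.recv((4, 2262, 8)).shape)  # torch.Size([4, 2262, 8])
print(r.recv((4, 2361, 8)).shape)  # still torch.Size([4, 2262, 8]) -> shape "stuck" at step 1
```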

Youngluc commented 2 months ago

The first training step seems fine; the problem starts at the second step. The second step's batch only propagates correctly on the GPUs holding stage 0; once it moves from stage 0 to the GPUs holding stage 1, the hidden_states shape turns into the shape from step 1. Could you tell me which deepspeed and torch versions you use? I still can't figure out where the real root cause is.

Coobiw commented 2 months ago

Could I take a look at your block code? Or DM me your contact info on Zhihu and we can look at it together when we both have time.

Youngluc commented 2 months ago

Sure, I'll DM you on Zhihu! The code is on my work laptop, so I can't copy-paste it directly 😭

Coobiw commented 2 months ago

Solved.

We found that DeepSpeed pipeline parallelism needs the same seq_length within a mini-batch (i.e. across all of its micro-batches) and the same batch size at every step (so `drop_last` should be set to True). A sketch of what this means for the data pipeline follows below.
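A minimal sketch of that idea (hypothetical helper names; the pad id, ignore index, and fixed max length are assumptions rather than this repo's actual values, and samples are assumed to already fit within MAX_SEQ_LEN):

```python
import torch
from torch.utils.data import DataLoader

MAX_SEQ_LEN = 2560   # assumed fixed cap so every micro-batch has the same seq_length
PAD_TOKEN_ID = 0     # assumed pad id
IGNORE_INDEX = -100  # standard ignore index for the LM loss

def pad_sample(input_ids, labels, attention_mask):
    """Right-pad one sample (1-D tensors) to MAX_SEQ_LEN."""
    pad = MAX_SEQ_LEN - input_ids.size(0)
    return (
        torch.cat([input_ids, input_ids.new_full((pad,), PAD_TOKEN_ID)]),
        torch.cat([labels, labels.new_full((pad,), IGNORE_INDEX)]),
        torch.cat([attention_mask, attention_mask.new_zeros(pad)]),
    )

def collate_fixed_length(batch):
    padded = [pad_sample(s["input_ids"], s["labels"], s["attention_mask"]) for s in batch]
    input_ids, labels, attention_mask = (torch.stack(t) for t in zip(*padded))
    images = torch.stack([s["image"] for s in batch])
    return ((images, input_ids, labels, attention_mask), labels)

# drop_last=True keeps every step at the full batch size, so the pipeline engine never
# sees a smaller trailing batch:
# loader = DataLoader(dataset, batch_size=4, collate_fn=collate_fixed_length, drop_last=True)
```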

This is a good discovery. I'll close but pin this issue.