MaverickRen / PixelLM

PixelLM is an effective and efficient LMM for pixel-level reasoning and understanding. PixelLM is accepted by CVPR 2024.

Bug in `overlap_loss` #24

Open zhixuanli opened 2 months ago

zhixuanli commented 2 months ago

Dear authors,

During training, the following error occurred in the first iteration and interrupted the run. Could you please take a look and offer some suggestions? Thanks!

```
Token indices sequence length is longer than the specified maximum sequence length for this model (851 > 512). Running this sequence through the model will result in indexing errors
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/lizhixuan/reasoning/PixelLM_Learning/train_ds.py", line 972, in <module>
[rank0]:     main(sys.argv[1:])
[rank0]:   File "/home/lizhixuan/reasoning/PixelLM_Learning/train_ds.py", line 520, in main
[rank0]:     train_iter = train(
[rank0]:   File "/home/lizhixuan/reasoning/PixelLM_Learning/train_ds.py", line 616, in train
[rank0]:     output_dict = model(**input_dict)
[rank0]:   File "/home/lizhixuan/anaconda3/envs/pixellm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/lizhixuan/anaconda3/envs/pixellm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/lizhixuan/anaconda3/envs/pixellm/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:   File "/home/lizhixuan/anaconda3/envs/pixellm/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1899, in forward
[rank0]:     loss = self.module(*inputs, **kwargs)
[rank0]:   File "/home/lizhixuan/anaconda3/envs/pixellm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/lizhixuan/anaconda3/envs/pixellm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/lizhixuan/anaconda3/envs/pixellm/lib/python3.9/site-packages/peft/peft_model.py", line 922, in forward
[rank0]:     return self.base_model(
[rank0]:   File "/home/lizhixuan/anaconda3/envs/pixellm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/lizhixuan/anaconda3/envs/pixellm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/lizhixuan/reasoning/PixelLM_Learning/model/PixelLM.py", line 323, in forward
[rank0]:     return self.model_forward(**kwargs)
[rank0]:   File "/home/lizhixuan/reasoning/PixelLM_Learning/model/PixelLM.py", line 602, in model_forward
[rank0]:     #         overlap_loss(pred_mask, gt_mask, gt_mask.shape[0], batch_seg_token_count)
[rank0]:   File "/home/lizhixuan/reasoning/PixelLM_Learning/model/PixelLM.py", line 78, in overlap_loss
[rank0]:     assert end_i <= len(targets), (targets.shape, batch_seg_token_count)
[rank0]: AssertionError: (torch.Size([13, 375, 500]), tensor([ 0, 16, 32, 52], device='cuda:0'))
```
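For context, here is a minimal, hypothetical sketch of how an assertion of this shape can fire. It assumes `overlap_loss` slices the ground-truth masks by cumulative seg-token counts per batch sample; the actual PixelLM implementation may differ, so this is only an illustration of the reported numbers, not the author's code.

```python
import torch

def check_overlap_loss_indices(targets: torch.Tensor, batch_seg_token_count: torch.Tensor):
    """Hypothetical re-creation of the failing index check.

    Assumes `targets` is (num_gt_masks, H, W) and `batch_seg_token_count` holds
    cumulative seg-token counts per batch sample; the real PixelLM code may differ.
    """
    for i in range(len(batch_seg_token_count) - 1):
        start_i = batch_seg_token_count[i].item()
        end_i = batch_seg_token_count[i + 1].item()
        # Mirrors the assertion reported at model/PixelLM.py line 78: the slice end
        # must not exceed the number of ground-truth masks.
        assert end_i <= len(targets), (targets.shape, batch_seg_token_count)
        sample_masks = targets[start_i:end_i]  # masks attributed to sample i


# Values taken from the reported AssertionError: 13 GT masks, counts up to 52.
targets = torch.zeros(13, 375, 500)
batch_seg_token_count = torch.tensor([0, 16, 32, 52])
check_overlap_loss_indices(targets, batch_seg_token_count)  # raises already on the first sample (16 > 13)
```

With the reported values the check fails on the very first sample (16 > 13), which suggests the seg-token counts and the number of ground-truth masks have fallen out of sync.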

Thank you!

zhixuanli commented 2 months ago

Even after I tried disabling `overlap_loss`, this bug still occurs.