Dear authors,

During training, the following error occurred in the very first iteration and interrupted the run. Could you please take a look and give me some suggestions? Thanks!
Token indices sequence length is longer than the specified maximum sequence length for this model (851 > 512). Running this sequence through the model will result in indexing errors
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/lizhixuan/reasoning/PixelLM_Learning/train_ds.py", line 972, in <module>
[rank0]: main(sys.argv[1:])
[rank0]: File "/home/lizhixuan/reasoning/PixelLM_Learning/train_ds.py", line 520, in main
[rank0]: train_iter = train(
[rank0]: File "/home/lizhixuan/reasoning/PixelLM_Learning/train_ds.py", line 616, in train
[rank0]: output_dict = model(**input_dict)
[rank0]: File "/home/lizhixuan/anaconda3/envs/pixellm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/home/lizhixuan/anaconda3/envs/pixellm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/lizhixuan/anaconda3/envs/pixellm/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank0]: ret_val = func(*args, **kwargs)
[rank0]: File "/home/lizhixuan/anaconda3/envs/pixellm/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1899, in forward
[rank0]: loss = self.module(*inputs, **kwargs)
[rank0]: File "/home/lizhixuan/anaconda3/envs/pixellm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/home/lizhixuan/anaconda3/envs/pixellm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/lizhixuan/anaconda3/envs/pixellm/lib/python3.9/site-packages/peft/peft_model.py", line 922, in forward
[rank0]: return self.base_model(
[rank0]: File "/home/lizhixuan/anaconda3/envs/pixellm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/home/lizhixuan/anaconda3/envs/pixellm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/lizhixuan/reasoning/PixelLM_Learning/model/PixelLM.py", line 323, in forward
[rank0]: return self.model_forward(**kwargs)
[rank0]: File "/home/lizhixuan/reasoning/PixelLM_Learning/model/PixelLM.py", line 602, in model_forward
[rank0]: # overlap_loss(pred_mask, gt_mask, gt_mask.shape[0], batch_seg_token_count)
[rank0]: File "/home/lizhixuan/reasoning/PixelLM_Learning/model/PixelLM.py", line 78, in overlap_loss
[rank0]: assert end_i <= len(targets), (targets.shape, batch_seg_token_count)
[rank0]: AssertionError: (torch.Size([13, 375, 500]), tensor([ 0, 16, 32, 52], device='cuda:0'))
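From the assertion values, it looks like `batch_seg_token_count` (which seems to hold cumulative [SEG] token counts, ending at 52) grows past the number of ground-truth masks in `targets` (13 masks of size 375x500), so the check `end_i <= len(targets)` fails. Below is a minimal sketch of how I read the check, using the values copied from the error message; the slicing logic is only my assumption, not the actual `overlap_loss` code:

```python
import torch

# Values taken from the AssertionError message above.
targets = torch.zeros(13, 375, 500)                    # 13 ground-truth masks in this batch
batch_seg_token_count = torch.tensor([0, 16, 32, 52])  # cumulative [SEG] token counts (my assumption)

# overlap_loss appears to slice `targets` with these cumulative offsets,
# so every end offset must stay within the number of masks.
for i in range(len(batch_seg_token_count) - 1):
    start_i = batch_seg_token_count[i].item()
    end_i = batch_seg_token_count[i + 1].item()
    print(f"sample {i}: targets[{start_i}:{end_i}], len(targets) = {len(targets)}")
    # Here even the first end offset (16) already violates end_i <= len(targets) == 13.
```

Does this mismatch point to a data-preparation problem on my side (more [SEG] tokens than ground-truth masks in a batch), or to something in the loss implementation?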
Thank you!