IDEA-Research / detrex

detrex is a research platform for DETR-based object detection, segmentation, pose estimation and other visual recognition tasks.
https://detrex.readthedocs.io/en/latest/
Apache License 2.0

`dino_eva_02_vitdet_*_1024_*` configs throw tensor shape mismatch error #356

Open dgcnz opened 2 months ago

dgcnz commented 2 months ago

Description

I tested all the dino_eva_02_vitdet models from here, and the configs with image_size=1024 fail with a tensor shape mismatch (see the logs below).

Used this image from the installation tutorial.

Working:

Not working:

Looking at the logs, the culprit seems to be this snippet of code: https://github.com/IDEA-Research/detrex/blob/03e02cb3182112569724092fc1c6935b61d54141/projects/dino_eva/modeling/dino.py#L529-L532

Log info example

Command

python demo/demo.py --config-file projects/dino_eva/configs/dino-eva-02/dino_eva_02_vitdet_b_4attn_1024_lrd0p7_4scale_12ep.py \
                    --input idea.jpg \
                    --output visualized_results_eva_no_window_gpu.jpg \
                    --opts train.init_checkpoint="dino_eva_02_in21k_pretrain_vitdet_b_4attn_1024_lrd0p7_4scale_12ep.pth"

Logs

[07/10 11:00:20 detectron2]: Arguments: Namespace(config_file='projects/dino_eva/configs/dino-eva-02/dino_eva_02_vitdet_b_4attn_1024_lrd0p7_4scale_12ep.py', webcam=False, video_input=None, input=['idea.jpg'], output='visualized_results_eva_no_window_gpu.jpg', min_size_test=800, max_size_test=1333, img_format='RGB', metadata_dataset='coco_2017_val', confidence_threshold=0.5, opts=['train.init_checkpoint=dino_eva_02_in21k_pretrain_vitdet_b_4attn_1024_lrd0p7_4scale_12ep.pth'])
======== shape of rope freq torch.Size([256, 64]) ========
======== shape of rope freq torch.Size([4096, 64]) ========
[07/10 11:00:24 d2.checkpoint.detection_checkpoint]: [DetectionCheckpointer] Loading from dino_eva_02_in21k_pretrain_vitdet_b_4attn_1024_lrd0p7_4scale_12ep.pth ...
[07/10 11:00:24 fvcore.common.checkpoint]: [Checkpointer] Loading from dino_eva_02_in21k_pretrain_vitdet_b_4attn_1024_lrd0p7_4scale_12ep.pth ...
  0% 0/1 [00:00<?, ?it/s]/content/detrex/./projects/dino_eva/modeling/dino.py:530: UserWarning: square_size=1024, is smaller than max_size=1199 in batch
  warnings.warn("square_size={}, is smaller than max_size={} in batch".format(
  0% 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/content/detrex/demo/demo.py", line 141, in <module>
    predictions, visualized_output = demo.run_on_image(img, args.confidence_threshold)
  File "/content/detrex/./demo/predictors.py", line 80, in run_on_image
    predictions = self.predictor(image)
  File "/content/detrex/./demo/predictors.py", line 207, in __call__
    predictions = self.model([inputs])[0]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/content/detrex/./projects/dino_eva/modeling/dino.py", line 198, in forward
    features = self.backbone(images.tensor)  # output feature dict
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/content/detrex/./detrex/modeling/backbone/eva.py", line 583, in forward
    bottom_up_features = self.net(x)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/content/detrex/./detrex/modeling/backbone/eva_02.py", line 431, in forward
    x = blk(x)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/content/detrex/./detrex/modeling/backbone/eva_02.py", line 275, in forward
    x = self.attn(x)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/content/detrex/./detrex/modeling/backbone/eva_02.py", line 117, in forward
    q = self.rope(q).type_as(v)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/content/detrex/./detrex/modeling/backbone/eva_02_utils.py", line 349, in forward
    return  t * self.freqs_cos + rotate_half(t) * self.freqs_sin
RuntimeError: The size of tensor a (5476) must match the size of tensor b (4096) at non-singleton dimension 2
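
The two sizes in the error look consistent with the padding warning earlier in the log. Below is a minimal sketch of the arithmetic. It rests on two assumptions that are not confirmed by the source: a patch size of 16 (inferred from the 4096-entry rope table for a 1024-pixel input) and a batch padded to a square of side max_size=1199. `num_patch_tokens` is a hypothetical helper written for illustration, not part of detrex.

```python
def num_patch_tokens(side: int, patch_size: int = 16) -> int:
    """Tokens produced by a non-overlapping patch embedding on a square input.

    patch_size=16 is an assumption inferred from the logs, not read from the config.
    """
    per_side = side // patch_size  # a partial patch at the border is dropped
    return per_side * per_side


# Rope table built for the configured image_size=1024: 64 x 64 positions.
print(num_patch_tokens(1024))  # 4096

# Input padded to a 1199-pixel square (assumed from the warning): 74 x 74 tokens.
print(num_patch_tokens(1199))  # 5476
```

Under these assumptions, the backbone receives 74 x 74 tokens, while the rotary embedding table only covers 64 x 64 positions, which matches the 5476 vs. 4096 mismatch reported at dimension 2.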

dgcnz commented 2 months ago

Commenting out line 532 in the snippet above silences the error, but the resulting bounding boxes have a vertical offset:

[Image: visualized_results_eva_no_window_gpu]

dgcnz commented 2 months ago

Okay, it seems that padding is being applied somewhere in the pipeline and it shifts the predictions. If I manually resize the image to square dimensions before running the demo, everything works as expected.

[Image: visualized_results_eva_no_window_gpu]