IDEA-Research / detrex

detrex is a research platform for DETR-based object detection, segmentation, pose estimation and other visual recognition tasks.
https://detrex.readthedocs.io/en/latest/
Apache License 2.0
1.95k stars, 204 forks

MaskDino Fails to learn Precise Bounding Boxes on custom dataset but Dino does #242

Open FabianSchuetze opened 1 year ago

FabianSchuetze commented 1 year ago

Thanks for the wonderful repo. It's a pleasure to work with it and to read the code.

When training MaskDino on a custom dataset, the bounding box predictions are not very good. Interestingly:

Does anybody have an idea what I could tune to generate good bb results?

Training Details: I have slightly modified the training process (see this branch https://github.com/FabianSchuetze/detrex/tree/my_changes). I added AMP (automatic mixed precision) training and some gradient checkpointing. I train on one GPU with a batch size of four for MaskDino (Dino works with a batch size of 8). The learning rate is decayed linearly.
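For reference, the linear learning-rate decay mentioned above can be sketched as follows. This is only a minimal illustration in plain Python, not the scheduler used in detrex or in the linked branch; the function name `linear_decay_lr`, the `final_factor` parameter and the example values are assumptions made for the sketch.

```python
def linear_decay_lr(base_lr: float, step: int, total_steps: int,
                    final_factor: float = 0.0) -> float:
    """Learning rate that falls linearly from base_lr to base_lr * final_factor.

    Hypothetical helper for illustration; not part of detrex.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp the step so that anything past the end keeps the final rate.
    progress = min(max(step, 0), total_steps) / total_steps
    factor = 1.0 - (1.0 - final_factor) * progress
    return base_lr * factor


if __name__ == "__main__":
    # Illustrative values only: base rate 1e-4 over 100 steps.
    for s in (0, 25, 50, 100):
        print(s, linear_decay_lr(1e-4, s, 100))
```

Printing the schedule for both runs is a cheap way to confirm that the Dino and MaskDino runs actually follow the same learning-rate curve before comparing their logs.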

Data: The instances are very dense, similar to the crowded ("iscrowd") scenes of COCO. There is only one class. I have adjusted the num_objects in the config files.
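Since the scenes are dense, it may help to check how many instances an image actually contains before picking an object budget such as the num_objects setting mentioned above. The following is a rough sketch that counts annotations per image in a COCO-style dictionary; the helper names `instances_per_image` and `max_instances` are hypothetical, and the toy dictionary only mimics the `images` and `annotations` keys of a COCO annotation file.

```python
from collections import Counter


def instances_per_image(coco: dict) -> Counter:
    """Count annotations per image id in a COCO-style dict (hypothetical helper)."""
    # Start every image at zero so images without annotations are included.
    counts = Counter({img["id"]: 0 for img in coco["images"]})
    for ann in coco["annotations"]:
        counts[ann["image_id"]] += 1
    return counts


def max_instances(coco: dict) -> int:
    """Largest number of instances found in any single image."""
    return max(instances_per_image(coco).values(), default=0)


if __name__ == "__main__":
    # Toy data for illustration; in practice this would come from json.load(...)
    toy = {
        "images": [{"id": 1}, {"id": 2}, {"id": 3}],
        "annotations": [{"image_id": 1}, {"image_id": 1}, {"image_id": 2}],
    }
    print(instances_per_image(toy), max_instances(toy))
```

If the maximum is close to or above the number of objects the model can predict per image, low recall would be expected regardless of how well the boxes are regressed.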

Logs: Logs of the training runs are attached below. There are three logs:

Hyperparameters: Comparing the parameters, the following aspects seem notable:

maskdino_0.4_noise_scale.txt maskdino_1.0_noise_scale.txt dino_log.txt

Does anybody have an idea how to debug the problem?

FabianSchuetze commented 1 year ago

To reproduce the results, I have used a public dataset with similar characteristics: the COB-3D dataset, see https://arxiv.org/abs/2210.07424 . I have extracted the RGB images, bounding boxes, and instance masks in the COCO format. The dataset is a bit small (~6k images) and can be downloaded here. The original data is here. Please note that the data is published under a CC non-commercial license, see https://github.com/wyndwarrior/autoregressive-bbox/blob/main/LICENSE .

An image of the MaskDino predictions and the ground truth: image

The logs for Dino and MaskDino are uploaded below. dino.txt maskDino.txt

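Interestingly:

  • The bb mAP for Dino is much better than for MaskDino. The training is a bit short, but I noticed a similar difference after longer training.
  • However, when looking at the predictions, the visualized bbs for Dino are not that much better. Both show somewhat low recall. I also uploaded the json predictions.
  • The results with standard Mask-RCNN heads are generally pretty good on this dataset. They have good recall and good precision.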

HaoZhang534 commented 1 year ago

To reproduce the results, I have used a public dataset with similar characteristics: the COB-3D dataset, see https://arxiv.org/abs/2210.07424 . I have extracted the RGB images, bounding boxes, and instance masks in the COCO format. The dataset is a bit small (~6k images) and can be downloaded here. The original data is here. Please note that the data is published under a CC non-commercial license, see https://github.com/wyndwarrior/autoregressive-bbox/blob/main/LICENSE .

An image of the MaskDino predictions and the ground truth: image

The logs for Dino and MaskDino are uploaded below. dino.txt maskDino.txt

Interestingly:

  • The bb mAP for Dino is much better than for MaskDino. The training is a bit short, but I noticed a similar difference after longer training.
  • However, when looking at the predictions, the visualized bbs for Dino are not that much better. Both show somewhat low recall. I also uploaded the json predictions.
  • The results with standard Mask-RCNN heads are generally pretty good on this dataset. They have good recall and good precision.

Hello, I noticed that the boxes predicted by MaskDino are all shifted slightly toward the upper right. I suspect there is a bug in the postprocessing code.
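One way to check for such a systematic shift is to match each predicted box to its best-overlapping ground-truth box and average the difference between the box centers. The sketch below is a minimal, self-contained illustration and not code from detrex or MaskDINO; the helpers `iou_xywh` and `mean_center_offset`, the `min_iou` threshold and the toy boxes are all assumptions. Boxes are taken to be in COCO `[x, y, width, height]` format, with y growing downward, so a shift toward the upper right shows up as a positive x offset and a negative y offset.

```python
def iou_xywh(a, b):
    """Intersection over union of two boxes given as [x, y, width, height]."""
    ax1, ay1, aw, ah = a
    bx1, by1, bw, bh = b
    ax2, ay2 = ax1 + aw, ay1 + ah
    bx2, by2 = bx1 + bw, by1 + bh
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def mean_center_offset(pred_boxes, gt_boxes, min_iou=0.1):
    """Average (dx, dy) between predicted and matched ground-truth box centers.

    Each prediction is matched to the ground-truth box with the highest IoU;
    matches below min_iou are ignored. Returns None if nothing matches.
    """
    dxs, dys = [], []
    for p in pred_boxes:
        best = max(gt_boxes, key=lambda g: iou_xywh(p, g), default=None)
        if best is None or iou_xywh(p, best) < min_iou:
            continue
        dxs.append((p[0] + p[2] / 2) - (best[0] + best[2] / 2))
        dys.append((p[1] + p[3] / 2) - (best[1] + best[3] / 2))
    if not dxs:
        return None
    return sum(dxs) / len(dxs), sum(dys) / len(dys)


if __name__ == "__main__":
    # Toy boxes: every prediction is shifted 2 px right and 3 px up.
    gt = [[10, 10, 20, 20], [50, 50, 20, 20]]
    pred = [[12, 7, 20, 20], [52, 47, 20, 20]]
    print(mean_center_offset(pred, gt))
```

A mean offset that stays clearly away from zero across many images would point to a coordinate or resizing problem in postprocessing rather than to a model that has not converged.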

HaoZhang534 commented 1 year ago

@FabianSchuetze When you have a relatively small dataset, Mask-RCNN usually does well enough. MaskDINO and DINO are better suited to relatively large datasets such as COCO.

HaoZhang534 commented 1 year ago

@FabianSchuetze We fixed a bug in #249. Could you run it again to see whether this solves your problem? Please also refer to the discussion in #247.

FabianSchuetze commented 1 year ago

Thank you so much, @HaoZhang534! I will train the model again tomorrow and report back.

FabianSchuetze commented 1 year ago

@HaoZhang534, I have worked with the new commits, but the bounding boxes are still shifted. I have commented again in #247.

Furthermore, I am still not getting very good results. Maybe training is not really feasible with a batch size of just 4? I will try to train on MS COCO and see whether I can reproduce the original results. Could you attach a log of the original training run? That would be wonderful and would make a comparison easier.