facebookresearch / detr

End-to-End Object Detection with Transformers
Apache License 2.0

Big misunderstanding on DETR #589

Open Alan-D-Chen opened 1 year ago

Alan-D-Chen commented 1 year ago

❓ How to do something using DETR

Dear Prof. Nicolas Carion,

Hi, I am Dong Chen, a Ph.D. student in China. DETR is a well-known and important paper that plays a fundamental role in the industry. However, one point about it has caused considerable disagreement, both within my team and in the peer review of my own paper.

In my paper, the question I want to explore is why using transformer blocks has a negative impact on small-object detection performance compared with CNN layers. I therefore ran multiple experiments, increasing or decreasing the number of CNN layers or the number of transformer blocks in DETR. My goal is to explain that CNNs and Transformers have different feature extraction mechanisms for object detection.

During review, however, a reviewer proposed a different interpretation, based on the passage highlighted in yellow. The reviewer is trying to tell me that DETR applies different processing schemes to targets (or feature maps) of different sizes, like this:

Large or medium objects (features) -> CNN layers -> detector head

Small objects (features) -> CNN layers -> transformer blocks -> detector head

Perhaps the reviewer thinks that, because CNNs and Transformers have different feature extraction mechanisms, multi-scale feature map interaction in CNN-based or Transformer-based models can improve small-object detection results. But I only want to know how using more or fewer CNN layers and transformer blocks influences the feature extraction mechanism. (I do not even mention multi-scale feature maps.)
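To make the question concrete, here is a minimal shape-tracing sketch of the DETR data path as I understand it from the paper and this repository's default settings (ResNet-50 backbone with a single stride-32 feature map, hidden size 256, 6 encoder and 6 decoder layers, 100 object queries). The function name `trace_detr_shapes` and its step labels are only illustrative, and the code does no real computation. It only shows the path I believe every image takes, and which parts my experiments change. Please correct me if this picture is wrong.

```python
import math


def trace_detr_shapes(height, width, enc_layers=6, dec_layers=6,
                      num_queries=100, hidden_dim=256, stride=32):
    """Trace tensor shapes through a DETR-style pipeline.

    Illustrative only: no network is built. The defaults follow my
    understanding of the repository's default arguments (an assumption).
    """
    # Backbone: one feature map at roughly 1/32 of the input resolution.
    fh = math.ceil(height / stride)
    fw = math.ceil(width / stride)
    steps = [
        ("image", (3, height, width)),
        ("CNN backbone (last stage)", (2048, fh, fw)),
        ("1x1 conv projection", (hidden_dim, fh, fw)),
        ("flatten to tokens", (fh * fw, hidden_dim)),
    ]
    # Encoder: self-attention over all spatial tokens; shape is unchanged.
    for i in range(enc_layers):
        steps.append((f"encoder layer {i + 1}", (fh * fw, hidden_dim)))
    # Decoder: a fixed set of object queries attends to the encoder output.
    for i in range(dec_layers):
        steps.append((f"decoder layer {i + 1}", (num_queries, hidden_dim)))
    # Heads: one class prediction and one box per query.
    steps.append(("class + box heads", (num_queries,)))
    return steps


if __name__ == "__main__":
    for name, shape in trace_detr_shapes(800, 1066):
        print(f"{name:28s} {shape}")
```

In this sketch there is no branch that depends on object size: the same sequence of steps is applied to the whole image, and my experiments only change the depth of the CNN stage or the values of `enc_layers` and `dec_layers`.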

We need your help to clear up this question. I am looking forward to your answer.

Best wishes to you.

Chen Dong from Shanghai, China

Alan Chen

alan_chen@tongji.edu.cn Alan D.Chen, Ph.D. Student
