dddzg / up-detr

[TPAMI 2022 & CVPR2021 Oral] UP-DETR: Unsupervised Pre-training for Object Detection with Transformers
Apache License 2.0

How to support batch learning for one-shot object detection training? #24

Closed JosephAssaker closed 2 years ago

JosephAssaker commented 2 years ago

So in the paper you suggest training UP-DETR for the task of one-shot object detection and provided interesting results on VOC.

As you don't seem to provide any code in this GitHub repo for one-shot object detection training (please correct me if I'm mistaken), I tried to implement it myself. However, I ran into an obstacle when it came to supporting batch learning. If we have a minibatch of N target images, each has its own query patch, so there are N query patches in the minibatch. How would you apply GAP and add the features of these N query patches to the object queries in the decoder? Adding the features of the ith query patch to the object queries while the decoder processes the jth target image (which is unrelated to the ith query object) doesn't seem like the correct thing to do.

So, my question is, were you able to support batch learning for one-shot object detection? If so, how?

dddzg commented 2 years ago

Each query patch will be added to all object queries.

The one-shot detection code differs from what is currently in the repo, so we did not publish it.

JosephAssaker commented 2 years ago

So you would compute the GAP on the features of all N query patches within a single minibatch, sum them, and add this sum to the object queries? If so, how would you interpret this implementation? And how would it make sense, at inference time, to switch to adding only the features of the one query patch of interest for the one target image being tested?

dddzg commented 2 years ago

I think there is a misunderstanding. Let N be the number of query patches, which is also the batch size, and let M be the number of object queries. We get an (N,D) query patch feature and repeat it M times, which gives an (N,M,D) query patch feature. The M object queries are repeated N times, and then they are added together.

This means each query patch is added to all object queries. The object queries can be treated as possible positions of the query patch.

JosephAssaker commented 2 years ago

Thank you for your engagement in this issue!

However, I didn't really get what "M object queries will repeat N times, then they are added together" actually means.

To my understanding, after you get the (N,D) query patch features, you sum all of them up on the first dimension to end up with a (1,D) query patch features, then repeat that M times to get a (1,M,D) query patch feature that can be directly added to the (M,D) object queries, right? And after that you can perform a single forward pass for all images within the minibatch.

If, on the other hand, you meant that you would repeat the decoder forward pass N times, then that would mean you'd have a batch size of 1 in the decoder, rather than N.

dddzg commented 2 years ago

Your understanding is not correct. We do not sum them over the first dimension.

query patch feature: (N,D).unsqueeze(1) => (N,1,D).repeat(1,M,1) => (N,M,D)
object query: (M,D).unsqueeze(0) => (1,M,D).repeat(N,1,1) => (N,M,D)

then they are added.

N = batch size = 1 also works.
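
The shape manipulation above can be sketched in PyTorch as follows. This is only a minimal illustration, not the unpublished one-shot code: the function name `add_patch_to_queries` and the sizes in the usage lines are made up for the example, and it assumes the patch features have already been pooled to shape (N, D).

```python
import torch


def add_patch_to_queries(patch_feat: torch.Tensor, object_queries: torch.Tensor) -> torch.Tensor:
    """Add each image's pooled query-patch feature to every object query.

    patch_feat:     (N, D), one pooled feature per query patch (N = batch size)
    object_queries: (M, D), shared learned object queries
    returns:        (N, M, D), batch-first decoder queries
    """
    n = patch_feat.shape[0]
    m = object_queries.shape[0]
    # (N, D) -> (N, 1, D) -> (N, M, D): copy each patch feature M times
    patch = patch_feat.unsqueeze(1).repeat(1, m, 1)
    # (M, D) -> (1, M, D) -> (N, M, D): copy the object queries N times
    queries = object_queries.unsqueeze(0).repeat(n, 1, 1)
    # Element-wise sum; nothing is mixed across the batch dimension
    return patch + queries


# Hypothetical sizes, chosen only for illustration
N, M, D = 4, 100, 256
patch_feat = torch.randn(N, D)
object_queries = torch.randn(M, D)
out = add_patch_to_queries(patch_feat, object_queries)
print(out.shape)  # torch.Size([4, 100, 256])
```

Row i of the output depends only on query patch i, so the jth target image never sees the ith patch. Plain broadcasting, `patch_feat[:, None, :] + object_queries[None, :, :]`, should give the same tensor without materializing the repeated copies.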

dddzg commented 2 years ago

[image] Here is an illustration of one-shot detection. I hope it helps you understand it.

JosephAssaker commented 2 years ago

Hello again!

Sorry for the late reply, but I finally got the chance to re-implement my solution given your feedback, and it all works great! I hadn't realized that in the original DETR architecture the object queries matrix is repeated N times (N being the minibatch size) in the forward pass. For some reason I was convinced that it was forwarded as an (M,D) matrix.
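
For anyone hitting the same confusion, here is a rough sketch of that repetition. It assumes a sequence-first (M, N, D) layout for the decoder queries, which is how DETR's transformer appears to arrange them; the variable names and sizes are illustrative and not copied from either repository.

```python
import torch
from torch import nn

# Hypothetical sizes, chosen only for illustration
N, M, D = 2, 100, 256

# Learned object queries are stored once as an (M, D) embedding table
query_embed = nn.Embedding(M, D)

# Expanded per batch inside the forward pass: (M, D) -> (M, 1, D) -> (M, N, D)
queries = query_embed.weight.unsqueeze(1).repeat(1, N, 1)

# Pooled query-patch features, one per image in the batch: (N, D)
patch_feat = torch.randn(N, D)

# In this layout the patch features broadcast over the query axis: (1, N, D)
decoder_queries = queries + patch_feat.unsqueeze(0)
print(decoder_queries.shape)  # torch.Size([100, 2, 256])
```

The only difference from a batch-first (N, M, D) arrangement is which axis gets unsqueezed; each image's queries still receive only that image's patch feature.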

In any case thank you for your help!

I just have one last, not directly related, question about the paper's results. It is mentioned that DETR achieves 57.3% AP for unseen classes whereas UP-DETR achieves 73.1%. However, it is not clear what "DETR" refers to. Does "DETR" here refer to UP-DETR's same architecture but without the unsupervised pre-training?

dddzg commented 2 years ago

"Does "DETR" here refer to UP-DETR's same architecture but without the unsupervised pre-training?"

Yes.