ashkamath / mdetr


extra LN at the end of encoder is removed #61

Closed JacobYuan7 closed 2 years ago

JacobYuan7 commented 2 years ago

Thanks for your great work. I find the comment in transformer.py a bit confusing: it states that the "extra LN at the end of encoder is removed", but I am not able to find the corresponding code change. I think the relevant code is https://github.com/ashkamath/mdetr/blob/bf09d98b0b41cd615185dcb0082299a5ba24c319/models/transformer.py#L180-L185, but it seems identical to the original DETR.

ashkamath commented 2 years ago

Hey, our code base is an extension of the DETR code base, so it is expected that many things are identical to the original DETR. The comment at the top states the differences with respect to torch's Transformer class, not with respect to DETR.
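To make the distinction concrete, here is a small illustrative snippet (written against recent PyTorch, not copied verbatim from the file; if I remember the code correctly, the repo's own TransformerEncoder follows the same pattern):

```python
import torch
from torch import nn

d_model, nhead, num_layers, normalize_before = 256, 8, 6, False

# torch.nn.Transformer always attaches a final LayerNorm to its encoder:
torch_style_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead),
    num_layers,
    norm=nn.LayerNorm(d_model),  # the "extra LN" the comment refers to
)

# The DETR/MDETR-style constructor only keeps that final norm in the
# pre-norm setting, so in the default post-norm configuration it is dropped:
detr_style_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead),
    num_layers,
    norm=nn.LayerNorm(d_model) if normalize_before else None,
)

x = torch.randn(10, 2, d_model)  # (sequence, batch, d_model)
print(torch_style_encoder(x).shape, detr_style_encoder(x).shape)
```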

Best, Aishwarya


JacobYuan7 commented 2 years ago

@ashkamath Thanks so much for your reply. While reading your paper, I have a question about detection on the LVIS dataset. The paper says it takes about 10 s to run detection on a single image. As I understand it, you traverse all the class names and input one class name at a time as the text prompt, which means running detection on the same image about 1,200 times. Why not input as many class names as possible at once? That would certainly reduce the time needed per image. Looking forward to your kind reply.
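To illustrate what I mean, a rough sketch of that per-category loop (`run_mdetr` and the class list here are placeholders, not the actual API of this repo):

```python
# Hypothetical illustration of per-category evaluation on LVIS.
def run_mdetr(image, caption):
    """Stand-in for one MDETR forward pass conditioned on a text prompt."""
    return {"caption": caption, "boxes": []}

image = "some_lvis_image.jpg"
lvis_class_names = ["person", "bicycle", "zebra"]  # really ~1,200 categories

# one forward pass per category -> ~1,200 passes for a single image
per_class_outputs = [run_mdetr(image, name) for name in lvis_class_names]
print(len(per_class_outputs))
```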

ashkamath commented 2 years ago

That would reduce the time, yes, but it could also hurt performance. That said, the recent GLIP paper (https://arxiv.org/abs/2112.03857) seems to take this approach and does well, so maybe it's worth a try :)
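Something along these lines, purely as an illustration (`run_mdetr` is a placeholder rather than the actual inference API, and the chunk size of 30 is arbitrary):

```python
# Hypothetical sketch of packing several class names into one caption,
# in the spirit of GLIP, instead of one name per forward pass.
def run_mdetr(image, caption):
    """Stand-in for one forward pass conditioned on a text prompt."""
    return {"caption": caption, "boxes": []}

def chunk(names, size):
    for i in range(0, len(names), size):
        yield names[i:i + size]

image = "some_lvis_image.jpg"
lvis_class_names = ["person", "bicycle", "zebra", "giraffe", "kite", "spoon"]

# e.g. 30 names per prompt -> ~40 forward passes instead of ~1,200,
# at the risk of the categories interfering with each other in the text encoder
outputs = [
    run_mdetr(image, ". ".join(group))
    for group in chunk(lvis_class_names, 30)
]
print(len(outputs))
```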