antoyang / TubeDETR

[CVPR 2022 Oral] TubeDETR: Spatio-Temporal Video Grounding with Transformers
Apache License 2.0

Any plan on applying it to Action tube detection #6

Closed (gurkirt closed 2 years ago)

gurkirt commented 2 years ago

Hi, great work!

Thanks for sharing the code. Do you have any plans to apply it to the action tube detection problem? I guess we would have to strip off the text encoder.

Best, Gurkirt

antoyang commented 2 years ago

Hi, I do not plan to apply it to action tube detection, but this is a very relevant problem! Yes, you would have to strip off the text encoder. Also, a different pretraining than the one used in our work might be needed.

gurkirt commented 2 years ago

Thanks for the reply. https://arxiv.org/abs/2104.00969 looks similar to yours as well. Can you point out the major differences from this one?

antoyang commented 2 years ago

This is indeed relevant related work! I would say that the main differences come from the specific task each work tackles: a task with natural language input in our case, and a task without natural language input that instead requires predicting an action label in their case.
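
To make the contrast between the two tasks concrete, here is a minimal sketch of their input/output interfaces. It is only an illustration: the names `Tube`, `grounding_output`, and `action_detection_output`, as well as the `(x1, y1, x2, y2)` box convention, are hypothetical and are not taken from the TubeDETR code or from the linked paper.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Hypothetical box convention: (x1, y1, x2, y2) in normalized coordinates.
Box = Tuple[float, float, float, float]


@dataclass
class Tube:
    """A spatio-temporal tube: one box per frame index inside a time span."""
    boxes: Dict[int, Box]               # frame index -> box
    action_label: Optional[str] = None  # only set for action tube detection

    @property
    def span(self) -> Tuple[int, int]:
        """First and last frame index covered by the tube."""
        return min(self.boxes), max(self.boxes)


def grounding_output(query: str, boxes: Dict[int, Box]) -> Tube:
    """Video grounding: the text query is an input, and no label is predicted."""
    if not query:
        raise ValueError("grounding needs a natural language query")
    return Tube(boxes=boxes)


def action_detection_output(label: str, boxes: Dict[int, Box]) -> Tube:
    """Action tube detection: no text input, but the tube carries a predicted label."""
    return Tube(boxes=boxes, action_label=label)


boxes = {3: (0.1, 0.2, 0.5, 0.9), 4: (0.12, 0.2, 0.52, 0.9)}
grounded = grounding_output("the man who picks up the cup", boxes)
detected = action_detection_output("pick up", boxes)
```

In this sketch the tube structure is shared between the two tasks; what changes is whether text enters as an input or a class label leaves as an output.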