MCG-NJU / MixFormer

[CVPR 2022 Oral & TPAMI 2024] MixFormer: End-to-End Tracking with Iterative Mixed Attention
https://arxiv.org/abs/2203.11082
MIT License

MixViT ConvMAE #103

Open samueleruffino99 opened 10 months ago

samueleruffino99 commented 10 months ago

Hello, I have seen that you refer to the ConvMAE-pretrained method as MixViT-ConvMAE, but looking at your implementation, the backbone is actually much closer to the MixCvT layout, with multiple patch-embedding stages followed by blocks (see the sketch below). Am I missing something, or is that the case? I ask because I am trying to adapt PiMAE the same way you adapted ConvMAE. Thank you!
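To make my question concrete, here is a minimal sketch of the multi-stage layout I mean (my own hypothetical names and dims, not code from your repo): strided-conv patch embeddings stacked per stage, then transformer blocks on the flattened tokens.

```python
import torch.nn as nn

class PatchEmbed(nn.Module):
    """One stage of strided-conv patch embedding (hypothetical sketch)."""
    def __init__(self, in_ch, out_ch, stride):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=stride, stride=stride)

    def forward(self, x):            # x: (B, C, H, W)
        return self.proj(x)          # (B, out_ch, H/stride, W/stride)

class MultiStageBackbone(nn.Module):
    """ConvMAE/MixCvT-like stage structure; NOT the actual MixViT-ConvMAE code."""
    def __init__(self, dims=(64, 128, 256)):
        super().__init__()
        self.stage1 = PatchEmbed(3, dims[0], stride=4)        # 1/4 resolution
        self.stage2 = PatchEmbed(dims[0], dims[1], stride=2)  # 1/8
        self.stage3 = PatchEmbed(dims[1], dims[2], stride=2)  # 1/16
        self.blocks = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dims[2], nhead=8,
                                       batch_first=True),
            num_layers=2)

    def forward(self, x):
        x = self.stage3(self.stage2(self.stage1(x)))   # (B, C, H/16, W/16)
        tokens = x.flatten(2).transpose(1, 2)          # (B, N, C)
        return self.blocks(tokens)
```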

Moreover, I have seen that during training you pass the template and search tokens through the same backbone; how does the training procedure handle this? I would like to enrich your model with some notion of hand trajectory (e.g., whether the tracked object is currently being handled), roughly along the lines of the sketch below.
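For reference, this is how I currently picture the shared-backbone pass: embed templates and search region with the same stages, concatenate along the token axis so self-attention mixes them, then keep the search tokens for the head. This builds on the hypothetical `MultiStageBackbone` sketch above and is only my reading, not your implementation.

```python
import torch

def embed_tokens(backbone, x):
    # Hypothetical helper: run the conv stages and flatten to tokens.
    x = backbone.stage3(backbone.stage2(backbone.stage1(x)))
    return x.flatten(2).transpose(1, 2)                 # (B, N, C)

def forward_mixed(backbone, template, online_template, search):
    t  = embed_tokens(backbone, template)               # (B, Nt, C)
    ot = embed_tokens(backbone, online_template)        # (B, Nt, C)
    s  = embed_tokens(backbone, search)                 # (B, Ns, C)
    # One concatenated sequence: attention inside the shared blocks
    # mixes template and search tokens.
    tokens = backbone.blocks(torch.cat([t, ot, s], dim=1))
    # Only the search tokens go on to the prediction head; an extra
    # stream (e.g. hand-trajectory tokens) could be appended before
    # the concat in the same way.
    return tokens[:, -s.shape[1]:, :]
```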

yutaocui commented 10 months ago

In terms of the patch-embedding style, MixViT-ConvMAE is indeed more like MixCvT, so you are right. As for the second question, I am not sure what you mean; could you give a more detailed explanation?