IDEA-Research / T-Rex

[ECCV2024] API code for T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy
https://deepdataspace.com/blog/T-Rex

Why not implement T-Rex2 along the lines of Grounding DINO? #53

Closed fuweifu-vtoo closed 5 months ago

fuweifu-vtoo commented 6 months ago

Why does the language encoder use CLIP rather than BERT like Grounding DINO? In other words, why not implement T-Rex2 along the lines of Grounding DINO?

Mountchicken commented 6 months ago

Hi @fuweifu-vtoo, sorry for the late reply. Essentially, the text encoder in CLIP is also a BERT-style transformer, so it is not very different from Grounding DINO. The difference lies in how the two models represent the text prompt. Grounding DINO feeds the whole sentence into BERT and takes the embeddings of the corresponding tokens as the representation. In T-Rex2, we feed each phrase into the text encoder separately and take only the output of the CLS token as the text representation. The reason is that T-Rex2 is a late-fusion architecture: the text embedding does not interact with the image features and only computes a similarity with the object queries at the final output layer. We therefore want the text representation to be as simple as possible: no matter how long the input phrase is, we represent it as a single embedding.
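
To illustrate, here is a minimal sketch of the per-phrase encoding and the late-fusion scoring described above, using Hugging Face `transformers`. The model checkpoint, the query count, and the similarity step are illustrative assumptions, not T-Rex2's actual code:

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Per-phrase encoding (T-Rex2 style): each category phrase becomes
# exactly one embedding, no matter how many tokens it contains.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

phrases = ["dog", "traffic light", "person riding a bike"]
inputs = tokenizer(phrases, padding=True, return_tensors="pt")
with torch.no_grad():
    out = text_encoder(**inputs)
# HF's CLIP pools the end-of-sequence token; the paper describes the
# CLS token, but either way each phrase maps to a single vector.
text_embeds = out.pooler_output            # (num_phrases, hidden_dim)

# Late fusion (assumed form): the text embeddings never touch the image
# features; they only score the detector's object queries at the end.
num_queries, hidden_dim = 900, text_embeds.shape[-1]
queries = torch.randn(num_queries, hidden_dim)  # stand-in for decoder output
logits = queries @ text_embeds.T                # (num_queries, num_phrases)
```

By contrast, an early-fusion model like Grounding DINO needs per-token text embeddings because they are fused with image features inside the transformer, which is why it feeds the whole sentence through BERT.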

fuweifu-vtoo commented 6 months ago

Thank you for your detailed explanation~ Another question I hope you can answer: does T-Rex2 freeze the CLIP text encoder during training?

fuweifu-vtoo commented 6 months ago

Also, how long does it take to train T-Rex2 with a Swin Transformer tiny backbone on 16 NVIDIA A100 GPUs with a total batch size of 128?

Mountchicken commented 6 months ago

The CLIP text encoder is not frozen during the training process. It takes around 3 days to train a Swin-T model.
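
For reference, a hedged sketch of what "not frozen" means in practice; the optimizer setup and the `detector` stand-in module are assumptions for illustration, not the repo's training code:

```python
import torch
from torch import nn
from transformers import CLIPTextModel

text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
detector = nn.Linear(512, 512)  # stand-in for the detection backbone/head

# Not frozen (T-Rex2's setting): the text encoder's parameters go into
# the optimizer and are updated jointly with the rest of the model.
optimizer = torch.optim.AdamW(
    list(detector.parameters()) + list(text_encoder.parameters()), lr=1e-4
)

# Freezing it would instead look like this:
# for p in text_encoder.parameters():
#     p.requires_grad_(False)
```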

Baboom-l commented 5 months ago

@Mountchicken How many iterations was T-Rex2 (Swin-T) trained for? Three days is much shorter than my estimate.

Mountchicken commented 5 months ago

T-Rex2 goes through multiple rounds of training. We first train with the text prompt and then load those weights before training with the visual prompt. The last training phase took about 3 days for 100,000 iterations.
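
As a quick sanity check on that schedule, assuming the total batch size of 128 mentioned earlier in this thread (the per-iteration time below is derived, not stated):

```python
iterations = 100_000
batch_size = 128          # total batch size from the earlier question
seconds = 3 * 24 * 3600   # roughly 3 days

images_seen = iterations * batch_size       # 12.8M samples processed
sec_per_iter = seconds / iterations         # ~2.6 s per iteration
print(images_seen, round(sec_per_iter, 2))  # 12800000 2.59
```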

fuweifu-vtoo commented 5 months ago

Hi, how long did the first phase of training take?

Mountchicken commented 5 months ago

Nearly three days on 8x A100.