IDEA-Research / T-Rex

API for T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy
https://deepdataspace.com/home

Why not implement T-Rex2 along the lines of Grounding DINO #53

Closed fuweifu-vtoo closed 1 month ago

fuweifu-vtoo commented 1 month ago

Why does the language encoder use CLIP rather than BERT, as in Grounding DINO? My question is: why not implement T-Rex2 along the lines of Grounding DINO?

Mountchicken commented 1 month ago

Hi @fuweifu-vtoo, sorry for the late reply. Essentially, the text encoder in CLIP is also a BERT-style transformer, so it is not much different from Grounding DINO. The difference lies in how the two models represent the text prompt. Grounding DINO feeds the whole sentence into BERT and takes the embeddings of the corresponding tokens as the representation. In T-Rex2, we feed a phrase into the encoder and take only the output of the CLS token as the text representation. The reason is that T-Rex2 is a late-fusion structure: the text embedding does not interact with the image features and only computes a similarity with the queries at the final output layer. We therefore want the text representation to be as simple as possible, i.e., no matter how long the input phrase is, it is represented as a single embedding.
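
To make the late-fusion idea concrete, here is a minimal PyTorch sketch (not the actual T-Rex2 code) using Hugging Face's `CLIPTextModel`. The pooled output stands in for the CLS-token phrase embedding described above; the checkpoint name, query count, and random decoder queries are placeholders for illustration only.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# One embedding per phrase: the pooled output (CLS/EOT-style token)
# stands in for the single phrase embedding described above.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

phrases = ["a running dog", "traffic light"]
tokens = tokenizer(phrases, padding=True, return_tensors="pt")

with torch.no_grad():
    out = text_encoder(**tokens)

phrase_emb = torch.nn.functional.normalize(out.pooler_output, dim=-1)  # (num_phrases, dim)

# Decoder object queries from the detection head (random placeholders here).
num_queries, dim = 900, phrase_emb.shape[-1]
queries = torch.nn.functional.normalize(torch.randn(num_queries, dim), dim=-1)

# Late fusion: classification logits are just query-phrase similarities,
# computed only at the final output layer.
logits = queries @ phrase_emb.T  # (num_queries, num_phrases)
```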

fuweifu-vtoo commented 1 month ago

Thank you for your detailed explanation! Another question I hope you can answer: does T-Rex2 freeze the CLIP text encoder during training?

fuweifu-vtoo commented 1 month ago

Also, how long does it take to train T-Rex2 with a Swin Transformer Tiny (Swin-T) backbone on 16 NVIDIA A100 GPUs with a total batch size of 128?

Mountchicken commented 1 month ago

The CLIP text encoder is not frozen during training. It takes around 3 days to train a Swin-T model.
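
For illustration, a minimal sketch of keeping the CLIP text encoder trainable: its parameters simply go into the optimizer alongside the rest of the model. The separate, smaller learning rate for the text encoder is an assumption, not a confirmed T-Rex2 setting, and the linear layer is only a stand-in for the detector.

```python
import torch
from transformers import CLIPTextModel

# The CLIP text encoder stays trainable (no requires_grad_(False)), so its
# parameters are passed to the optimizer like any other module.
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
detector_head = torch.nn.Linear(512, 256)  # stand-in for the rest of the detector

optimizer = torch.optim.AdamW([
    {"params": detector_head.parameters(), "lr": 1e-4},
    {"params": text_encoder.parameters(), "lr": 1e-5},  # assumed lower LR
])
```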

Baboom-l commented 1 month ago

@Mountchicken How many iterations was the T-Rex2 Swin-T model trained for? Three days is much shorter than my estimate.

Mountchicken commented 1 month ago

T-Rex2 goes through multiple rounds of training. We first train with the text prompt and then load these weights before training with the visual prompt. The last training phase took about 3 days and ran for 100,000 iterations.
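
A hypothetical sketch of that two-stage schedule is below; the model, training loop, and checkpoint name are placeholders, not T-Rex2's actual training scripts.

```python
import torch
import torch.nn as nn

model = nn.Linear(256, 256)  # placeholder standing in for the T-Rex2 detector

def train(model, max_iters):
    # Dummy loop for illustration only.
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for _ in range(max_iters):
        loss = model(torch.randn(4, 256)).pow(2).mean()  # dummy objective
        opt.zero_grad()
        loss.backward()
        opt.step()

# Stage 1: train with the text prompt, then save the weights.
train(model, max_iters=10)
torch.save(model.state_dict(), "stage1_text_prompt.pth")

# Stage 2: load the stage-1 weights, then continue training with the visual
# prompt (the reported last phase ran ~100,000 iterations over ~3 days).
model.load_state_dict(torch.load("stage1_text_prompt.pth"))
train(model, max_iters=10)
```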

fuweifu-vtoo commented 4 weeks ago

Hi, how long did the first phase of training take?

Mountchicken commented 3 weeks ago

Nearly three days on 8xA100