Hi @fuweifu-vtoo, sorry for the late reply. Essentially, the text encoder in CLIP is also a BERT, so it is not much different from Grounding DINO's. The difference lies in how the two models encode the text prompt. Grounding DINO feeds the whole sentence into BERT and takes the embeddings of the corresponding tokens as the text representation. In T-Rex2, we use a phrase as the input to BERT and take only the output of the CLS token as the text representation. The reason is that T-Rex2 is a late-fusion architecture: the text embedding does not interact with the image features and only computes a similarity with the queries at the final output layer. We therefore want the text representation to be as simple as possible, i.e., no matter how long the input phrase is, we represent it with a single embedding.
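To make the contrast concrete, here is a minimal sketch of phrase-level encoding followed by late-fusion similarity, using the Hugging Face CLIP text encoder. The model name, example phrases, and random stand-in queries are illustrative assumptions, not T-Rex2's actual code.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Each category phrase is encoded on its own and reduced to a single vector,
# in the spirit of the phrase-level text prompt described above.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

phrases = ["a red apple", "traffic light"]            # example phrases
inputs = tokenizer(phrases, padding=True, return_tensors="pt")
with torch.no_grad():
    outputs = text_encoder(**inputs)

# One embedding per phrase: CLIP's pooled summary token plays the role of the
# CLS-token representation mentioned above.
phrase_emb = outputs.pooler_output                    # (num_phrases, dim)

# Late fusion: the text embeddings never touch the image features; similarity
# with the detector's queries is only computed at the final output layer.
queries = torch.randn(900, phrase_emb.shape[-1])      # stand-in decoder outputs
logits = queries @ phrase_emb.t()                     # (num_queries, num_phrases)
```

By contrast, Grounding DINO keeps the per-token embeddings of the whole sentence, which is what its early fusion with image features operates on.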
Thank you for your detailed explanation~ Another question I hope you can answer: does T-Rex2 freeze the CLIP text encoder during training?
Also, how long does it take to train T-Rex2 with a Swin Transformer Tiny backbone on 16 NVIDIA A100 GPUs with a total batch size of 128?
The CLIP text encoder is not frozen during training. It takes around 3 days to train a Swin-T model.
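For concreteness, a minimal sketch of what "not frozen" means in training code: the text encoder's parameters are simply passed to the optimizer alongside the rest of the model. The parameter groups, learning rates, and the stand-in detector head are assumptions for illustration, not T-Rex2's actual configuration.

```python
import torch
import torch.nn as nn
from transformers import CLIPTextModel

text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
detector_head = nn.Linear(512, 256)   # stand-in for the rest of the detector

# The text encoder's parameters keep requires_grad=True (no freezing), so they
# are optimized together with the detector. Freezing would instead set
# p.requires_grad = False and exclude them from the optimizer.
optimizer = torch.optim.AdamW(
    [
        {"params": text_encoder.parameters(), "lr": 1e-5},   # assumed lower lr
        {"params": detector_head.parameters(), "lr": 1e-4},
    ],
    weight_decay=1e-4,
)
```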
@Mountchicken How many iterations was T-Rex2 (Swin-T) trained for? Three days is much shorter than my estimate.
T-Rex2 goes through multiple rounds of training. We first train with the text prompt and then load those weights before training with the visual prompt. The last training phase took about 3 days and ran for 100,000 iterations.
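A rough sketch of that staged flow (train with text prompts first, then initialize the visual-prompt stage from those weights). The stand-in model, file name, and the `strict=False` choice are assumptions, not the repository's actual scripts.

```python
import torch
import torch.nn as nn

# Stand-in model; in practice this would be the full T-Rex2 detector.
model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 4))

# Stage 1: train with text prompts, then save the weights.
torch.save({"model": model.state_dict()}, "text_prompt_stage.pth")

# Stage 2: load the stage-1 weights as initialization before training with
# visual prompts; strict=False tolerates newly added visual-prompt modules.
state = torch.load("text_prompt_stage.pth", map_location="cpu")
model.load_state_dict(state["model"], strict=False)
```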
Hi, how long did the first phase of training take?
Nearly three days on 8xA100
Why does the language encoder use CLIP instead of BERT as in Grounding DINO? In other words, why not implement T-Rex2 along the lines of Grounding DINO?