Open StarCycle opened 7 months ago
Hi @StarCycle ,
Thanks for your attention to this work. GR-1 uses MAE and CLIP for encoding images and texts. MAE and CLIP possess strong representation capabilities, which provide good image and text representations for GR-1. GR-1 also uses a GPT-style transformer, which gracefully handles sequences of signals from different modalities, i.e., images, text, and states. All of these contribute to the strong performance of the backbone. As for the perceiver resampler, it is used to reduce the number of image tokens.
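For readers unfamiliar with the resampler idea mentioned above, here is a minimal sketch of how a perceiver resampler compresses image tokens: a small set of learned latent queries cross-attends to the full set of patch tokens, so the transformer downstream only sees the latents. All shapes and names here are illustrative, not GR-1's actual implementation; a real resampler would also have learned projections, multiple heads, and layers.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def perceiver_resample(image_tokens, latents):
    """Single-head cross-attention: latents attend to image tokens.

    image_tokens: (N, d) patch tokens from the image encoder (e.g. N=196)
    latents:      (M, d) learned queries, with M << N
    returns:      (M, d) compressed token set
    """
    d = image_tokens.shape[-1]
    attn = softmax(latents @ image_tokens.T / np.sqrt(d))  # (M, N)
    return attn @ image_tokens                             # (M, d)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((196, 64))   # hypothetical ViT patch tokens
latents = rng.standard_normal((9, 64))    # hypothetical learned latents
out = perceiver_resample(tokens, latents)
print(out.shape)  # → (9, 64): 196 image tokens reduced to 9
```

The key point is that the sequence length handed to the GPT-style backbone shrinks from N to M, which keeps the cost of attending over multiple frames and modalities manageable.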
@hongtaowu67 Thank you for the response! What kind of GPU and how many GPU hours are needed to train this model?
Hi @hongtaowu67,
You showed the performance without pretraining and video prediction in the appendix, i.e., the following table:
Note that the performance is much stronger than other baselines, including MT-R3M pretrained on 5M images from Ego4D:
Do you have any insight into the strong performance of the backbone? Perhaps selecting the CLS token of the ViT and using a perceiver to process the other tokens would improve the network significantly?
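The suggestion above amounts to splitting the ViT output into its CLS token (kept as a global summary) and the patch tokens (handed to a resampler). A minimal sketch, with hypothetical shapes (197 tokens = 1 CLS + 196 patches, dim 64), assuming the CLS token sits at position 0 as in standard ViT implementations:

```python
import numpy as np

# hypothetical ViT output: CLS token at index 0, then 196 patch tokens
vit_out = np.random.default_rng(1).standard_normal((197, 64))

cls_token = vit_out[:1]      # (1, 64): global image summary, kept as-is
patch_tokens = vit_out[1:]   # (196, 64): would be compressed by a resampler

print(cls_token.shape, patch_tokens.shape)  # → (1, 64) (196, 64)
```

The backbone would then consume the CLS token plus the resampled patch tokens, instead of all 197 tokens.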
This is nice work, and I hope to see its open-source version soon!
Best, StarCycle