Open StarCycle opened 7 months ago
Hi @StarCycle ,
Thanks for your attention to this work. GR-1 uses MAE and CLIP for encoding images and texts. MAE and CLIP possess strong representation capabilities, which provide good image and text representations for GR-1. GR-1 also uses a GPT-style transformer, which gracefully handles sequences of signals from different modalities, i.e., images, text, and states. All of these contribute to the strong performance of the backbone. As for the perceiver resampler, it is used to reduce the number of image tokens.
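For readers unfamiliar with the resampler idea mentioned above, here is a minimal sketch of how a perceiver resampler compresses image tokens: a small set of learned latent queries cross-attends to the full set of patch tokens, so the transformer downstream only sees the latents. All shapes and names here are illustrative, not GR-1's actual implementation; a real resampler would also have learned projections, multiple heads, and layers.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def perceiver_resample(image_tokens, latents):
    """Single-head cross-attention: latents attend to image tokens.

    image_tokens: (N, d) patch tokens from the image encoder (e.g. N=196)
    latents:      (M, d) learned queries, with M << N
    returns:      (M, d) compressed token set
    """
    d = image_tokens.shape[-1]
    attn = softmax(latents @ image_tokens.T / np.sqrt(d))  # (M, N)
    return attn @ image_tokens                             # (M, d)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((196, 64))   # hypothetical ViT patch tokens
latents = rng.standard_normal((9, 64))    # hypothetical learned latents
out = perceiver_resample(tokens, latents)
print(out.shape)  # → (9, 64): 196 image tokens reduced to 9
```

The key point is that the sequence length handed to the GPT-style backbone shrinks from N to M, which keeps the cost of attending over multiple frames and modalities manageable.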
@hongtaowu67 Thank you for the response! What kind of GPU and how many GPU hours are needed to train this model?
Hi @hongtaowu67,
You showed the performance without pretraining and video prediction in the appendix, i.e., the following table:
Note that the performance is much stronger than other baselines, including MT-R3M pretrained on 5M images from Ego4D:
Do you have any insight into the strong performance of the backbone? Perhaps selecting the CLS token of the ViT and using a perceiver to process the other tokens would improve the network significantly?
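The suggestion above amounts to splitting the ViT output into its CLS token (kept as a global summary) and the patch tokens (handed to a resampler). A minimal sketch, with hypothetical shapes (197 tokens = 1 CLS + 196 patches, dim 64), assuming the CLS token sits at position 0 as in standard ViT implementations:

```python
import numpy as np

# hypothetical ViT output: CLS token at index 0, then 196 patch tokens
vit_out = np.random.default_rng(1).standard_normal((197, 64))

cls_token = vit_out[:1]      # (1, 64): global image summary, kept as-is
patch_tokens = vit_out[1:]   # (196, 64): would be compressed by a resampler

print(cls_token.shape, patch_tokens.shape)  # → (1, 64) (196, 64)
```

The backbone would then consume the CLS token plus the resampled patch tokens, instead of all 197 tokens.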
This is nice work, and I hope to see its open-source version soon!
Best, StarCycle