facebookresearch / r3m

Pre-training Reusable Representations for Robotic Manipulation Using Diverse Human Video Data
https://sites.google.com/view/robot-r3m/
MIT License

How to recover text grounding from visual encoder #27

Open · zhouliang-yu opened 1 year ago

zhouliang-yu commented 1 year ago

Hey, thanks so much for sharing this repo! Since R3M is trained via contrastive learning, it should have learned to align its visual representations with text embeddings. Based on this, I wonder whether there is an efficient way, when using R3M, to decode the textual grounding of a given visual representation. One approach I can think of is to use a pre-trained captioning model to generate captions and then infer the description from them, as sketched below. What do you think of it?
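
For concreteness, here is a rough sketch of the captioning route. To be clear, BLIP, the checkpoint name, and `frame.png` are my own choices for illustration and not part of R3M; and since, as far as I can tell, the released R3M checkpoint is an image encoder with no text decoder, the caption here is generated from the raw frame rather than decoded from the R3M embedding itself:

```python
# Sketch only: BLIP and "frame.png" are placeholder choices, not part of R3M.
import torch
import torchvision.transforms as T
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
from r3m import load_r3m

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1) R3M visual embedding for the frame (R3M expects pixel values in [0, 255]).
r3m = load_r3m("resnet50").to(device).eval()
frame = Image.open("frame.png").convert("RGB")
to_tensor = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor()])
with torch.no_grad():
    emb = r3m(to_tensor(frame).unsqueeze(0).to(device) * 255.0)  # (1, 2048)

# 2) Textual grounding via an off-the-shelf captioner on the same frame.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
).to(device)
inputs = processor(images=frame, return_tensors="pt").to(device)
with torch.no_grad():
    out = captioner.generate(**inputs, max_new_tokens=30)
caption = processor.decode(out[0], skip_special_tokens=True)
print(caption)  # short natural-language description of the scene
```

The caption could then be paired with the R3M embedding as its textual grounding, though it isn't guaranteed to match what R3M's contrastive training actually aligned to.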