Hey, thanks so much for sharing this repo!
Since R3M is trained via contrastive learning, it should have learned to align visual representations with text embeddings. Based on this, I wonder whether there is an efficient way, when using R3M, to decode the textual grounding of a given visual representation.
One approach I can think of is to run a pre-trained captioning model on the same frame to generate captions and then use those as the description; a rough sketch of what I have in mind is below. What do you think?
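To make the idea concrete, here is a minimal sketch of the pipeline I'm imagining: embed a frame with R3M and, in parallel, caption the same frame with an off-the-shelf captioner (I use BLIP here purely as an example), so the caption serves as an approximate textual grounding for the embedding. The captioner choice, the file name, and the preprocessing are my assumptions, not something the R3M repo provides.

```python
import torch
import torchvision.transforms as T
from PIL import Image
from r3m import load_r3m  # assuming the repo's load_r3m entry point
from transformers import BlipProcessor, BlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

# R3M visual embedding (R3M expects 224x224 inputs in the [0, 255] range, if I read the README right)
r3m = load_r3m("resnet50").to(device).eval()
image = Image.open("frame.png").convert("RGB")  # hypothetical frame
preprocess = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor()])
pixels = preprocess(image).unsqueeze(0) * 255.0
with torch.no_grad():
    embedding = r3m(pixels.to(device))  # (1, 2048) for the ResNet-50 backbone

# Approximate textual grounding via an off-the-shelf captioner (example: BLIP)
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
).to(device)
inputs = processor(images=image, return_tensors="pt").to(device)
caption_ids = captioner.generate(**inputs, max_new_tokens=30)
caption = processor.decode(caption_ids[0], skip_special_tokens=True)

print(embedding.shape, caption)
```

The caption would be my proxy for the textual grounding of that embedding, though of course it comes from a separate model and isn't guaranteed to live in the same language space R3M was aligned to during training.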