Adds visual and text encoders from CLIP for use in RoboTHOR ObjectNav via a new clip_plugin. CLIP can be installed via the clip_plugin extra requirements.
The encoders can be invoked in "zeroshot" mode, where objects are split into seen/unseen sets and their names are encoded with CLIP's text encoder. Alternatively, only the visual encoder can be replaced with CLIP's ResNet.
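To illustrate the "zeroshot" idea, here is a minimal sketch, not the code in this PR. It splits object names into seen/unseen sets and maps each name to a text embedding. The helper names (`split_seen_unseen`, `build_goal_embeddings`, `placeholder_text_encoder`), the object list, and the split size are all hypothetical. The encoder is a hash-based stand-in so the snippet runs without CLIP installed; in practice it would be a callable wrapping CLIP's text encoder.

```python
import hashlib
import random
from typing import Callable, Dict, List, Sequence, Tuple

Embedding = List[float]


def split_seen_unseen(
    object_types: Sequence[str], num_unseen: int, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Deterministically split object names into (seen, unseen) lists.

    Hypothetical helper: the actual split used by the plugin may differ.
    """
    shuffled = sorted(object_types)
    random.Random(seed).shuffle(shuffled)
    unseen = sorted(shuffled[:num_unseen])
    seen = sorted(shuffled[num_unseen:])
    return seen, unseen


def placeholder_text_encoder(name: str, dim: int = 8) -> Embedding:
    """Stand-in for CLIP's text encoder (this is NOT CLIP).

    Hashes the name into a fixed-length, unit-norm vector so the example
    runs without any model download.
    """
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    raw = [b / 255.0 - 0.5 for b in digest[:dim]]
    norm = sum(x * x for x in raw) ** 0.5
    return [x / norm for x in raw]


def build_goal_embeddings(
    object_types: Sequence[str], encode_text: Callable[[str], Embedding]
) -> Dict[str, Embedding]:
    """Encode every object name once so goals can be looked up by name."""
    return {name: encode_text(name) for name in object_types}


# Illustrative object names only; the real target list comes from the task config.
OBJECT_TYPES = ["AlarmClock", "Apple", "BaseballBat", "Bowl", "Laptop", "Television"]

seen, unseen = split_seen_unseen(OBJECT_TYPES, num_unseen=2)
goal_embeddings = build_goal_embeddings(seen + unseen, placeholder_text_encoder)
```

Because unseen object names are embedded the same way as seen ones, an agent trained only on the seen set can still be given an unseen goal at evaluation time. Replacing `placeholder_text_encoder` with a function that calls CLIP's text encoder leaves the rest of the sketch unchanged.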
This pull request introduces 5 alerts when merging e0c2060300ab79bdcea8cd917232424d837d3620 into 9da8674e7781370b4c257eab707a613e953c002f - view on LGTM.com
Training example