Closed Amshaker closed 1 year ago
Hey @Amshaker,
For this you could simply follow the installation/processing steps from the original CLIP repo. The basic idea is to load the ScanNet200 labels as a standard list of strings and encode them with the frozen text encoder of CLIP. Please refer to their codebase for the details, but here is a short example adapted from their page.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Tokenize the label set as a plain list of strings
# (the ... stands for the remaining ScanNet200 labels)
text = clip.tokenize(['wall', 'chair', 'floor', 'table', 'door', 'couch', ...]).to(device)

with torch.no_grad():
    text_features = model.encode_text(text)

print("Text features:", text_features)
Hope this helps! Cheers, David
Thank you @RozDavid for your reply.
One last question, please: do you pass clip.tokenize the unique label strings of each scan, or the pixel-wise strings for all pixels?
For example, take a scan with a spatial shape of 1296 × 968 pixels and 10 unique labels. Do you pass clip.tokenize just the 10 text labels, or 1296 × 968 text labels corresponding to all pixels in a specific order?
Best regards, Abdelrahman.
Hey,
There is no need to pass the text labels to the CLIP encoder every iteration, as they are constant during training. This is why I precomputed these features and provided them in a simple Python dict, which should be sufficient for the standard label set.
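To illustrate the precomputation idea, here is a minimal sketch of building and pickling such a label-to-feature dict. The filename and the 512-dim random stand-in features are hypothetical; in practice the features would come from CLIP's text encoder as in the snippet above, and this step would run once before training.

```python
import pickle
import numpy as np

# Hypothetical stand-in: in practice these would be the CLIP text features
# for the full ScanNet200 label set (ViT-B/32 gives 512-dim embeddings).
labels = ['wall', 'chair', 'floor']
text_features = np.random.randn(len(labels), 512).astype(np.float32)

# Map each label string to its precomputed embedding
feature_dict = {label: feat for label, feat in zip(labels, text_features)}

# Save once; training code only needs to load this pickle
with open('clip_label_features.pkl', 'wb') as f:
    pickle.dump(feature_dict, f)

# During training: load the constant features instead of re-encoding
with open('clip_label_features.pkl', 'rb') as f:
    loaded = pickle.load(f)
```

Since the label set never changes during training, loading this dict replaces any per-iteration call to the text encoder.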
Also, based on your question I think it makes sense to clarify: in this paper we don't use images at any stage, so there is no need for pixel-level features. If you wanted to use our language grounding for image segmentation, that would definitely make sense, and you could use those 10 precomputed unique features as anchors for the contrastive optimization.
Let me know if there is anything unclear, David
It is pretty clear.
Thank you for your valuable reply.
Best regards, Abdelrahman
Hi @RozDavid,
Thank you for your exciting work.
Could you please share the code, or let me know how exactly you feed the segmentation labels to the CLIP text encoder to extract the text embeddings?
You just load the pickle files in your code, but I am interested to know exactly how you generated them.
Thank you so much