RozDavid / LanguageGroundedSemseg

Implementation for ECCV 2022 paper Language-Grounded Indoor 3D Semantic Segmentation in the Wild

Feeding segmentation labels to CLIP #23

Closed. Amshaker closed this issue 1 year ago.

Amshaker commented 1 year ago

Hi @RozDavid,

Thank you for your exciting work.

Could you please share the code, or let me know how exactly you feed the segmentation labels to the CLIP text encoder to extract the text embeddings?

You just load the pickle files in your code, but I am interested in knowing how exactly you generated them.

Thank you so much

RozDavid commented 1 year ago

Hey @Amshaker,

For this, you could simply follow the installation/processing steps from the original CLIP repo. The basic idea is to load the ScanNet200 labels as a standard list of strings and encode them with the frozen text encoder of CLIP. Please refer to their codebase for details, but here is a short example copied from their page.

  import torch
  import clip

  device = "cuda" if torch.cuda.is_available() else "cpu"
  model, preprocess = clip.load("ViT-B/32", device=device)

  # Tokenize the label names (continue the list with the full ScanNet200 label set)
  text = clip.tokenize(['wall', 'chair', 'floor', 'table', 'door', 'couch']).to(device)

  # Encode with the frozen CLIP text encoder
  with torch.no_grad():
      text_features = model.encode_text(text)

  print("Text features:", text_features)

Hope this helps! Cheers, David

Amshaker commented 1 year ago

Thank you @RozDavid for your reply.

One last question, please: do you pass the unique label strings of each scan to clip.tokenize, or the pixel-wise strings for all pixels?

For example, take a scan with a spatial shape of 1296 × 968 pixels and 10 unique labels. Do you pass only the 10 text labels to clip.tokenize, or 1296 × 968 text labels that correspond to all pixels in a specific order?

Best regards, Abdelrahman.

RozDavid commented 1 year ago

Hey,

There is no need to pass the text labels to the CLIP encoder every iteration, as they are constant during training. This is why I precomputed these features and provided them in a simple Python dict, which should be sufficient for the standard label set.
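
For illustration, here is a minimal sketch of how such a precomputed dict could be built and reused, assuming a hypothetical layout that maps each label string to its CLIP text embedding (the file name and exact format shipped with this repo may differ):

  import pickle

  import clip
  import torch

  device = "cuda" if torch.cuda.is_available() else "cpu"
  model, _ = clip.load("ViT-B/32", device=device)

  label_names = ['wall', 'chair', 'floor', 'table', 'door', 'couch']  # extend to the full ScanNet200 label set
  with torch.no_grad():
      text_features = model.encode_text(clip.tokenize(label_names).to(device)).float().cpu()

  # One entry per label name; computed once, reused for every training run.
  label_to_feature = dict(zip(label_names, text_features))
  with open("scannet200_clip_features.pkl", "wb") as f:
      pickle.dump(label_to_feature, f)

  # At training time, just load the dict instead of calling the text encoder.
  with open("scannet200_clip_features.pkl", "rb") as f:
      label_to_feature = pickle.load(f)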

Also, based on your question I think it makes sense to clarify: in this paper we don't use images at any stage, so there is no need for pixel-level features. If you wanted to use our language grounding for image segmentation, that would definitely make sense, and you could use the 10 precomputed unique features as anchors for the contrastive optimization.
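
As a rough illustration of that last idea (my own sketch, not code from this repo), the frozen text embeddings could serve as fixed class anchors for per-pixel image features, e.g. via a cosine-similarity/cross-entropy objective; language_grounded_loss and its arguments are hypothetical names:

  import torch
  import torch.nn.functional as F

  def language_grounded_loss(pixel_features, pixel_labels, text_features, temperature=0.07):
      """Pull each pixel feature toward the CLIP text embedding of its label.

      pixel_features: (N, D) image-branch features, one per labeled pixel
      pixel_labels:   (N,) integer indices into text_features
      text_features:  (C, D) frozen CLIP text embeddings acting as anchors
      """
      pixel_features = F.normalize(pixel_features, dim=-1)
      text_features = F.normalize(text_features, dim=-1)
      logits = pixel_features @ text_features.t() / temperature  # (N, C) cosine similarities
      return F.cross_entropy(logits, pixel_labels)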

Let me know if there is anything unclear, David

Amshaker commented 1 year ago

It is pretty clear.

Thank you for your valuable reply.

Best regards, Abdelrahman