surmenok opened 1 year ago
These are point and bounding box embedding sizes. The released code doesn't have text prompting. FAQ on the site says: "Text prompts are explored in our paper but the capability is not released".
Yes, if you go through the code (in particular `prompt_encoder.py`) you will see that only point and bbox embeddings are supported in the sparse embedding.
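To make the shape concrete: SAM's sparse prompt embeddings are 256-dimensional tokens, and point coordinates are mapped there with a random-Fourier-feature positional encoding. The snippet below is a minimal NumPy sketch of that idea (not the released `prompt_encoder.py` code; the seed and scaling are illustrative only):

```python
import numpy as np

# Sketch of a SAM-style random positional encoding: 2D point coords
# in [0, 1]^2 are projected by a fixed random Gaussian matrix, then
# expanded with sin/cos to a 256-dim sparse prompt embedding.
embed_dim = 256
rng = np.random.default_rng(0)
gauss = rng.normal(size=(2, embed_dim // 2))  # random projection matrix

def encode_points(coords):
    """coords: (N, 2) array of normalized point coordinates."""
    x = (2 * coords - 1) @ gauss              # map to [-1, 1], project
    return np.concatenate([np.sin(2 * np.pi * x),
                           np.cos(2 * np.pi * x)], axis=-1)

pts = np.array([[0.25, 0.5], [0.75, 0.1]])
print(encode_points(pts).shape)  # (2, 256)
```

In the actual model a learned per-type embedding (foreground/background point, box corner) is added on top of this positional term.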
Any plans of releasing text prompt support in the near future?
Text prompt support is essential for properly integrating SAM into vision systems. Are there any plans to release it in the near future?
Thank you for releasing the model. The paper mentions that text prompts are encoded using a pretrained ViT-L/14@336px CLIP model. CLIP embeddings from that model are of size 768, while the Segment Anything prompt embeddings are of size 256. Are there any extra steps for converting CLIP embeddings before feeding them into the model?