surmenok opened 1 year ago
These are point and bounding box embedding sizes. The released code doesn't have text prompting. FAQ on the site says: "Text prompts are explored in our paper but the capability is not released".
Yes, if you go through the code (in particular `prompt_encoder.py`) you will see that only point and bbox embeddings are supported in the sparse embedding.
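To make the shape concrete: SAM's sparse prompt embeddings are 256-dimensional tokens, and point coordinates are mapped there with a random-Fourier-feature positional encoding. The snippet below is a minimal NumPy sketch of that idea (not the released `prompt_encoder.py` code; the seed and scaling are illustrative only):

```python
import numpy as np

# Sketch of a SAM-style random positional encoding: 2D point coords
# in [0, 1]^2 are projected by a fixed random Gaussian matrix, then
# expanded with sin/cos to a 256-dim sparse prompt embedding.
embed_dim = 256
rng = np.random.default_rng(0)
gauss = rng.normal(size=(2, embed_dim // 2))  # random projection matrix

def encode_points(coords):
    """coords: (N, 2) array of normalized point coordinates."""
    x = (2 * coords - 1) @ gauss              # map to [-1, 1], project
    return np.concatenate([np.sin(2 * np.pi * x),
                           np.cos(2 * np.pi * x)], axis=-1)

pts = np.array([[0.25, 0.5], [0.75, 0.1]])
print(encode_points(pts).shape)  # (2, 256)
```

In the actual model a learned per-type embedding (foreground/background point, box corner) is added on top of this positional term.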
Any plans of releasing text prompt support in the near future?
Text prompt support is essential for properly integrating SAM into vision systems. Are there any plans to release it in the near future?
Thank you for releasing the model. The paper mentions that text prompts are encoded using a pretrained ViT-L/14@336px CLIP model. CLIP embeddings from that model are of size 768, while the Segment Anything prompt embeddings are of size 256. Are there any extra steps for converting CLIP embeddings before feeding them into the model?