UKPLab / sentence-transformers

Multilingual Sentence & Image Embeddings with BERT
https://www.SBERT.net
Apache License 2.0

model encode with specified tokens hidden state #2787

Open riyajatar37003 opened 2 months ago

riyajatar37003 commented 2 months ago

Hi,

Can I pass the token position from which I want to extract the embedding? If I want to extract it from a special token other than [CLS], how can I do that in model.encode?

tomaarsen commented 1 month ago

Hello!

There are (at least) 2 ways to do this:

  1. Use output_value="token_embeddings" in model.encode() to get the non-pooled token embeddings. You can then use model.tokenize to find the position of your special token and take the token embedding at that position.
  2. Create a custom Pooling module. Its forward method receives a dictionary of features, and the module has to add a sentence_embedding key to that dictionary.

Both options are sketched below.

Hope this helps. Do note that if a model isn't trained to produce meaningful embeddings at this special token, then the embeddings might not be super useful.
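
Here is a minimal sketch of option 1. The model name, the example sentence, and the choice of [CLS] as the target token are placeholders for illustration; swap in your own model and special token id.

```python
from sentence_transformers import SentenceTransformer

# Placeholder model and sentence for illustration.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
sentence = "An example sentence."

# 1) Per-token (non-pooled) embeddings: shape (num_tokens, hidden_size).
token_embeddings = model.encode(sentence, output_value="token_embeddings")

# 2) Tokenize the same text to find the position of the token you care about.
features = model.tokenize([sentence])
input_ids = features["input_ids"][0]

# We look up the [CLS] token as an example; replace with your own special token id.
target_id = model.tokenizer.cls_token_id
position = (input_ids == target_id).nonzero(as_tuple=True)[0][0].item()

embedding = token_embeddings[position]
print(embedding.shape)  # torch.Size([hidden_size])
```

And a sketch of option 2, using a small custom pooling module that takes the hidden state at a fixed token position. TokenPositionPooling is a made-up name for this example, not part of the library; a module written this way works for encoding, but you would need save/load logic if you want to persist the model with model.save().

```python
from torch import nn
from sentence_transformers import SentenceTransformer, models

class TokenPositionPooling(nn.Module):
    """Hypothetical pooling module: use the hidden state at a fixed token position
    (position 0 is [CLS] for BERT-style models) as the sentence embedding."""

    def __init__(self, position: int = 0):
        super().__init__()
        self.position = position

    def forward(self, features: dict) -> dict:
        token_embeddings = features["token_embeddings"]  # (batch, seq_len, hidden)
        features["sentence_embedding"] = token_embeddings[:, self.position]
        return features

word_embedding_model = models.Transformer("sentence-transformers/all-MiniLM-L6-v2")
model = SentenceTransformer(modules=[word_embedding_model, TokenPositionPooling(position=0)])
embeddings = model.encode(["An example sentence."])
print(embeddings.shape)  # (1, hidden_size)
```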

noobimp commented 1 month ago

Hi, I have a related question. I noticed that MEAN pooling is used in the SBERT paper, while [CLS] pooling is used in the MiniLM paper. If I use sentence-transformers to load a MiniLM model and encode a sentence, which strategy is used? In other words, is the embedding I get the [CLS] embedding or the MEAN-pooled embedding? Thank you for your help!

tomaarsen commented 1 month ago

Hello!

The pooling that is used is stored in the ..._Pooling/config.json file, e.g. https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2/blob/main/1_Pooling/config.json

Here you can see that "pooling_mode_mean_tokens" is True, denoting that MEAN pooling is used. For another model, e.g. https://huggingface.co/Snowflake/snowflake-arctic-embed-m-long/blob/main/1_Pooling/config.json, "pooling_mode_cls_token" is True, denoting CLS pooling.

If a model does not have this file, like bert-base-uncased, then the default of MEAN pooling will be used.
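
You can also check this on a loaded model. A small sketch, assuming the Pooling module sits at index 1 of the module pipeline (true for most models on the Hub) and using its get_pooling_mode_str() helper:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# The Pooling module is typically the second module in the pipeline.
pooling = model[1]
print(pooling)                         # Pooling({'pooling_mode_mean_tokens': True, ...})
print(pooling.get_pooling_mode_str())  # "mean" for this model; "cls" for CLS-pooled models
```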