UKPLab / sentence-transformers

Multilingual Sentence & Image Embeddings with BERT
https://www.SBERT.net
Apache License 2.0

model encode with specified tokens hidden state #2787

Open riyajatar37003 opened 2 months ago

riyajatar37003 commented 2 months ago

Hi,

Can I pass the token position from which I want to extract the embedding? If I want to extract it from a special token other than [CLS], how can I do that in model.encode?

tomaarsen commented 1 month ago

Hello!

There are (at least) 2 ways to do this:

  1. Use output_value="token_embeddings" in model.encode() to get the non-pooled token embeddings. You can then use model.tokenize to find the position of your special token and take the token embedding at that position.
  2. Create a custom Pooling module. Its forward method receives a dictionary of features, and the module has to add a sentence_embedding key to that dictionary.

Both options are sketched below.

Hope this helps. Do note that if a model isn't trained to produce meaningful embeddings at this special token, then the embeddings might not be super useful.
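
Here is a minimal sketch of option 1. The model name, the example sentence, and the choice of [CLS] as the target token are placeholders for illustration; swap in your own model and special token id.

```python
from sentence_transformers import SentenceTransformer

# Placeholder model and sentence for illustration.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
sentence = "An example sentence."

# 1) Per-token (non-pooled) embeddings: shape (num_tokens, hidden_size).
token_embeddings = model.encode(sentence, output_value="token_embeddings")

# 2) Tokenize the same text to find the position of the token you care about.
features = model.tokenize([sentence])
input_ids = features["input_ids"][0]

# We look up the [CLS] token as an example; replace with your own special token id.
target_id = model.tokenizer.cls_token_id
position = (input_ids == target_id).nonzero(as_tuple=True)[0][0].item()

embedding = token_embeddings[position]
print(embedding.shape)  # torch.Size([hidden_size])
```

And a sketch of option 2, using a small custom pooling module that takes the hidden state at a fixed token position. TokenPositionPooling is a made-up name for this example, not part of the library; a module written this way works for encoding, but you would need save/load logic if you want to persist the model with model.save().

```python
from torch import nn
from sentence_transformers import SentenceTransformer, models

class TokenPositionPooling(nn.Module):
    """Hypothetical pooling module: use the hidden state at a fixed token position
    (position 0 is [CLS] for BERT-style models) as the sentence embedding."""

    def __init__(self, position: int = 0):
        super().__init__()
        self.position = position

    def forward(self, features: dict) -> dict:
        token_embeddings = features["token_embeddings"]  # (batch, seq_len, hidden)
        features["sentence_embedding"] = token_embeddings[:, self.position]
        return features

word_embedding_model = models.Transformer("sentence-transformers/all-MiniLM-L6-v2")
model = SentenceTransformer(modules=[word_embedding_model, TokenPositionPooling(position=0)])
embeddings = model.encode(["An example sentence."])
print(embeddings.shape)  # (1, hidden_size)
```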

noobimp commented 1 month ago

Hi, I have a related question. I noticed that MEAN pooling is used in the SBERT paper, while [CLS] pooling is used in the MiniLM paper. If I use sentence-transformers to load a MiniLM model and encode a sentence, which strategy is used? In other words, is the embedding I get the [CLS] embedding or the MEAN-pooled embedding? Thank you for your help!

tomaarsen commented 1 month ago

Hello!

The pooling that is used is stored in the ..._Pooling/config.json file, e.g. https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2/blob/main/1_Pooling/config.json

Here you can see that "pooling_mode_mean_tokens" is True, denoting that MEAN pooling is used. For another model, e.g. https://huggingface.co/Snowflake/snowflake-arctic-embed-m-long/blob/main/1_Pooling/config.json, "pooling_mode_cls_token" is True, denoting CLS pooling.

If a model does not have this file, like bert-base-uncased, then the default of MEAN pooling will be used.
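
You can also check this on a loaded model. A small sketch, assuming the Pooling module sits at index 1 of the module pipeline (true for most models on the Hub) and using its get_pooling_mode_str() helper:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# The Pooling module is typically the second module in the pipeline.
pooling = model[1]
print(pooling)                         # Pooling({'pooling_mode_mean_tokens': True, ...})
print(pooling.get_pooling_mode_str())  # "mean" for this model; "cls" for CLS-pooled models
```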