UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Constantly zero (or near zero) elements of the vectors of some of the older models #2662

Closed: krumeto closed this issue 3 months ago

krumeto commented 6 months ago

Hey team,

We noticed that the vectors produced by some of the older models contain only zeros (or near zeros) at certain positions.

I just wanted to check whether that behaviour is known/expected.

Code to reproduce:

import pandas as pd
from sentence_transformers import SentenceTransformer

sentences = [
    "'To Kill a Mockingbird' is a novel by Harper Lee published in 1960. It was immediately successful, winning the Pulitzer Prize, and has become a classic of modern American literature.",
    "The novel 'Moby-Dick' was written by Herman Melville and first published in 1851. It is considered a masterpiece of American literature and deals with complex themes of obsession, revenge, and the conflict between good and evil.",
    "Harper Lee, an American novelist widely known for her novel 'To Kill a Mockingbird', was born in 1926 in Monroeville, Alabama. She received the Pulitzer Prize for Fiction in 1961.",
    "Jane Austen was an English novelist known primarily for her six major novels, which interpret, critique and comment upon the British landed gentry at the end of the 18th century.",
    "The 'Harry Potter' series, which consists of seven fantasy novels written by British author J.K. Rowling, is among the most popular and critically acclaimed books of the modern era.",
    "'The Great Gatsby', a novel written by American author F. Scott Fitzgerald, was published in 1925. The story is set in the Jazz Age and follows the life of millionaire Jay Gatsby and his pursuit of Daisy Buchanan.",
    "This is an example sentence", 
    "Each sentence is converted",
    'Cat sat on the mat',
    'Hello there!',
    '11 22 33',
    'Провери с кирилица.'  # "Check with Cyrillic." (Bulgarian; non-Latin-script test)
    ]

model_list = [
    'sentence-transformers/all-mpnet-base-v2', 
    'sentence-transformers/all-distilroberta-v1',
    'sentence-transformers/all-MiniLM-L12-v2',
    'sentence-transformers/all-MiniLM-L6-v2'
    ]

for model_name in model_list:
    model = SentenceTransformer(model_name)
    embeddings = model.encode(sentences)
    print("#"*50)
    print(model_name)
    print(pd.DataFrame(embeddings).std().round(5).nsmallest(5))
krumeto commented 6 months ago

I checked a number of other models:

[
    'BAAI/bge-small-en-v1.5',
    'avsolatorio/GIST-all-MiniLM-L6-v2',
    'Snowflake/snowflake-arctic-embed-s',
    'nomic-ai/nomic-embed-text-v1.5',
    'intfloat/multilingual-e5-large',
    'mixedbread-ai/mxbai-embed-large-v1'
]

Of these, avsolatorio/GIST-all-MiniLM-L6-v2 has a very low standard deviation at the same vector positions as all-MiniLM-L6-v2, but it is not exactly zero:

319    0.00005
223    0.00008
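For reference, a minimal sketch of how these near-constant dimensions can be listed, reusing the sentences list from the first snippet; the 1e-4 cutoff is an arbitrary illustrative choice:

import numpy as np
from sentence_transformers import SentenceTransformer

def near_constant_dims(model_name, sentences, threshold=1e-4):
    # Returns the indices of dimensions whose standard deviation across
    # the given sentences falls below the (arbitrary) threshold.
    model = SentenceTransformer(model_name)
    embeddings = model.encode(sentences)  # shape: (n_sentences, dim)
    return np.flatnonzero(embeddings.std(axis=0) < threshold)

print(near_constant_dims('sentence-transformers/all-MiniLM-L6-v2', sentences))
print(near_constant_dims('avsolatorio/GIST-all-MiniLM-L6-v2', sentences))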
tomaarsen commented 6 months ago

Hello!

These are rather interesting findings. I think this speaks to the characteristics of the applied loss function (MultipleNegativesRankingLoss). I'm quite curious: what were your findings on the other models? Nothing out of the ordinary?

Either way, yes, this is not unreasonable. The loss function drove the standard deviation of these dimensions towards zero, presumably because that was optimal for the training objective. It does seem likely that this is "suboptimal": you can likely prune those dimensions and keep (near) identical performance, and perhaps there is room for future loss functions to take better advantage of the entire vector space.
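As a rough illustration of that pruning idea (a sketch, not a benchmark; it reuses the sentences list from above and the same arbitrary 1e-4 cutoff):

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = model.encode(sentences)

# Keep only the dimensions that actually vary across the corpus.
keep = embeddings.std(axis=0) >= 1e-4
pruned = embeddings[:, keep]

# Pairwise cosine similarities should be (near) identical after pruning.
delta = np.abs(cosine_similarity(embeddings) - cosine_similarity(pruned))
print(f"kept {keep.sum()} of {embeddings.shape[1]} dims, max similarity change: {delta.max():.6f}")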

krumeto commented 6 months ago

Quite interesting! I re-ran the test, only slightly expanded (code below). These are the summary stats for the per-dimension standard deviations, sorted by the mean standard deviation. I might have to normalise nomic's vectors.

[Screenshot (2024-05-22): per-model summary statistics of the per-dimension standard deviations, sorted by mean]

import pandas as pd
from sentence_transformers import SentenceTransformer

sentences = [
    "'To Kill a Mockingbird' is a novel by Harper Lee published in 1960. It was immediately successful, winning the Pulitzer Prize, and has become a classic of modern American literature.",
    "The novel 'Moby-Dick' was written by Herman Melville and first published in 1851. It is considered a masterpiece of American literature and deals with complex themes of obsession, revenge, and the conflict between good and evil.",
    "Harper Lee, an American novelist widely known for her novel 'To Kill a Mockingbird', was born in 1926 in Monroeville, Alabama. She received the Pulitzer Prize for Fiction in 1961.",
    "Jane Austen was an English novelist known primarily for her six major novels, which interpret, critique and comment upon the British landed gentry at the end of the 18th century.",
    "The 'Harry Potter' series, which consists of seven fantasy novels written by British author J.K. Rowling, is among the most popular and critically acclaimed books of the modern era.",
    "'The Great Gatsby', a novel written by American author F. Scott Fitzgerald, was published in 1925. The story is set in the Jazz Age and follows the life of millionaire Jay Gatsby and his pursuit of Daisy Buchanan.",
    "This is an example sentence", 
    "Each sentence is converted",
    'Cat sat on the mat',
    'Hello there!',
    '11 22 33',
    'Провери с кирилица.',  # "Check with Cyrillic." (Bulgarian; non-Latin-script test)
    "This framework generates embeddings for each input sentence",
    "Sentences are passed as a list of strings.",
    "The quick brown fox jumps over the lazy dog.",
    "The cat sits outside",
    "A man is playing guitar",
    "The new movie is awesome",
    "The dog plays in the garden",
    "A woman watches TV",
    "The new movie is so great",
    ]

model_list = [
    'sentence-transformers/all-mpnet-base-v2', 
    'sentence-transformers/all-distilroberta-v1',
    'sentence-transformers/all-MiniLM-L12-v2',
    'sentence-transformers/all-MiniLM-L6-v2',
    'BAAI/bge-small-en-v1.5', 
    'avsolatorio/GIST-all-MiniLM-L6-v2',
    'Snowflake/snowflake-arctic-embed-s',
    'nomic-ai/nomic-embed-text-v1.5',
    'intfloat/multilingual-e5-large',
    'mixedbread-ai/mxbai-embed-large-v1'
    ]

st_devs = pd.DataFrame()

for model_name in model_list:
    model = SentenceTransformer(model_name, trust_remote_code=True)
    embeddings = model.encode(sentences)
    summary_stats = pd.Series(pd.DataFrame(embeddings).std().describe(), name=model_name)
    st_devs = pd.concat([st_devs, summary_stats], axis=1)

st_devs.T.sort_values('mean', ascending=False)

Please feel free to close the issue whenever you believe it is fitting! Thank you for the insights!

tomaarsen commented 6 months ago

Very interesting! I indeed think that some of the models are normalized, while others are not. Perhaps with normalize_embeddings=True on model.encode you'll get the clearest results?
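An untested sketch of what I mean, reusing sentences and model_list from your snippet above:

st_devs = pd.DataFrame()

for model_name in model_list:
    model = SentenceTransformer(model_name, trust_remote_code=True)
    # normalize_embeddings=True L2-normalizes each vector, so the
    # per-dimension standard deviations are on a common scale across models.
    embeddings = model.encode(sentences, normalize_embeddings=True)
    st_devs[model_name] = pd.DataFrame(embeddings).std().describe()

st_devs.T.sort_values('mean', ascending=False)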

krumeto commented 6 months ago

Re-ran with normalize_embeddings=True, and the takeaways change significantly and are probably a bit more logical: the smaller models have a slightly higher variance, and the multilingual model has low variance on these mostly English texts. It is interesting how much lower the variance of GIST-all-MiniLM-L6-v2 is compared to all-MiniLM-L6-v2.

[Screenshot (2024-05-22): the same summary statistics, recomputed with normalized embeddings]
tomaarsen commented 6 months ago

Fascinating! In case some folks are interested in these experiments: @avsolatorio @aamir-s18 @zanussbaum @spacemanidol @intfloat

To the best of my knowledge, all of the sentence-transformers models are trained with MultipleNegativesRankingLoss, i.e. InfoNCE/SimCSE/in-batch negatives loss. Why this has resulted in these models having some dimensions always at 0, I'm not sure.
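For anyone unfamiliar with that loss, here is a minimal toy training sketch using the library's classic fit API; the sentence pairs are made up, and within each batch every other example's positive acts as an in-batch negative:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
train_examples = [
    InputExample(texts=['A man is playing guitar', 'Someone strums a guitar']),
    InputExample(texts=['The cat sits outside', 'A cat is in the yard']),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)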

krumeto commented 6 months ago

Thinking aloud: for clustering/topic-modelling use cases (I'm thinking of BERTopic), I am wondering whether simply keeping the N highest-variance vector dimensions, rather than running dimensionality reduction, would make sense. A rough sketch of the idea below.
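Illustrative only, assuming embeddings is the (n_sentences, dim) array from the earlier snippets and N = 256 is an arbitrary target dimensionality:

import numpy as np

N = 256
# Indices of the N dimensions with the highest variance across the corpus.
top_dims = np.argsort(embeddings.var(axis=0))[-N:]
reduced = embeddings[:, top_dims]
print(reduced.shape)  # (n_sentences, N)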

ir2718 commented 6 months ago

> Thinking aloud: for clustering/topic-modelling use cases (I'm thinking of BERTopic), I am wondering whether simply keeping the N highest-variance vector dimensions, rather than running dimensionality reduction, would make sense.

I'm not sure this will work: features that are linear combinations of existing features (which is exactly what PCA constructs) can have higher variance than any of the existing features themselves.
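A tiny made-up numeric demonstration of that point (synthetic data, not embeddings): two correlated features each have variance around 1.09, while their first principal direction has variance around 2.09:

import numpy as np

rng = np.random.default_rng(0)
shared = rng.normal(size=10_000)            # common signal driving both features
x1 = shared + 0.3 * rng.normal(size=10_000)
x2 = shared + 0.3 * rng.normal(size=10_000)

X = np.column_stack([x1, x2])
print(X.var(axis=0))                        # per-feature variance, ~1.09 each

# By symmetry, the first principal direction is (x1 + x2) / sqrt(2).
pc1 = (x1 + x2) / np.sqrt(2)
print(pc1.var())                            # ~2.09, higher than either feature alone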