Useful embedding models from sentence-transformers

EyeofBeholder-NLeSC / orange3-argument

Argument analysis, mining, and visualization add-on for Orange3.

https://research-software-directory.org/software/orange3-argument-add-on

Apache License 2.0


Useful embedding models from sentence-transformers #32

Closed jiqicn closed 1 year ago

jiqicn commented 1 year ago

This thread is for collecting a list of useful models as options to choose from.

For now, we only consider models that have been evaluated (see the full list here). In the future, we could also provide access to all sentence-transformers (ST) models on the HuggingFace model hub.

jiqicn commented 1 year ago

all-* models

The all-* models were trained on all available training data (more than 1 billion training pairs) and are designed as general-purpose models. Some of the models have two versions (v1 and v2); typically, v2 accepts longer input text (more word pieces) than v1. Below, we choose models based on performance. Each option is listed with its performance score and model size.

Options:

all-MiniLM-L12-v1 (68.83, 120MB)
all-mpnet-base-v1 (69.98, 420MB)
all-distilroberta-v1 (68.73, 290MB)
all-roberta-large-v1 (70.23, 1360MB)
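
As a rough illustration of how one of these options might be loaded, here is a minimal sketch using the `SentenceTransformer` class from sentence-transformers. The helper name `embed_sentences` and the `ALL_MODELS` list are hypothetical and not part of the add-on. The model weights are downloaded from the HuggingFace hub on first use, so the import is kept inside the function.

```python
# Candidate all-* models, with names as they appear in this thread.
ALL_MODELS = [
    "all-MiniLM-L12-v1",
    "all-mpnet-base-v1",
    "all-distilroberta-v1",
    "all-roberta-large-v1",
]


def embed_sentences(sentences, model_name="all-mpnet-base-v1"):
    """Encode sentences with a sentence-transformers model (hypothetical helper)."""
    if model_name not in ALL_MODELS:
        raise ValueError(f"unknown model: {model_name}")
    # Imported lazily: loading the library and the model weights is slow.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    # encode() returns one embedding vector per input sentence.
    return model.encode(list(sentences))
```

Calling `embed_sentences(["This is a claim.", "This is a premise."])` should return one vector per sentence; the larger models take noticeably longer to download and run.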

jiqicn commented 1 year ago

Average models

The following models compute the average word embedding for some well-known word embedding methods. They are much faster than the transformer-based models, but the quality of the embeddings is worse (roughly 20 points lower on the performance score).

Options:

average_word_embeddings_glove.6B.300d (49.79, 420MB)
average_word_embeddings_komninos (51.13, 240MB)
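
To make the idea of an averaged word embedding concrete, here is a toy sketch in plain Python. It is not how the models above are implemented: they use pretrained GloVe or Komninos vectors and their own tokenization, whereas the function name `average_word_embedding`, the whitespace tokenizer, and the 2-dimensional vectors below are made up for illustration.

```python
def average_word_embedding(sentence, word_vectors):
    """Average the vectors of the known words in a sentence."""
    tokens = sentence.lower().split()
    # Words without a vector are simply skipped.
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        raise ValueError("no known words in sentence")
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Toy 2-dimensional vectors, invented for this example only.
TOY_VECTORS = {
    "good": [1.0, 0.0],
    "argument": [0.0, 1.0],
}

print(average_word_embedding("Good argument", TOY_VECTORS))  # [0.5, 0.5]
```

Because the sentence vector is just a mean over a lookup table, there is no transformer forward pass, which is where the speed advantage comes from; word order and context are lost, which is consistent with the lower scores.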

jiqicn commented 1 year ago

Multi-lingual models

The following models generate aligned vector spaces, i.e., similar inputs in different languages are mapped close together in vector space. You do not need to specify the input language. Compared with standard English models, these models typically perform worse on English.

Options:

distiluse-base-multilingual-cased-v1 (61.30, 480MB, supports fewer languages than v2, but performs better)
paraphrase-multilingual-MiniLM-L12-v2 (64.25, 420MB)
paraphrase-multilingual-mpnet-base-v2 (65.83, 970MB)

jiqicn commented 1 year ago

Other models

There are also some other models that were trained for particular tasks but also perform quite well in general (see the sentence embedding performance score in this table).

Options:

gtr-t5-large (69.90, 640MB, trained particularly for semantic search)
sentence-t5-large (68.74, 640MB, trained particularly for sentence similarity)

jiqicn commented 1 year ago

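After comparing the options above, all-mpnet-base-v1 is chosen as the default option offered to the user, considering its performance and size: it scores close to the best option (all-roberta-large-v1, 70.23) at less than a third of the size (420MB vs. 1360MB).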

NB: it would be nice to also give this performance score and size information to the user when they are choosing from the models.
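
One possible way to carry the score and size alongside each model name is a small record type whose label can be shown in a selection widget. This is only a sketch: `ModelOption`, `MODEL_OPTIONS`, and the label format are hypothetical names, not the add-on's actual code, and the numbers are the ones quoted in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str
    score: float  # performance score quoted in this thread
    size_mb: int  # approximate model size in MB

    def label(self) -> str:
        """Text to show next to the model in a selection widget."""
        return f"{self.name} (score: {self.score:.2f}, size: {self.size_mb} MB)"


# A few of the options collected above; the default comes first.
MODEL_OPTIONS = [
    ModelOption("all-mpnet-base-v1", 69.98, 420),
    ModelOption("all-MiniLM-L12-v1", 68.83, 120),
    ModelOption("paraphrase-multilingual-mpnet-base-v2", 65.83, 970),
    ModelOption("average_word_embeddings_komninos", 51.13, 240),
]

for option in MODEL_OPTIONS:
    print(option.label())
```

Keeping the metadata in one list means the widget labels and the model names passed to the embedding code cannot drift apart.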

jiqicn commented 1 year ago

USE (Universal Sentence Encoder) can also be a nice option here. However, it's hard to say how the USE model was trained, even after reading the papers.

jiqicn commented 1 year ago

We can also provide TF-IDF as one of the options there.
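
For reference, a TF-IDF option could look roughly like the following plain-Python sketch. The function name `tfidf_vectors` and the whitespace tokenizer are assumptions made for illustration, and the unsmoothed formula used here (term frequency times log(N / df)) is only one of several common variants. In practice a library implementation such as scikit-learn's `TfidfVectorizer` would more likely be used, and it applies a smoothed IDF by default, so its numbers would differ.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return the vocabulary and one TF-IDF vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    # Unsmoothed IDF: log(N / df); terms present in every document get 0.
    idf = {term: math.log(n_docs / doc_freq[term]) for term in vocabulary}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        # Relative term frequency times IDF; empty documents map to all zeros.
        vectors.append(
            [counts[term] / total * idf[term] if total else 0.0 for term in vocabulary]
        )
    return vocabulary, vectors


vocab, vectors = tfidf_vectors(["the claim is strong", "the claim is weak"])
print(vocab)  # ['claim', 'is', 'strong', 'the', 'weak']
```

Unlike the embedding models above, the resulting vectors are sparse and their dimensionality equals the vocabulary size, so they capture word overlap rather than semantic similarity.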