UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Multi-lingual cross-encoders #739

Open drchrisscheier opened 3 years ago

drchrisscheier commented 3 years ago

Hi there, thank you very much for this extremely helpful/useful library!

Quick question: are there multilingual cross-encoders available? Going through the docs, I could not find an explicit reference to or example of using multilingual cross-encoders. When going through the list of available models, I found this candidate, stsb-xlm-r-multilingual.zip - would that be the correct package?

Thank you!

nreimers commented 3 years ago

Currently there are none.

Which one would be interesting for you?

drchrisscheier commented 3 years ago

Thank you very much for the quick reply. For my use case, the priorities would be, in this order:

  1. German
  2. Chinese (simplified)
  3. Spanish
  4. French

Thanks!

nreimers commented 3 years ago

Thanks. But which models, for which tasks, would be relevant?

drchrisscheier commented 3 years ago

Sorry :) The task is to obtain similarity scores between sentence pairs / (sentence, label) pairs, as shown below. I am currently using the model below, which works very well for me. Another application would be along the lines of your semantic search example (using cosine similarity to prefilter the top_n candidates, then re-ranking with a cross-encoder).

from sentence_transformers import CrossEncoder

model = CrossEncoder('sentence-transformers/ce-distilroberta-base-stsb')
scores = model.predict([
    ["He won the game", "strength"],
    ["He won the game", "competition"],
    ["He won the game", "achievement"],
    ["He won the game", "performance"],
    ["He won the game", "caring"],
])

dingusagar commented 3 years ago

I am searching for a multilingual cross-encoder model for the Indonesian language. Are there any such models that can help me?

nreimers commented 3 years ago

@dingusagar There are no trained multi-lingual cross encoders available so far

nickchomey commented 2 years ago

I just made this comment in another similar issue - it should solve this problem.

Has anyone here tried the newest multilingual Cross Encoder model? It uses multilingual versions of the MiniLM and MSMarco datasets. It doesn't appear to be in the SBert documentation, but I just stumbled upon it while browsing HF. https://huggingface.co/cross-encoder/mmarco-mMiniLMv2-L12-H384-v1

There isn't any benchmark data, but this paper seems to have used a fairly similar process and shows that these multilingual datasets/models provide very competitive results when compared to monolingual datasets. https://arxiv.org/pdf/2108.13897.pdf
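
If anyone wants to try it, it seems to load like any other cross-encoder. A minimal sketch (the query/passage pairs are just illustrations):

from sentence_transformers import CrossEncoder

model = CrossEncoder('cross-encoder/mmarco-mMiniLMv2-L12-H384-v1')

# Score query/passage pairs across languages; higher means more relevant
scores = model.predict([
    ["Wie viele Menschen leben in Berlin?", "Berlin hat rund 3,7 Millionen Einwohner."],
    ["How many people live in Berlin?", "Berlin is known for its museums."],
])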

Liweixin22 commented 1 year ago

Hi nickchomey, this multilingual cross-encoder model can't be found on Hugging Face.

nickchomey commented 1 year ago

You can search Hugging Face for models. Here's the same one:

https://huggingface.co/nreimers/mmarco-mMiniLMv2-L12-H384-v1

flitzcore commented 5 months ago

@dingusagar have you found any Indonesian cross-encoder yet?

dingusagar commented 5 months ago

@flitzcore, no, actually I did not invest much time in cross-encoders, since they are computationally expensive when you have lots of pairs to score. The multilingual SBERT bi-encoder models were doing the job for me, with some extra post-processing for my task.
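
For reference, the bi-encoder setup I mean looks roughly like this (the model name and sentences are only examples; paraphrase-multilingual-MiniLM-L12-v2 lists Indonesian among its supported languages):

from sentence_transformers import SentenceTransformer, util

# Multilingual bi-encoder; embeds sentences from ~50 languages into a shared space
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

sentences = ["Dia memenangkan pertandingan itu", "He won the game"]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity between the Indonesian and the English sentence
print(util.cos_sim(embeddings[0], embeddings[1]))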

tomaarsen commented 5 months ago

If you're okay with the license, then perhaps https://huggingface.co/jinaai/jina-reranker-v2-base-multilingual is a solid option?
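
A minimal usage sketch, assuming a recent sentence-transformers version whose CrossEncoder accepts trust_remote_code (the model ships custom modeling code):

from sentence_transformers import CrossEncoder

# trust_remote_code is required because the model defines its own architecture
model = CrossEncoder('jinaai/jina-reranker-v2-base-multilingual', trust_remote_code=True)

scores = model.predict([
    ["Berapa banyak orang yang tinggal di Berlin?", "Berlin memiliki sekitar 3,7 juta penduduk."],
    ["How many people live in Berlin?", "Berlin is known for its museums."],
])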