UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

EN-DE MS-Marco #1011

Open nero-nazok opened 3 years ago

nero-nazok commented 3 years ago

Hi @nreimers, Hi Sentence-transformers community,

First of all, I want to thank you for your continued support throughout the years. I have been following this repository for three years now and I'm amazed by the progress that is made on a monthly basis 👍.

After digging through several forums I discovered that many people are interested in multilingual information retrieval, especially German-English. We are no exception. We currently use your msmarco-distilbert-multilingual-en-de-v2-tmp-lng-aligned model, which already yields good results. Unfortunately, this model seems outdated compared to your amazing v3 models and is therefore not mentioned in your docs.

My kind request / question: could you finish your work on EN-DE information retrieval (based on MS MARCO), if there is enough demand? I think many people have been waiting in anticipation for your EN-DE cross-encoder and v3 bi-encoder for quite some time. 😅

geraldwinkler commented 3 years ago

Hey everyone, we currently use the msmarco-distilbert-multilingual-en-de-v2-tmp-lng-aligned in production. Thus, I can confirm that an EN-DE cross-encoder is a must-have. Cross-encoders are much more fine-grained and result in a significant performance boost. We really look forward to production-ready EN-DE information retrieval models (including a v3 bi-encoder 🙂).

paologruber commented 3 years ago

I can confirm that as well. I have searched all over the internet. There were many requests, but I could hardly find a working model. An EN-DE cross-encoder / bi-encoder is a must-have for us as well.

bmw-friedrich-mayr commented 3 years ago

Same for me. I have been waiting for an EN-DE cross-encoder since this issue was opened in January: https://github.com/UKPLab/sentence-transformers/issues/695. Also, a production-ready EN-DE bi-encoder (v3) would be great!!!

florian-hammertaler commented 3 years ago

Same here, could you possibly have a closer look at this, @nreimers?

nils-tahler commented 3 years ago

Multilingual information retrieval with more than two languages would be awesome, but DE-EN is still our main priority. Please add a multilingual cross-encoder with at least EN and DE to this repository, @nreimers, if it is feasible. I think a lot of people would benefit greatly.

nreimers commented 3 years ago

Sorry for my inactivity on this topic.

Will finish the work on sentence-transformers v2 this week. After that, I have some more free capacity. Will start to upload the German training data and finalize the translation code so that more languages can be supported.

Will then start training with this code: https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/ms_marco/train_cross-encoder-v2.py

Based on mixed German and English data (and potentially more languages once the translated versions are ready).

ace-kay-law-neo commented 3 years ago

Thanks @nreimers, you're the best. Great commitment from your side!

nero-nazok commented 3 years ago

Great support as always @nreimers.

nreimers commented 3 years ago

Hi, SVALabs has trained the first models for German-English Bi- and Cross-Encoder: https://huggingface.co/svalabs/bi-electra-ms-marco-german-uncased https://huggingface.co/svalabs/cross-electra-ms-marco-german-uncased

They also uploaded a translated version of MS MARCO: https://huggingface.co/datasets/svalabs/ms-marco-german-translation-wmt19

Great to see that effort :)

More to come.

florian-hammertaler commented 3 years ago

This is great news @nreimers. Thanks for the info. More and more people become aware of the power of multilingual information retrieval.

janandreschweiger commented 3 years ago

Hey @nreimers, thanks for your fast reply. Are you sure these models support English-German and not just German?

nreimers commented 3 years ago

@janandreschweiger Baran Avinc said these are just for German.

Will release English-German Cross-Encoder models soon.

janandreschweiger commented 3 years ago

Perfect, thank you @nreimers.

barana91 commented 3 years ago

I'm sorry for the confusion, but the "cross" in the name comes from "cross-encoder". Our bi-encoder is also named bi-electra (https://huggingface.co/svalabs/bi-electra-ms-marco-german-uncased). Next time we will choose a different naming scheme.

nero-nazok commented 3 years ago

Hello @nreimers, we have a live demo on Monday, August 2. Do you think that a first version of these new models will be ready by then? If not, I would be happy if you could give us a rough time frame for the bi- and cross-encoders. Thanks a lot.

RachelKer commented 3 years ago

Great work on multilinguality, @nreimers, thank you. I saw you pushed the new multilingual MiniLMv2 models to the model hub; do you plan to use them for CrossEncoder knowledge distillation?

I too am interested in multilingual cross-encoders. I have translated some MS MARCO passages to French using your translation scripts (thanks again), and I have tried distilling with the cross-v2 script into those mMiniLMv2 models, using only the English passages, only the French translated passages, or a mix of both. In every case I get a CUDA memory error a few iterations in (around 9,000), even when the only line of code changed from your script is the model_name (from microsoft/MiniLM-L12-H384-uncased to 'nreimers/mMiniLMv2-L12-H384-distilled-from-XLMR-Large'), since I wanted to test whether distillation with your method from a monolingual teacher to a multilingual student using only English data yields anything.

What GPU did you use for these CrossEncoder-v2 scripts?

nreimers commented 3 years ago

Hi @nero-nazok, sorry for the delays. There is an existing bi-encoder available that is aligned for EN-DE: https://huggingface.co/sentence-transformers/msmarco-distilbert-multilingual-en-de-v2-tmp-lng-aligned

I just started to train a cross-encoder for EN-DE. First results after 30k steps look quite promising: on some toy examples it gave quite good re-ranking results, but a more careful evaluation is still needed. Will release this model this week (and hopefully a better model further down the line).

Had to fix several issues with the neural machine translation generation; it was sadly rather unstable when applied to noisy data such as in the MS MARCO case.

@RachelKer I tested the MiniLMv2 models for the English CrossEncoder, but they were sadly not better than the MiniLMv1 models. Performance was on par or in some cases even worse. I am currently testing the mMiniLMv2 models; maybe they are better than the multilingual MiniLMv1 models.

The max_length parameter is quite important, as the memory requirement grows quadratically with the input length: if the input text is twice as long, 4 times more GPU memory is needed. I currently run the models with a max_length of 350 and a batch size of 16; in that case, about 14 GB of GPU memory is needed.
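As a rough sketch of how these two knobs interact (this is not the exact code from train_cross-encoder-v2.py; the model name and training samples are just placeholders):

from torch.utils.data import DataLoader
from sentence_transformers import CrossEncoder, InputExample

# A shorter max_length keeps the quadratic attention memory in check;
# the DataLoader batch size controls how many pairs are processed at once.
model = CrossEncoder(
    "nreimers/mMiniLMv2-L12-H384-distilled-from-XLMR-Large",
    num_labels=1,
    max_length=350,
)

train_samples = [
    InputExample(texts=["example query", "a relevant passage"], label=1.0),
    InputExample(texts=["example query", "an irrelevant passage"], label=0.0),
]
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=16)
model.fit(train_dataloader=train_dataloader, epochs=1, warmup_steps=0)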

bmw-friedrich-mayr commented 3 years ago

Thanks for the quick reply @nreimers. I also look forward to the new cross- and bi-encoder models (DE, EN).

nreimers commented 3 years ago

Hi @bmw-friedrich-mayr @nero-nazok @RachelKer I trained two EN-DE cross encoders on MS MARCO: https://huggingface.co/cross-encoder/msmarco-MiniLM-L6-en-de-v1 https://huggingface.co/cross-encoder/msmarco-MiniLM-L12-en-de-v1

For some performance numbers, see: https://huggingface.co/cross-encoder/msmarco-MiniLM-L6-en-de-v1#performance

I have not yet trained new bi-encoders.

On the test datasets (TREC DL for EN-EN and DE-EN re-ranking, GermanDPR for DE-DE re-ranking), the cross-encoder models perform quite well. We also see quite a nice boost compared to the bi-encoder models.

janandreschweiger commented 3 years ago

Thanks for all of your effort @nreimers. Great work as always! EN-DE retrieval models are really of interest to us. I look forward to the new bi-encoder.

bmw-friedrich-mayr commented 3 years ago

Thanks @nreimers for publishing them in time!!!

nero-nazok commented 3 years ago

Thanks @nreimers! I tried your cross-encoder model today. At first glance, everything worked perfectly. The model understands the meaning of the text in both English and German. However, I have come across a devastating weakness that affects all cross-encoders. We have a classic setup with a bi-encoder and a cross-encoder, as you describe here: https://www.sbert.net/examples/applications/information-retrieval/README.html

The problem: as soon as you search for short keywords or proper names, the cross-encoders no longer work, especially if the models do not know these names.

Example:
Search query: "Prospekt_Tigon_DE.pdf"
Document 1: "Prospekt_Tigon_DE.pdf"
Document 2: "Prospekt_Stromerzeuger_DE.pdf"
In this case, the second document gets a higher score than the first. This problem is most likely due to the re-ranking dataset, since all cross-encoders are affected, but not the bi-encoders.

Tested models:
cross-encoder/msmarco-MiniLM-L12-en-de-v1 (cross): problem occurs
cross-encoder/msmarco-MiniLM-L-6-v2 (cross): problem occurs
msmarco-distilbert-multilingual-en-de-v2-tmp-lng-aligned (bi): no problem
msmarco-distilbert-base-tas-b (bi): no problem

Since real users usually search with keywords and proper names, this problem unfortunately makes the cross-encoders unusable. Therefore, we currently only have a pure bi-encoder search, which is a shame. Do you have any idea how to solve this problem, @nreimers?

ace-kay-law-neo commented 3 years ago

Same for me @nreimers. This problem makes the models worse than a keyword search. The bi-encoders, on the other hand, work perfectly. Hopefully this can be fixed soon. Thanks in advance!

Also many thanks to you @nero-nazok for writing a detailed report.

nreimers commented 3 years ago

Hi @nero-nazok thanks for reporting back. This is an interesting case.

For bi-encoders, cossim(A, A) = 1, so a perfect match will always rank highest. For a cross-encoder, this is not necessarily the case. What might help in your case is a query classifier. See here for a long thread with implementations: https://github.com/deepset-ai/haystack/issues/611

Basically, you use the CrossEncoder only when you notice that the query is a question or a more complex query; e.g., you could check whether the query contains spaces. Or, when only a single hit (or 2-3 hits) contains an exact string match, don't run the CrossEncoder.
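A minimal sketch of such a heuristic (the function name and thresholds are just illustrative, not part of sentence-transformers):

def should_use_cross_encoder(query: str, exact_match_hits: int) -> bool:
    # Single-token queries (file names, IDs, proper names) are usually better
    # served by keyword / bi-encoder ranking than by the CrossEncoder.
    if " " not in query.strip():
        return False
    # If only a handful of hits contain an exact string match,
    # that exact-match ranking is probably what the user wants.
    if 1 <= exact_match_hits <= 3:
        return False
    return True

# Example usage:
# if should_use_cross_encoder("Prospekt_Tigon_DE.pdf", exact_match_hits=1):
#     scores = cross_encoder.predict([(query, doc) for doc in candidate_docs])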

When I run the following example:

from sentence_transformers import CrossEncoder

model_name='cross-encoder/msmarco-MiniLM-L12-en-de-v1'
model = CrossEncoder(model_name, max_length=512)

query = "Prospekt_Tigon_DE.pdf"
docs = ["Prospekt_Tigon_DE.pdf", "Prospekt_Stromerzeuger_DE.pdf"]
print(model_name)
print(model.predict([(query, doc) for doc in docs]))

model_name='cross-encoder/msmarco-MiniLM-L6-en-de-v1'
model = CrossEncoder(model_name, max_length=512)

print(model_name)
print(model.predict([(query, doc) for doc in docs]))

I get as an output:

cross-encoder/msmarco-MiniLM-L12-en-de-v1
[8.510739  3.3845112]

cross-encoder/msmarco-MiniLM-L6-en-de-v1
[8.483513  1.9748554]

So both cross-encoder give document 1 (Prospekt_Tigon_DE.pdf) a much higher score than document 2 (Prospekt_Stromerzeuger_DE.pdf).

Did you use some other query / docs? Or were these keywords embedded in more text?

Would be great if you could post a working (minimal) code example so that I can have a more detailed look.

nero-nazok commented 3 years ago

Hi @nreimers, thanks for your quick reply! The problem was on my side. We use Docker and save the models in the file system. I mistakenly saved the model with the SentenceTransformer class, as for the bi-encoder.

Saving:

model = SentenceTransformer(model_name, device="cpu")  # <-- CHANGE TO CrossEncoder
model.save("/var/models/cross-encoder")

Loading:

model = CrossEncoder("/var/models/cross-encoder", device="cpu", max_length=512)
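For reference, a minimal corrected version (same paths as above, CrossEncoder on both sides):

from sentence_transformers import CrossEncoder

# Saving
model = CrossEncoder("cross-encoder/msmarco-MiniLM-L12-en-de-v1", device="cpu")
model.save("/var/models/cross-encoder")

# Loading
model = CrossEncoder("/var/models/cross-encoder", device="cpu", max_length=512)
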
ace-kay-law-neo commented 3 years ago

Thanks @nero-nazok, I made a similar mistake. The cross-encoder models work perfectly now @nreimers.

nero-nazok commented 3 years ago

I just want to mention one shortcoming that we encountered. We develop an enterprise search, so we have long text files on various topics. The problem is that for transformer models we have to divide a text file into its paragraphs. As a result, the context of the document is often lost.

Example: We have two documents, each of which describes a vehicle. These paragraphs are found with the following search text:
Search query: How big is the panther's water tank?
Found paragraph in Prospekt_Tigon_DE.pdf: The vehicle has a water tank with a capacity of 9,000 liters.
Found paragraph in Prospekt_Panther_DE.pdf: The vehicle has a water tank with a capacity of 5,000 liters.
The search, however, does not recognize which vehicle the found paragraphs refer to.

Our solution ideas:
A: The context can usually be found on the first page of a document. Our search system currently adds the score of the first page to the score of the best paragraph. The problem is that this solution only affects the cross-encoder but not the bi-encoder.
B: One could calculate an embedding for the whole document and add it to each vector. We also thought about adding a fraction of the first-page embedding to each vector, e.g. final_vector = first_page_vector * 0.2 + initial_vector (see the sketch below).
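A rough sketch of idea B (the texts and the 0.2 weight are just placeholders):

from sentence_transformers import SentenceTransformer

bi_encoder = SentenceTransformer("msmarco-distilbert-multilingual-en-de-v2-tmp-lng-aligned")

first_page = "Prospekt Panther: product brochure for the Panther vehicle"  # made-up first page
paragraph = "The vehicle has a water tank with a capacity of 5,000 liters."

first_page_vector = bi_encoder.encode(first_page)
initial_vector = bi_encoder.encode(paragraph)

# Blend a fraction of the first-page embedding into each paragraph vector
# so that the document context is carried along.
final_vector = first_page_vector * 0.2 + initial_vector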

Maybe you have an idea about this problem @nreimers?

bmw-friedrich-mayr commented 3 years ago

@nreimers @nero-nazok I can confirm that this is a major drawback of a neural search. It is probably the biggest challenge my team and I are facing right now. A good solution to this problem would fundamentally change the game.

paologruber commented 3 years ago

Hello @nreimers! First off, thank you very much for the great models. EN-DE information retrieval is super important nowadays and has numerous fantastic use cases.

I did a detailed performance analysis of your 4 DE-EN models. If the search documents were in English, your models were almost always able to find the right result. But I noticed a significant drop in performance with German documents. Later I found your benchmarks, which confirmed that to me: https://huggingface.co/cross-encoder/msmarco-MiniLM-L6-en-de-v1#performance

In particular, the bi-encoders suffer from this issue. GermanDPR DE-DE:

BM25: 35.85
msmarco-distilbert-multilingual-en-de-v2-tmp-lng-aligned: 37.88
msmarco-distilbert-multilingual-en-de-v2-tmp-trained-scratch: 38.32

That is a pity because 80% of our documents are German. Will the next version of Bi-encoders overcome this weakness?

bmw-friedrich-mayr commented 3 years ago

@paologruber I made some tests with about 150 custom queries. Unfortunately, I experience this issue as well. I think @nreimers mentioned that he will work on a new bi-encoder when the cross-encoders are finished.

nero-nazok commented 3 years ago

Is there already a timeline @bmw-friedrich-mayr?

nreimers commented 3 years ago

@nero-nazok @bmw-friedrich-mayr Yes, it is a challenge that there are paragraphs where it is not clear what the pronouns refer to.

If available, a good option is to encode your paragraphs like this: title+" "+paragraph

Where title reflects what the document is about. For Wikipedia, it would be the article title. You could also try to see if there is a "paragraph" title.

This allows the model to capture the larger context of the paragraph.
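A minimal sketch of this (the title and paragraph below are made up):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("msmarco-distilbert-multilingual-en-de-v2-tmp-lng-aligned")

title = "Prospekt Panther"  # document or paragraph title, if available
paragraph = "The vehicle has a water tank with a capacity of 5,000 liters."

# Prepending the title lets the embedding carry the document context.
embedding = model.encode(title + " " + paragraph)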

=========

I am currently working on more authentic training data for German. Hopefully it leads to an improvement. I will also test training additionally on GermanDPR, which should boost the German-German capability of the model.

nero-nazok commented 3 years ago

Thanks @nreimers for your answer. Can the new training data be used to close the DE performance gap of both bi- and cross-encoders?

Although the EN-EN and DE-EN capabilities are really useful, the German-German performance is the most important one for our business. Unfortunately, there are almost no good German retrieval models, therefore I am really grateful for your work. Many thanks, and good luck with the new models.

nreimers commented 3 years ago

@nero-nazok Not sure, will see if it helps training and by how much.

Cross-lingual retrieval is usually quite challenging, and sadly not much good training data is available for other languages.

florian-hammertaler commented 3 years ago

Thanks @nreimers, the performance of most German information retrieval models is quite low. We really appreciate your effort to move the ball forward. A score that is comparable to English would be a game changer for us.

janandreschweiger commented 3 years ago

Great @nreimers, hopefully this will improve the overall performance. I have a bi-encoder, cross-encoder setup. Unfortunately, when just searching with a combination of keywords (e.g. "Audi CEO", "BMW revenue"), the model is significantly weaker than a keyword-based search.

ace-kay-law-neo commented 3 years ago

Thanks @nreimers for your work on DE-EN cross-encoders. Like everyone else, I also look forward to a new DE-EN bi-encoder. It would be awesome to have great German performance, similar to English.

In addition, we also experience issues when searching through longer paragraphs (~250 words per paragraph) when the keywords are unknown to the model. This unfortunately happens often, as we use the model on technical documents. It would be great if the model could still work at least as well as a keyword search when the words are unknown. Do you have any ideas/plans regarding this, @nreimers? Maybe there is another dataset that could be applied afterwards. Thank you!!

nreimers commented 3 years ago

Hi @ace-kay-law-neo Semantic search is not a replacement for keyword search; it is a complement. Semantic search will always struggle when you search for keywords, and even more so when the keywords were unknown during training or have no direct meaning (like error codes you are searching for).

Hence, for production settings, it makes sense to combine semantic search with keyword search, which is also known as hybrid search. Have a look at: http://ceur-ws.org/Vol-2696/paper_92.pdf https://arxiv.org/abs/1903.08690

You can either run dense and keyword search independently and merge the results (there are different options to do this), or you use one of the hybrid approaches.
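One simple way to merge the two independently produced result lists is reciprocal rank fusion; here is a minimal sketch (the function name and the constant k are just conventions, not a specific library API):

def reciprocal_rank_fusion(keyword_ranking, dense_ranking, k=60):
    # Each argument is a ranking of document ids, best hit first.
    # Documents that rank high in either list receive a high fused score.
    scores = {}
    for ranking in (keyword_ranking, dense_ranking):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Example:
# merged = reciprocal_rank_fusion(["doc3", "doc1"], ["doc1", "doc2"])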

There is not much search software available that lets you run hybrid search. The only one I'm familiar with is Vespa.ai, which might support such a hybrid search.

janandreschweiger commented 3 years ago

Hi @nreimers, my team has tested your EN-DE cross encoder quite a bit and we want to use it in production.

We have a Java environment, so we run the models with an ONNX runtime. We have reimplemented the BertTokenizer in Java, and it works like a charm for your bi-encoder. Unfortunately, the cross-encoder tokenizer is different from any tokenizer I have seen so far. For example, for subword pieces the cross-encoder tokenizer uses a '_' prefix instead of '#' or '##'.

DE-EN Cross Encoder:

cross_model = "cross-encoder/msmarco-MiniLM-L6-en-de-v1"
tokenizer = AutoTokenizer.from_pretrained(cross_model)
print(tokenizer)
print('----------------#####-----------------')
print(tokenizer.__dict__)
print('----------------#####-----------------')
print(tokenizer.vocab)

PreTrainedTokenizerFast(name_or_path='cross-encoder/msmarco-MiniLM-L6-en-de-v1', vocab_size=250002, model_max_len=1000000000000000019884624838656, is_fast=True, padding_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': '<mask>'})
----------------#####----------------
{'_tokenizer': <tokenizers.Tokenizer at 0x552acb0>,
 '_decode_use_source_tokenizer': False,
 'init_inputs': (),
 'init_kwargs': {'bos_token': '<s>',
  'eos_token': '</s>',
  'sep_token': '</s>',
  'cls_token': '<s>',
  'unk_token': '<unk>',
  'pad_token': '<pad>',
  'mask_token': AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=True),
  'special_tokens_map_file': '/root/.cache/huggingface/transformers/8ed73a1ab9ef4e90a9451497bf96cfc38d34354352838a143f2dda1c81aed5ca.0dc5b1041f62041ebbd23b1297f2f573769d5c97d8b7c28180ec86b8f6185aa8',
  'name_or_path': 'cross-encoder/msmarco-MiniLM-L6-en-de-v1',
  'sp_model_kwargs': {}},
 'name_or_path': 'cross-encoder/msmarco-MiniLM-L6-en-de-v1',
 'model_max_length': 1000000000000000019884624838656,
 'padding_side': 'right',
 'model_input_names': ['input_ids', 'attention_mask'],
 'deprecation_warnings': {},
 '_bos_token': '<s>',
 '_eos_token': '</s>',
 '_unk_token': '<unk>',
 '_sep_token': '</s>',
 '_pad_token': '<pad>',
 '_cls_token': '<s>',
 '_mask_token': '<mask>',
 '_pad_token_type_id': 0,
 '_additional_special_tokens': [],
 'verbose': True,
 'vocab_file': '/home/jan/.cache/huggingface/transformers/2d153550dea7e047f8398edccfe5ebb510023a37082daa8996ef5da53f10e27a.71e50b08dbe7e5375398e165096cacc3d2086119d6a449364490da6908de655e'}
----------------#####----------------
{'▁најголем': 74987,
 'yı': 8788,
 'cour': 139108,
 '▁шуурхай': 201319,
 '關鍵': 69471,
 '▁sky': 20704,
 '▁метою': 53873,
 '▁ёстой': 35230,
 '▁7.2': 134776,
... 250000 more lines

Regular Cross Encoder:

cross_model = "cross-encoder/ms-marco-MiniLM-L-6-v2"
tokenizer = AutoTokenizer.from_pretrained(cross_model)
print(tokenizer)
print('----------------#####-----------------')
print(tokenizer.__dict__)
print('----------------#####-----------------')
print(tokenizer.vocab)

PreTrainedTokenizerFast(name_or_path='cross-encoder/ms-marco-MiniLM-L-6-v2', vocab_size=30522, model_max_len=512, is_fast=True, padding_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})
----------------#####----------------
{'_tokenizer': <tokenizers.Tokenizer at 0x7ebf4d0>,
 '_decode_use_source_tokenizer': False,
 'init_inputs': (),
 'init_kwargs': {'do_lower_case': True,
  'unk_token': '[UNK]',
  'sep_token': '[SEP]',
  'pad_token': '[PAD]',
  'cls_token': '[CLS]',
  'mask_token': '[MASK]',
  'tokenize_chinese_chars': True,
  'strip_accents': None,
  'model_max_length': 512,
  'name_or_path': 'cross-encoder/ms-marco-MiniLM-L-6-v2',
  'do_basic_tokenize': True,
  'never_split': None,
  'special_tokens_map_file': '/home/jan/.cache/huggingface/transformers/3295d833faab1b0a5258c61d5d6ba3db7c2414aca8614a8503c6deb89fc00611.dd8bd9bfd3664b530ea4e645105f557769387b3da9f79bdb55ed556bdd80611d',
  'tokenizer_file': None},
 'name_or_path': 'cross-encoder/ms-marco-MiniLM-L-6-v2',
 'model_max_length': 512,
 'padding_side': 'right',
 'model_input_names': ['input_ids', 'token_type_ids', 'attention_mask'],
 'deprecation_warnings': {},
 '_bos_token': None,
 '_eos_token': None,
 '_unk_token': '[UNK]',
 '_sep_token': '[SEP]',
 '_pad_token': '[PAD]',
 '_cls_token': '[CLS]',
 '_mask_token': '[MASK]',
 '_pad_token_type_id': 0,
 '_additional_special_tokens': [],
 'verbose': True,
 'do_lower_case': True}
----------------#####----------------
{'##ent': 4765,
 'endeavour': 26911,
 'winning': 3045,
 'vaccines': 28896,
 '##ries': 5134,
 'sessions': 6521,
 '115': 10630,
 'drummond': 19266,
 '##iaceae': 23357,
 '‚': 1522,
 'allies': 6956,
... 119000 more lines

As you can see, the two cross-encoder tokenizers are quite different, although they come from the same class, PreTrainedTokenizerFast. It would be awesome if you could give us some information about how this tokenizer works, so we can replicate it in Java.

Thanks a lot @nreimers!!

nreimers commented 3 years ago

Hi @janandreschweiger Happy to hear that :)

Multilingual models usually use a SentencePiece tokenizer: https://github.com/google/sentencepiece

Found this Java version; I don't know if it works: https://github.com/levyfan/sentencepiece-jni

janandreschweiger commented 3 years ago

Thank you @nreimers. This helps us a lot.

But your bi-encoder (sentence-transformers/msmarco-distilbert-multilingual-en-de-v2-tmp-trained-scratch) uses a regular BertTokenizer with just a different vocabulary, right?

Sample vocabulary of your bi-encoder:

##olo
われた
ปี
війни
lungo
##σία
2010년
##ությամբ
##най
Kunst
eigen
##eria
##ime
روستا
##cker
thức
أنه
##til
hotel
Bereich
Valle
samen
##య
##rada
علي
nreimers commented 3 years ago

@janandreschweiger Yes, the bi-encoder is based on DistilmBERT, which uses word pieces.

WordPiece has several issues, especially for multilingual tokenization. Hence, more recent multilingual transformer models use SentencePiece instead of WordPiece tokenization.
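A quick way to see the difference between the two schemes (the exact subword splits depend on the vocabularies):

from transformers import AutoTokenizer

wordpiece_tok = AutoTokenizer.from_pretrained(
    "sentence-transformers/msmarco-distilbert-multilingual-en-de-v2-tmp-trained-scratch")
sentencepiece_tok = AutoTokenizer.from_pretrained(
    "cross-encoder/msmarco-MiniLM-L6-en-de-v1")

text = "Wassertank"
# WordPiece marks word-internal pieces with '##';
# SentencePiece marks word-initial pieces with '▁' (U+2581), not a plain '_'.
print(wordpiece_tok.tokenize(text))
print(sentencepiece_tok.tokenize(text))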

nickchomey commented 1 year ago

Has anyone here tried the newest multilingual cross-encoder model? It is based on a multilingual MiniLM model and a multilingual version of the MS MARCO dataset. It doesn't appear to be in the SBERT documentation, but I just stumbled upon it while browsing HF. https://huggingface.co/cross-encoder/mmarco-mMiniLMv2-L12-H384-v1

There isn't any benchmark data, but this paper seems to have used a fairly similar process and shows that these multilingual datasets/models provide very competitive results when compared to monolingual datasets. https://arxiv.org/pdf/2108.13897.pdf

@nreimers I made this same comment in various other issues that I found so that a) more people can learn about this and b) it can all be consolidated in one place for you to close a bunch of issues.

Since this seems to be an important innovation and there are surely many other issues that I didn't find/tag, perhaps it would be worth adding this model to the SBERT documentation, and maybe even making some sort of announcement?

Edit: It would also be interesting to see how this new dataset, MIRACL, compares: https://github.com/project-miracl/miracl