Multilingual Information Retrieval (MS-Marco Bi-Encoders)

janandreschweiger commented 3 years ago

Hey everyone! First of all, congratulations for your new Information Retrieval models. They are absolutely amazing.

My Question / Kind Request We currently use your new Bi-Encoder msmarco-distilroberta-base-v2 and desperately need a german-english model. There are great models like T-Systems-onsite/cross-en-de-roberta-sentence-transformer (EN-DE) and paraphrase-xlm-r-multilingual-v1 (50+ languages) that do a great job. I think that there are a lot of people out there who would like to have such a model fine-trained on ms-marco.

The Reason Although T-Systems-onsite/cross-en-de-roberta-sentence-transformer and paraphrase-xlm-r-multilingual-v1 are outstanding models, they don't perform good in a real world search-system. They don't work well with short query inputs and longer paragraphs in the knowledge base. Your Bi-Encoders are a game changer. They make transformers really applicable.

Keep it up You always surprise me about with what transformers are capable of. Thanks for all of your effort and keep up the great work!

ace-kay-law-neo commented 3 years ago

Hi @janandreschweiger, I also think that Bi-Encoders have now become more important that classical textual similarity models. A German-English model (or at least a multilingual one) is something we also need.

nreimers commented 3 years ago

Hi @janandreschweiger @ace-kay-law-neo Happy to hear that, thanks for the compliments :)

I currently work on creating multi-lingual MS MARCO models for about 20 languages, including German.

But creating this model takes quite some time, as the translating the training data takes quite some time. But for German the translation is luckily already done, so if I find the time I hope I can release a German-English model sooner.

janandreschweiger commented 3 years ago

Wow, thanks for the quick response @nreimers! Your are really moving fast. I really look forward to your german-english model 👍

Thank you!!

nero-nazok commented 3 years ago

Thank you @nreimers. My team and I have been looking for a german-english model all over the internet.

ace-kay-law-neo commented 3 years ago

Thanks @nreimers, that is really nice of you!

geraldwinkler commented 3 years ago

Hey @nreimers, my firm is particularly interested in the German-English model. So making a prerelease is very kind of you.

We have a public demo on 26th Jan. Do you think that the model will be ready until then, as the translation has already been done?

Thanks again @nreimers for putting so much effort into this framework. We have been following your repository for over 2 years now and it is definitely the best of its kind.

nreimers commented 3 years ago

@geraldwinkler The training for a German-English model just started. I hope I coded everything correctly and that it will result in a functional model. I will keep you updated

geraldwinkler commented 3 years ago

Thanks @nreimers! It would be great if the model is available until monday. But please take the time it requires for training. Accuracy is more important than publishing the model as early as possible!

Great support from your side!

nreimers commented 3 years ago

Hi @geraldwinkler This should be do-able.

The first model is ready, on TREC-19-DL the NDCG@10 scores are (English Queries & Documents only): English-Model: 68.4 English-German-Model: 65.5

So we see some drop compared to the English-only model. I tested it also in the cross-lingual direction, i.e., German queries vs. English documents and English queries vs. German documents: The returned hits look quite promising.

I have some other experiments running to see if they can narrow the performance gap between the English-only and the bi-lingual model.

janandreschweiger commented 3 years ago

Hey @nreimers that are really good news!

Regarding the drop in performance, have you thought about just fine-tuning a semantic similarity model on MS-Marco? Models like T-Systems-onsite/cross-en-de-roberta-sentence-transformer already have a profound semantic understanding of multiple languages. In addition, fine-tuning a semantic similarity model would definitely overcome the bottlenecks described in this article: https://towardsdatascience.com/why-you-should-not-use-ms-marco-to-evaluate-semantic-search-20affc993f0b

ace-kay-law-neo commented 3 years ago

@janandreschweiger I agree with that

nreimers commented 3 years ago

Hi @janandreschweiger @ace-kay-law-neo Models trained / tuned on STS work quite badly for semantic search if your tasks involves a short query (like a question) and retrieval of a longer paragraph that provides the answer.

So here are some examples of the msmarco-distilbert-base-v2 model vs. T-Systems-onsite/cross-en-de-roberta-sentence-transformer model (retrieval over 50k passages from Simple Wikipedia, Top-1 hit):

Query: What is the capital of france? msmarco: Paris (nicknamed the ""City of light"") is the capital city of France, and the largest city in France. The area is , and around 2.15 million people live there. If suburbs are counted, the population of the Paris area rises to 12 million people. sts: The name "France" comes from the Latin word Francia ', which means "land of the Franks" or "Frankland".

Query: What is the best orchestra in the world? msmarco: Some of the greatest orchestras today include: the New York Philharmonic Orchestra, the Boston Symphony Orchestra, the Chicago Symphony Orchestra, the Cleveland Orchestra, the Los Angeles Philharmonic Orchestra, the London Symphony Orchestra, the London Philharmonic Orchestra, the BBC Symphony Orchestra, the Royal Concertgebouw Orchestra, the Vienna Philharmonic Orchestra, the Berlin Philharmonic Orchestra, the Leipzig Gewandhaus Orchestra, the , the St Petersburg Philharmonic Orchestra, the Israel Philharmonic Orchestra, and the NHK Symphony Orchestra (Tokyo). Opera houses usually have their own orchestra, e.g. the orchestras of the Metropolitan Opera House, La Scala, or the Royal Opera House. sts: A large orchestra is often called a “symphony orchestra”. This is to distinguish it from a small orchestra called a “chamber orchestra”.

STS benchmark has several issues:

It is extremely narrow in the topics it covers. There are basically no sentences included that require some domain knowledge. Nearly all sentences are rather simple sentences and most are rather artificial stemming from SNLI describing images. If your corpus just have sentences like "A plane is taking off.", then using STSb is fine. But if you have more complex sentences like 'Attention deficit hyperactivity disorder (ADHD) is a neurodevelopmental disorder characterized by inattention, or excessive activity and impulsivity, which are otherwise not appropriate for a person's age.', then the STS models are quite lost.
STS benchmark has an unrealistic, skewed score distribution: If you randomly select two sentences, the chances that these are semantically similar is extremely small. However, in STSb you have a rather uniform score distribution and highly similar pairs are as likely as dissimilar pairs. This makes the STSb dataset rather bad for information retrieval / semantic search. For the query "What is the capital of France", only few out of the ~40 Million paragraphs in the English Wikipedia contain the answer, 39.999... million paragraphs will be irrelevant.

Regarding the shared blog post: The situation is not as bad as described in the blog post. Every dataset for semantic search / information retrieval has a selection bias. Is is impossible to create a dataset without a selection bias, because otherwise you would need to label for every query 8 Million documents (for the MS MARCO case). This is of course not possible.

So as the author writes at the end:

But despite all those remarks, the most important point here is that if we want to investigate the power and limitations of semantic vectors (pre-trained or not), we should ideally prioritize datasets that are less biased towards term-matching signals. This might be an obvious conclusion, but what is not obvious to us at this moment is where to find those datasets since the bias reported here are likely present in many other datasets due to similar data collection designs.

The described biases are well known and a long recognized problem. But there are no good solutions for it, especially not at scale.

So in conclusion:

Is the MS MARCO dataset perfect: No
Is the MS MARCO dataset useful for training and evaluation: Yes, I definitely think so. It is far better than many other datasets that could be used (STSbenchmark, SQuAD)
Are there better datasets: The Natural Questions (NQ) dataset from Google is interesting, but more targeted on answering questions using Wikipedia. Models trained on NQ will be released soon. So for question-answer retrieval with Wikipedia, I think NQ is better. For broader retrieval, I cannot say yet which is better (MS MARCO / NQ). MS MARCO has also many keyword style queries and non-wikipedia-answerable queries like "weather san diego", which matches broader what people are searching for.
Will a better score on MS MARCO mean better performance on my task: No. At some point, models will be too specialized on MS MARCO and its selection bias, i.e., they don't get better, they are just better overfitting on MS MARCO selection bias. I sadly don't know at which point (i.e. score range) this will happen.
Is a perfect dataset for semantic search possible? Sadly not, creating one even for evaluation is rather expensive. And sadly you always have some selection bias in it

janandreschweiger commented 3 years ago

Thanks for the clarification @nreimers!

nreimers commented 3 years ago

Hi @janandreschweiger @geraldwinkler @ace-kay-law-neo @nero-nazok

I uploaded two models: msmarco-distilbert-multilingual-en-de-v2-tmp-trained-scratch This model was trained from scratch with a DistilBERT-multilingual model on the English and translated German queries.

msmarco-distilbert-multilingual-en-de-v2-tmp-lng-aligned This model used multilingual knowledge distillation to make the English msmarco-distilbert model multilingual.

Performance (Query-Language vs Corpus Language):	Model	TREC 19 EN-EN (NDCG@10)	TREC 19 DE-EN (NDCG@10)
msmarco-distilbert-multilingual-en-de-v2-tmp-trained-scratch	65.51	58.69
msmarco-distilbert-multilingual-en-de-v2-tmp-lng-aligned	65.45	59.6

I would be happy if you could share your feedback, i.e. do these model work? Which one of the two works better?

janandreschweiger commented 3 years ago

Thanks for publishing the models @nreimers. I really appreciate the hard work you're putting into this repository!! I will run some tests over the night and share my results tomorrow morning.

geraldwinkler commented 3 years ago

@nreimers Give yourself a 🚀! I made some short tests and I have to say that the results are absolutely astonishing. My team will test your model in detail next week. Thank you!!

nreimers commented 3 years ago

@geraldwinkler Thanks, happy to hear that :)

janandreschweiger commented 3 years ago

Hey @nreimers, I compared your models with 40 handcrafted queries targeting multilingual semantic meaning. I can confirm that the results are quite good. Unfortunately, the model sometimes fails at capturing the semantic meaning of abbreviations (e.g. VM for virtual machine, AI/KI for artificial intelligence, MS for Microsoft).

In addition, the msmarco-distilbert-multilingual-en-de-v2-tmp-lng-aligned seems to outperform the msmarco-distilbert-multilingual-en-de-v2-tmp-trained-scratch model.

Some Questions:

As you have mentioned, there is a drop in performance. Do you have any plans on how the model could be further improved?
We have seen that a higher number of languages decrease the accuracy. From their name I guess that these are just tmp models. But what do you think about staying with the DE-EN only model (and maybe some for other languages)? This issue shows that there is quite some demand. I think this is much better than having a weak model with 20 languages.
Our use case requires an understanding of some basic technical terms. Is it a good idea to fine-tune your model on the stackoverflow dataset? I don't have time for this now, but maybe in a month.

Again thanks for all your effort. I was deeply impressed of how good your first models already work. I stay tuned about what is further possible!

nreimers commented 3 years ago

@janandreschweiger Great, thanks for the feedback. Also for the feedback which model works better: Both were trained in quite different ways and I am currently try to figure out which works better

Unfortunately, the model sometimes fails at capturing the semantic meaning of abbreviations (e.g. VM for virtual machine, AI/KI for artificial intelligence, MS for Microsoft). Yes, abbreviations can be quite difficult, especially when they did not appear in the training data.

As you have mentioned, there is a drop in performance. Do you have any plans on how the model could be further improved? We have seen that a higher number of languages decrease the accuracy. From their name I guess that these are just tmp models. But what do you think about staying with the DE-EN only model (and maybe some for other languages)? This issue shows that there is quite some demand. I think this is much better than having a weak model with 20 languages. Currently I work on improving the multilingual knowledge distillation approach that was used to train the 'msmarco-distilbert-multilingual-en-de-v2-tmp-lng-aligned' model. It is rather slow when longer text has to be embedded. I hope with a more efficient version this will lead to better results.

Regarding multiple languages vs. two languages: This will depend on the results. It depends on the capacity of the model: At some point, more languages requires more capacity (larger model) and adding more languages will decrease the performance. I will have to see how many languages will fit in the distil-models: Maybe 2 languages are optimal, maybe 10 or 20 languages work without problem.

Our use case requires an understanding of some basic technical terms. Is it a good idea to fine-tune your model on the stackoverflow dataset? I don't have time for this now, but maybe in a month. When trained on this dataset, the model will be able to answer programming questions, i.e., it will have a good understanding about technical terms.

For your use-case, it could make sense to try and fine-tune the ms-marco further on such a dataset

AmitChaulwar commented 3 years ago

So, if I understand correctly, these multilingual semantic search models are trained from scratch using the script train_bi-encoder.py on the multilingual MS Marco dataset? Have you documented somewhere, which 20 languages are you going to use?

nreimers commented 3 years ago

So, if I understand correctly, these multilingual semantic search models are trained from scratch using the script train_bi-encoder.py on the multilingual MS Marco dataset?

msmarco-distilbert-multilingual-en-de-v2-tmp-trained-scratch was trained like this. The other (lng-aligned) model was trained using multilingual knowledge distillation

Have you documented somewhere, which 20 languages are you going to use? I have not decided on this yet. Likely, I will use the top k most common languages (the languages where the most parallel training data is available).

AmitChaulwar commented 3 years ago

Thanks for replying on weekend :). Are you planning to open source the translated dataset? We were also thinking of creating the translated dataset. But, it would be nice (obviously) if we find it somewhere. Perhaps, we could generate the data for other languages, if you describe the procedure and we can also open-source it.

It is still unclear to me how did you distill the QA model with multilingual knowledge distillation. I actually tried it with Ted dataset and msmarco-distilbert-base-v2 model as teacher model (perhaps you remember my question in another issue) but I did not get good results. So, here you used MS Marco dataset in English-German as parallel sentence dataset? Is it possible to open source this particular script as well?

nreimers commented 3 years ago

Hi @AmitChaulwar I will try to open-source it. But I have to check with the MS license if this is allowed.

It is still unclear to me how did you distill the QA model with multilingual knowledge distillation. I actually tried it with Ted dataset and msmarco-distilbert-base-v2 model as teacher model (perhaps you remember my question in another issue) but I did not get good results. So, here you used MS Marco dataset in English-German as parallel sentence dataset? Is it possible to open source this particular script as well?

The issue when you use the TED 2020 dataset is:

The model has to understand queries like "weather San Diego", but TED 2020 has only complete (well formulated) queries
Even worse: The multilingual MS MARCO model has to map longer text spans (passages) with several hundred words in a vector space. In the TED 2020 corpus, you only have individually sentences.

So I think when you use multilingual knowledge distillation with the TED2020 corpus, the mapping of the queries will not be so good and the mapping of longer passages will be quite bad.

So what did I do: 1) I translated all queries using EasyNMT (will be released tomorrow) 2) I translated all passages in MS MARCO with EasyNMT 3) I adapted the make_multilingual.py by increasing the max_seq_length and train_max_sentence_length (to match the longer nature of the MS MARCO model). I then trained with 3 datasts: TED2020, the translated queries dataset, and the translated passages dataset.

I think when you download the model: https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/v0.2/msmarco-distilbert-multilingual-en-de-v2-tmp-lng-aligned.zip

It includes the train.py script

The multilingual knowledge distillation was quite slow here, as it had to match quite long text spans. So my plan is to make this part significantly faster. I will then open-source the complete pipeline of translating & training.

AmitChaulwar commented 3 years ago

Thanks for the detailed response. I did try using MS Marco (English) dataset for distillation. But, I am trying way to smaller student models. I know the results with smaller models will be lower but they are far lower than I expected.

Looking forward to your complete pipeline of translating & training.

ace-kay-law-neo commented 3 years ago

He @nreimers, thanks for making a prerelease. I really appreciate your work. Just like @janandreschweiger and some others, I am especially interested in the German-English model. Do you think the performance can be further improved to achieve top performance? Thanks for all your effort 👍

nreimers commented 3 years ago

Do you think the performance can be further improved to achieve top performance?

I am working on it.

Note: The leaderboard on the linked website shows all type of approaches.

The top-ranking approaches usually combine dense-retrieval with a cross-encoder. This combination works much better for MS MARCO than just using the bi-encoder. On the pre-trained site, I just report scores for only the dense-retrieval. Pre-trained cross-encoder models are also available, which give quite a performance boost (for the MS MARCO dataset).

Also some of the top-approaches are computationally infeasible for any real-case scenarios, as they need a minute or longer to return the hits. For a leader board competition this is no issue if your search approach needs a minute per query. But I don't know any user who would be happy to wait so long to get the hits returned.

ace-kay-law-neo commented 3 years ago

Thanks for your detailed explanation @nreimers. As you said Crossencoders give a quite good performance boost on the MS MARCO dataset. Do you think that a search for technical topics like ours would benefit from them?

nreimers commented 3 years ago

Hi @ace-kay-law-neo Yes, I think it could be helpful.

Cross-Encoders have several advantages compared to dense encoders:

They generalize better across domains, i.e., you trained on domain X but apply it on domain Y. Cross-encoder achieve better results than bi-encoders (https://arxiv.org/abs/2010.08240)
Cross-Encoders are more data efficient: You get better results with fewer data (https://arxiv.org/abs/2010.08240)
Cross-Encoders are better at fine-grained checks.

For example (real world example from https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/information-retrieval/qa_retrieval_simple_wikipedia.py), when you query 'How many people live in London' a bi-encoder might find as closest sentence/paragraph the following: 'It has 2,000 inhabitants.'

The cross-encoder directly spots that this document does not mention London and that it is unclear what 'it' is. But it is quite hard for a bi-encoder to do such fine comparisons.

ace-kay-law-neo commented 3 years ago

Thanks for your reply @nreimers. We'll definitely try out crossencoders.

datistiquo commented 3 years ago

@nreimers Are your models above suitable for queries and answers in both german suitable or just for cross language (DE-EN)? msmarco-distilbert-multilingual-en-de-v2-tmp-trained-scratch

nreimers commented 3 years ago

@datistiquo Both models can be used for all 4 combinations: EN-EN, EN-DE, DE-EN, DE-DE

datistiquo commented 3 years ago

@nreimers These models from above were trained in a bi-encoder manner, right? At another place you stated that these can also be usable as a base for cross-encoder? Both arcitectures are of course different in training and task, why do you think that this is possible? I can somehow understand the direction of bi-encoder -> cross-encoder. But what is about using a cross-encoder model in a bi-encoder setting?

nreimers commented 3 years ago

@datistiquo I just mentioned that it is possible, as bi- and cross-encoder use the same architecture (BERT). If it make sense and achieves better results: I don't know, I have not tested it.

AmitChaulwar commented 3 years ago

The multilingual knowledge distillation was quite slow here, as it had to match quite long text spans. So my plan is to make this part significantly faster. I will then open-source the complete pipeline of translating & training.

Do you have any rough estimation when can you open source at least this?

nreimers commented 3 years ago

I started to put some code here, but sadly not that much yet: https://github.com/UKPLab/sentence-transformers/tree/master/examples/training/ms_marco/multilingual

It should be available until mid March (latest).

janandreschweiger commented 3 years ago

Hi @nreimers! We have achieved great results with your EN-DE bi-encoder so far. Thank you for your continuous support!

As you explained, cross-encoders significantly boost performance. Will there also be an EN-DE cross-encoder if there is enough demand?

geraldwinkler commented 3 years ago

Hey @janandreschweiger @nreimers. That's funny, I was about to ask the same question. A cross-encoder for german-english is definitely something many would benefit from. 👍

nreimers commented 3 years ago

Hi @janandreschweiger @geraldwinkler Happy to hear that the EN-DE model work well. Thanks for the feedback.

Just started the training for DistilmBERT for MS MARCO Cross Encoder. Classification tasks work quite well when you train mBERT without having specific labeled data in your language. Will see if it will also work well for Cross-Encoders and returns sensible labels for German.

Otherwise need to train it also on the translated corpus.

PaulForInvent commented 3 years ago

@nreimers I assume these tmp models are cased model? Do you have any hints on what kind of preprocessing of my text data I have to do to match your training process? So, what have you done on your training data?

nreimers commented 3 years ago

Hi @PaulForInvent Yes, the models are cased. No specific preprocessing was used, the MS MARCO dataset was used as is.

nreimers commented 3 years ago

Hi @janandreschweiger @geraldwinkler Just training the cross-encoders with a multi-lingual model on the English data sadly did not yield the best results for German. Hence, translated data must be incorporated into the training process to improve the performance. Will keep you updated.

janandreschweiger commented 3 years ago

Hey @nreimers thanks for your regular updates. I really look forward to your EN-DE model. Hopefully, the translated data will yield better results. As always thanks for your amazing commitment 👍.

datistiquo commented 3 years ago

Hi @janandreschweiger @geraldwinkler

May I ask for what task you want to apply these models? How do you handle uncased sentences? I tried spacy to case nouns, but this does not catch all words in german...

janandreschweiger commented 3 years ago

He @datistiquo, sorry for replying so late. We use this model for searching hundreds of long text documents (e.g. PDFs). So in our case the knowledge base is cased correctly most of the time.

nero-nazok commented 3 years ago

Hi @nreimers are there any updates regarding the DE-EN cross-encoder that @janandreschweiger and @geraldwinkler requested? In addition, I'd like to know if the msmarco-distilbert-multilingual-en-de-v2-tmp-lng-aligned is still the best bi-encoder for DE-EN. I am curious, because the name (tmp) indicates that it is not the final model and you mentioned some time ago that you plan on further improving its accuracy.

Thank you!

nreimers commented 3 years ago

So far no updates and no new models.

nero-nazok commented 3 years ago

Ok, thanks for the response @nreimers.

peterchiuglg commented 3 years ago

Hi @janandreschweiger @ace-kay-law-neo Happy to hear that, thanks for the compliments :)

I currently work on creating multi-lingual MS MARCO models for about 20 languages, including German.

But creating this model takes quite some time, as the translating the training data takes quite some time. But for German the translation is luckily already done, so if I find the time I hope I can release a German-English model sooner.

Really looking forward to it! Will it support Chinese too? Well done guys btw!!!

nreimers commented 3 years ago

@peterchiuglg Yes, Chinese will also be part of it.

peterchiuglg commented 3 years ago

@peterchiuglg Yes, Chinese will also be part of it.

Hi @nreimers , thanks a lot! Would it be possible to share when you plan to release it?

UKPLab / sentence-transformers

Multilingual Information Retrieval (MS-Marco Bi-Encoders) #695