deepset-ai / haystack-tutorials

Here you can find all the Tutorials for Haystack 📓
https://haystack.deepset.ai/tutorials
Apache License 2.0

Tutorial 09: Update to EmbeddingRetriever Training #35

Closed: bglearning closed this issue 2 weeks ago

bglearning commented 2 years ago

Overview

With deepset-ai/haystack#2887, we replaced DPR with EmbeddingRetriever in Tutorial 06.

Now, we might want to do the same for Tutorial 09, which covers training (or fine-tuning) a DPR Retriever model.

Q1. Should we go ahead with this switch? Is there any reason why keeping DPR might be better?

Alternatively, we could create one for each. I guess it depends on which we want to demonstrate, plus what we think might be valuable for users.

Training EmbeddingRetriever

Only the sentence-transformers variant of EmbeddingRetriever can be trained.

Its train method does some data setup and then calls the fit method on SentenceTransformer (from the sentence_transformers package).

Input data format is:

[
    {"question": ..., "pos_doc": ..., "neg_doc": ..., "score": ...},
    ...
]

It uses MarginMSELoss (as part of the GPL procedure).
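
For concreteness, a minimal sketch of what this looks like end to end, assuming Haystack's sentence-transformers based EmbeddingRetriever; the model name, document store, and example values below are placeholders of my choosing, not something prescribed by this issue:

from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import EmbeddingRetriever

# Placeholder setup: any sentence-transformers model could be used here
retriever = EmbeddingRetriever(
    document_store=InMemoryDocumentStore(embedding_dim=768),
    embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1",
    model_format="sentence_transformers",
)

training_data = [
    # "score" is the cross-encoder margin between pos_doc and neg_doc (as in the GPL procedure)
    {
        "question": "What is Haystack?",
        "pos_doc": "Haystack is an open-source framework for building search systems.",
        "neg_doc": "The weather in Berlin is usually mild in spring.",
        "score": 3.2,
    },
]

retriever.train(training_data=training_data)  # internally calls SentenceTransformer.fit with MarginMSELoss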

Q2. If we were to demonstrate its training, which dataset would be best to use? GPL et al. seem to use MS MARCO, but then we need cross-encoder scores for the score field above, right? So there doesn't seem to be a ready-to-download dataset available?

RFC: @brandenchan @vblagoje @agnieszka-m (please loop in anyone else if necessary) cc: @mkkuemmel

mkkuemmel commented 2 years ago

Q1. I think doing a 9A and 9B tutorial would be good? I'd let them coexist. I could also be convinced to replace the existing (DPR) one, but it's always a little painful to "throw away" useful information! 😄

Q2. The data format strikes me as somewhat difficult, also for users, due to the score field. Is there a different loss we could implement so we can use an easier-to-create, more accessible data format? @vblagoje @julian-risch do you have any experience with this? What are your suggestions?

vblagoje commented 2 years ago

I think 9A and 9B make total sense. Before we dive into EmbeddingRetriever training, though: does it even make sense now, with these V3 models from sentence-transformers? Perhaps one would only need GPL adaptation for particular domain data. @mkkuemmel and the team might have better insights from the field...

mathislucka commented 2 years ago

There is this dataset with cross-encoder scores for MarginMSELoss: https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives

I'd vote for implementing MultipleNegativesRankingLoss because MarginMSE is already used in GPL and MNRL also yields very good results. What do you mean by v3 models @vblagoje ?
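
For reference (this is not Haystack code), roughly what MNRL training looks like at the sentence_transformers level, which is the layer EmbeddingRetriever.train calls into; model name and examples are placeholders:

from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("sentence-transformers/multi-qa-mpnet-base-dot-v1")

# (query, positive) pairs are enough; a hard negative can optionally be added as a third text
train_examples = [
    InputExample(texts=["what is haystack", "Haystack is an open-source framework for search."]),
    InputExample(texts=["capital of france", "Paris is the capital of France.", "Berlin is the capital of Germany."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# All other documents in a batch are treated as negatives, so no teacher scores are needed
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)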

mathislucka commented 2 years ago

Oh and training definitely makes sense. If you have labeled data, you will get much better results with training than with the out-of-the-box models.

vblagoje commented 2 years ago

Ok, cool, good to know @mathislucka. It's a rough naming scheme Nils used for his msmarco (and likely other) models. So the latest models we want are likely these v5 models trained with MarginMSE loss?

bglearning commented 2 years ago

So to take stock:

1. Does training EmbeddingRetrievers make sense? Yes, it definitely helps if labeled data is available.

2. Which sentence-transformer model(s) do we suggest for out-of-the-box use? It now makes sense to use and promote the v5 models (?).

3. What procedure do we suggest for fine-tuning? What format must the data be in? Possible options (see the sketch after this list):
   - Opt-1: Suggest users convert/collect data into the format with teacher encoder scores. Use MarginMSE loss.
   - Opt-2: We add in support for MultipleNegativesRankingLoss and suggest its usage (as the user wouldn't need teacher encoder scores).
   - Opt-3: ...?

Which do we go for?
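
To make the trade-off concrete, a rough sketch of the per-record training data each option would imply; field names follow the format described at the top of this issue, and the MNRL record shape is an assumption, since that support didn't exist yet:

# Opt-1 (MarginMSE): every record needs a teacher/cross-encoder score
margin_mse_record = {"question": "...", "pos_doc": "...", "neg_doc": "...", "score": 2.7}

# Opt-2 (MultipleNegativesRankingLoss): no score needed; negatives could even be optional,
# since the other documents in a batch act as negatives
mnrl_record = {"question": "...", "pos_doc": "...", "neg_doc": "..."}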

mkkuemmel commented 2 years ago

As for the options, my higher-level opinion is to find a good trade-off between "scientific correctness" and feasibility for the users. About the latter:

convert/collect data into the format with teacher encoder scores

How feasible is this for users? Can we guide them on how to do it?

About "scientific correctness": Which one of the losses seems to be the more sensible for this task?

mathislucka commented 2 years ago

My 2 cents:

Which sentence-transformer model(s) do we suggest for out-of-the-box use? Now it makes sense to use and promote v5 models (?)

The v5 models are only for msmarco. We have seen with clients that all-mpnet-base-v2 or multi-qa-mpnet-base-dot-v1 usually perform best, so I think we should recommend these models. Maybe add multi-qa-MiniLM-L6-cos-v1 as an option for a small and fast model, and paraphrase-multilingual-mpnet-base-v2 as a multilingual model.
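
As a hedged illustration of how those recommendations could be plugged in (the similarity settings reflect my reading of the respective model cards, not anything stated in this thread):

from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import EmbeddingRetriever

# multi-qa-mpnet-base-dot-v1 was trained for dot-product similarity, so the store should match
document_store = InMemoryDocumentStore(embedding_dim=768, similarity="dot_product")

retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1",
    model_format="sentence_transformers",
)

# Alternatives mentioned above:
#   "sentence-transformers/all-mpnet-base-v2"                      (general purpose, cosine)
#   "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"              (small and fast, cosine, 384-dim)
#   "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"  (multilingual, cosine)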

What procedure do we suggest for fine-tuning? What format must the data be in?

Go with Opt-2 as it is simpler than MarginMSE and existing datasets can be re-used.

bglearning commented 2 years ago

Hi,

So, based on the discussions above, I am pivoting to adding MultipleNegativesRankingLoss support to the training of EmbeddingRetriever. I opened an issue for it here: deepset-ai/haystack#3136

I can get back to this tutorial rework once that is resolved/completed.

sinchanabhat commented 2 years ago

Hi,

Can we expect the tutorial on fine-tuning the EmbeddingRetriever soon, maybe using GPL training data?

bglearning commented 2 years ago

Hi @sinchanabhat,

Ya, the tutorial is coming soon-ish. Can't commit to a time frame but a median estimate could be end of next week. 😅

In the meantime, you can check out this notebook showcasing GPL training or this one with MultipleNegativesRankingLoss. (Edit: there is already a tutorial for GPL training based on the first notebook. Please check out that one 😄).

The latter was a recent change following the discussions above (you can see the details in the PR: deepset-ai/haystack#3164).

Neither notebook is a Tutorial per se (so not as polished), but they might still be helpful.
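
For anyone skimming this thread, the GPL notebook boils down to roughly the following; this is my paraphrase of the PseudoLabelGenerator flow, and the linked notebook/tutorial remains the authoritative reference:

from haystack.nodes import EmbeddingRetriever, PseudoLabelGenerator, QuestionGenerator

# Assumes document_store already contains your (unlabeled) domain documents
# and retriever is a sentence-transformers based EmbeddingRetriever.
question_generator = QuestionGenerator()
pseudo_label_generator = PseudoLabelGenerator(question_generator, retriever)

# Generates questions, mines negatives, and scores pairs with a cross-encoder
output, _ = pseudo_label_generator.run(documents=document_store.get_all_documents())

# The generated records carry question/pos_doc/neg_doc/score, ready for margin-based training
retriever.train(output["gpl_labels"])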

sinchanabhat commented 2 years ago

Thanks a lot for directing me to the notebooks. I have gone through them, and pardon me for asking (even if my question might sound stupid): when we talk about adapting the retriever to GPL data, doesn't the training/fine-tuning involve early stopping or taking the model with the best validation metric? Or is it just running for 5 to 10 epochs and then evaluating how good the retriever is?

bglearning commented 2 years ago

doesn't the training/fine-tuning involve early stopping or taking the model with the best validation metric? Or is it just running for 5 to 10 epochs and then evaluating how good the retriever is?

Ah yes, it would generally involve monitoring and acting on the validation metric, as you mentioned (for instance, performance might plateau after some steps, as in the GPL paper, Figure 2 and Section 6.1). The tutorial is more of a demonstration of how to set up and perform the training.
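
If someone did want early stopping or best-model selection, one option (going through sentence_transformers directly rather than EmbeddingRetriever.train, which is an assumption on my part) is the library's built-in evaluator support; a minimal sketch with toy data:

from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, evaluation, losses

model = SentenceTransformer("sentence-transformers/multi-qa-mpnet-base-dot-v1")

train_examples = [InputExample(texts=["a training query", "a relevant passage"])]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)

# Toy held-out split: queries, corpus, and relevance judgements keyed by id
queries = {"q1": "a held-out query"}
corpus = {"d1": "a relevant passage", "d2": "an unrelated passage"}
relevant_docs = {"q1": {"d1"}}
ir_evaluator = evaluation.InformationRetrievalEvaluator(queries, corpus, relevant_docs, name="val")

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=ir_evaluator,
    evaluation_steps=500,   # evaluate on the held-out split every 500 steps
    save_best_model=True,   # keep the checkpoint with the best validation score
    output_path="trained_model",
    epochs=1,
)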

sinchanabhat commented 2 years ago

Got it! Thank you so much for the quick revert! @bglearning

vibha0411 commented 1 year ago

This is a related question, @bglearning. I see that [https://colab.research.google.com/drive/1Tz9GSzre7JfvXDDKe7sCnO0FMuDViMnN?usp=sharing#scrollTo=TD2PZuuNTpQ3] PseudoLabelGenerator basically runs these steps: question generation (optional), negative mining, and pseudo labeling (margin scoring). But in a case where I already have the positive and negative documents for a question, I would not require the negative mining step; however, as per my understanding, I would still require margin scoring to perform the training.

So how do I go about this scenario? Kindly help.

bglearning commented 1 year ago

Hey @vibha0411 ,

If you use MultipleNegativesRankingLoss (train_loss='mnrl', currently the default), the scores aren't required. [1]

In fact, for MNRL even having the negative docs is optional because it considers all other docs in a batch as negative. The MNRL example colab notebook linked above might be useful. It only uses positive docs but of course it's good if you already have negatives and you can pass those too.

Roughly, it would be something like:

from haystack.nodes import EmbeddingRetriever

embedding_retriever = EmbeddingRetriever(...)  # a sentence-transformers based retriever
training_data = [
    # "neg_doc" is optional for MNRL; other in-batch documents act as negatives
    {"question": ..., "pos_doc": ..., "neg_doc": ...},
    ...
]
embedding_retriever.train(training_data=training_data, train_loss='mnrl')

[1] Margin scoring would be required when training with train_loss='margin_mse'.

vibha0411 commented 1 year ago

Thank you for your reply. Yes, I did have a look at MNRL as well. But in my case, since the customer feedback contains explicit negative examples, I am looking for a loss function that can leverage this handpicked negative feedback.

Hence I thought it might be more sensible to go with margin scoring.

But not sure what is the best way to go about it.

vibha0411 commented 1 year ago

I think I got your point. You mean that MNRL will treat the negative we explicitly provide as the hard negative, so that will help. Got it! Thanks!

vibha0411 commented 1 year ago

Also, is there any way to monitor the training?