UKPLab / sentence-transformers

Multilingual Sentence & Image Embeddings with BERT
https://www.SBERT.net
Apache License 2.0

Supervised Contrastive Learning #987

Open clintg6 opened 3 years ago

clintg6 commented 3 years ago

Dear Nils,

Thank you for this awesome library. I have developed code that takes a paragraph of sentences, converts it into chunks of sentences, and retrieves the document whose paragraph chunk is most similar to a sentence chunk in the query. It's performing well, but I would like to squeeze out more performance. I have a training dataset of ~250 samples. I'm planning to use a contrastive learning loss to fine-tune a pretrained STS SBERT model on this dataset.
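
For reference, a minimal sketch of this kind of chunk-based retrieval, assuming each document is pre-split into chunks; the model name and the `best_doc` helper are only placeholders, not the author's actual code:

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder model; any pretrained STS model could stand in here
model = SentenceTransformer("stsb-distilbert-base")

def best_doc(query_chunks, docs_chunks):
    """Return the index of the document whose chunk is most similar
    to any chunk of the query."""
    q_emb = model.encode(query_chunks, convert_to_tensor=True)
    best_idx, best_score = -1, float("-inf")
    for i, chunks in enumerate(docs_chunks):
        d_emb = model.encode(chunks, convert_to_tensor=True)
        score = util.cos_sim(q_emb, d_emb).max().item()  # best chunk-to-chunk match
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx
```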

I was wondering how you would recommend structuring such a dataset for training/fine-tuning. E.g., given a query with sentence chunks A, B, C and positive document chunks D, E, and F, should I use all possible combinations as positive training examples (and likewise for the negatives)? Or should I only choose the input chunk that is most similar to the chunk in the positive document, and, for the negative examples, the chunk that is least similar to the input chunk?

nreimers commented 3 years ago

Yes, using all combinations sounds sensible.
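
A minimal sketch of the all-combinations pairing, assuming one query and one positive document per training sample (the chunk contents are placeholders):

```python
from itertools import product
from sentence_transformers import InputExample

# Placeholder chunks for one training sample
query_chunks = ["chunk A", "chunk B", "chunk C"]     # chunks of the query
positive_chunks = ["chunk D", "chunk E", "chunk F"]  # chunks of the positive document

# Every query-chunk / document-chunk combination becomes a positive pair
train_examples = [
    InputExample(texts=[q, d], label=1)
    for q, d in product(query_chunks, positive_chunks)
]
```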

PaulForInvent commented 3 years ago

I asked this myself as well, but for single sentences rather than paragraphs: https://github.com/UKPLab/sentence-transformers/issues/978

But I do not understand exactly what your task is.

clintg6 commented 3 years ago

My task is: given an input query, find the document that is most semantically similar. I looked at your issue. Mine is a bit simpler in that I am not treating the chunks of the input query as positive examples of each other. The only positive examples are between the input chunks and the chunks of the document, so the maximum number of samples generated per query and document is polynomially bounded.

You might find this useful. I think you need to build a custom dataloader: #608

PaulForInvent commented 3 years ago

Hey, so your input is a single-sentence query and you compare it with larger chunks of a document? And you retrieve the doc where the query has the highest confidence with any of the chunks?

Do you split in chunks because of the token limitation?

What loss are you using?

PaulForInvent commented 3 years ago

Just found that @nreimers already implemented a special dataloader for this! It would have been great to know about this earlier... 😄 Of course, it only checks for duplicate sentences, not for examples of the same class, but I think extending it is straightforward.

https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/nli/training_nli_v2.py
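
A condensed sketch of that setup, assuming (query chunk, positive chunk) pairs; the base model name is a placeholder. With MultipleNegativesRankingLoss, the other pairs in a batch serve as negatives, so the dataloader makes sure no duplicate texts land in the same batch:

```python
from sentence_transformers import SentenceTransformer, InputExample, losses, datasets

model = SentenceTransformer("stsb-distilbert-base")  # placeholder base model

# (query chunk, positive document chunk) pairs; no explicit negatives needed,
# since the other pairs in a batch act as in-batch negatives
train_examples = [
    InputExample(texts=["query chunk A", "doc chunk D"]),
    InputExample(texts=["query chunk B", "doc chunk E"]),
    # ... more pairs from the training set
]

# NoDuplicatesDataLoader keeps duplicate texts out of a batch, where they
# would otherwise become false negatives for MultipleNegativesRankingLoss
train_dataloader = datasets.NoDuplicatesDataLoader(train_examples, batch_size=2)  # small for this toy set; larger in practice
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```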

clintg6 commented 3 years ago

> Yes, using all combinations sounds sensible.

@nreimers If the samples I have are in English and the model being fine-tuned is a multilingual model, would it negatively affect the retrieval/semantic similarity performance for languages other than English?

> Hey, so your input is a single-sentence query and you compare it with larger chunks of a document? And you retrieve the doc where the query has the highest confidence with any of the chunks?
>
> Do you split in chunks because of the token limitation?
>
> What loss are you using?

@PaulForInvent Yep, although I'm returning the set of, say, the top 15 docs whose chunks had the highest confidence. My input is a single query that is split into chunks because of the token limitation, and because the semantic signal gets washed out when there are too many tokens. I'm testing out a couple: ContrastiveLoss and MultipleNegativesRankingLoss. I've thought of trying TripletLoss and CosineSimilarityLoss. What about you?

> Just found that @nreimers already implemented a special dataloader for this! It would have been great to know about this earlier... 😄 Of course, it only checks for duplicate sentences, not for examples of the same class, but I think extending it is straightforward.
>
> https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/nli/training_nli_v2.py

Great find thanks!
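
For comparison with the MultipleNegativesRankingLoss setup above, a minimal ContrastiveLoss sketch with explicit binary labels; the pairs and model name are placeholders, not the author's actual data:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("quora-distilbert-multilingual")

# ContrastiveLoss expects binary labels: 1 = similar pair, 0 = dissimilar pair
train_examples = [
    InputExample(texts=["query chunk A", "chunk from a positive doc"], label=1),
    InputExample(texts=["query chunk A", "chunk from a negative doc"], label=0),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.ContrastiveLoss(model=model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```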

nreimers commented 3 years ago

Hi @clintg6, just training on English will not yield good results for other languages. See:

https://arxiv.org/abs/2004.09813

clintg6 commented 3 years ago

@nreimers I took a look at it, but one thing still isn't clear. If I take, say, quora-distilbert-multilingual and fine-tune it using ContrastiveLoss on some English sentence data, is it not reasonable to expect a performance boost for English retrieval while keeping retrieval performance similar to the pretrained quora-distilbert-multilingual for queries in other languages?

The reason I ask is that I'm OK if the only boost is in English, as long as I can maintain the current pretrained multilingual performance across languages, because most query input will be in English.

nreimers commented 3 years ago

If you just train on English, performance on other languages starts to suffer.

PaulForInvent commented 3 years ago

> My input is a single query that is split into chunks because of …

OK, so your input is comparably as large as your docs? I think this is symmetric semantic search. My case is short vs. short. For this I'm trying ContrastiveLoss / MultipleNegativesRankingLoss (BatchHard...). I think CosineSimilarityLoss is not as good, since it works with a float label score, which is maybe not such a good idea for a binary task...

clintg6 commented 3 years ago

@nreimers If I created translations of the fine-tuning dataset in all the languages used to pretrain the multilingual model and then fine-tuned the multilingual model with this new dataset, would that overcome the performance issues for the other languages?

@PaulForInvent Yes, for most cases it will be symmetric semantic search, long vs. long.

nreimers commented 3 years ago

Yes, that should work.
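
A minimal sketch of that idea, assuming translated copies of each English pair are available; the translations and language codes are purely illustrative:

```python
from sentence_transformers import InputExample

# One English positive pair plus hypothetical translations into the
# model's other pretraining languages
pairs_by_language = {
    "en": [("query chunk A", "doc chunk D")],
    "de": [("Anfrage-Abschnitt A", "Dokument-Abschnitt D")],
    "fr": [("segment de requête A", "segment de document D")],
}

# Flatten into a single training set so every language sees the same signal
train_examples = [
    InputExample(texts=[q, d], label=1)
    for pairs in pairs_by_language.values()
    for q, d in pairs
]
```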

tide90 commented 3 years ago

> I was wondering how you would recommend structuring such a dataset for training/fine-tuning. E.g., given a query with sentence chunks A, B, C and positive document chunks D, E, and F, should I use all possible combinations as positive training examples (and likewise for the negatives)? Or should I only choose the input chunk that is most similar to the chunk in the positive document, and, for the negative examples, the chunk that is least similar to the input chunk?

So, did you manually label which chunk of the document is the corresponding positive chunk for the input? Even for a positive input-document pair you have these chunks because of the token limitation, and the semantic signal might be in only one specific chunk, which would mean the rest of the chunks need to be labelled as negatives?

clintg6 commented 3 years ago

@tide90 Great question. I have not manually labeled which chunks are positive and negative within a positive document. My hope is that the chunks all carry relevant semantic signal and can all be considered positive. If performance isn't sufficient, I will have to go down the route of manually labeling chunks.

tide90 commented 3 years ago

OK, so you just labeled the whole document as positive with all its chunks, but trained on the individual chunks as positives?

clintg6 commented 3 years ago

I'm testing labeling the whole document as positive. I'm also testing topological data analysis approaches; for instance, say I measure the cosine similarity between the input and a chunk using SBERT: if it doesn't meet a certain similarity threshold I've set, I'll label that chunk as a negative. I will report back with findings when finished.
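
A minimal sketch of the threshold-based labeling described above, assuming a pre-chunked positive document; the model name and threshold value are placeholders:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("quora-distilbert-multilingual")
THRESHOLD = 0.5  # assumed cutoff; would need tuning on held-out data

def label_chunks(query, doc_chunks, threshold=THRESHOLD):
    """Label each chunk of a positive document 1 (positive) or 0 (negative)
    based on its cosine similarity to the query."""
    q_emb = model.encode(query, convert_to_tensor=True)
    c_emb = model.encode(doc_chunks, convert_to_tensor=True)
    sims = util.cos_sim(q_emb, c_emb)[0]  # similarity of the query to each chunk
    return [(chunk, 1 if sim >= threshold else 0)
            for chunk, sim in zip(doc_chunks, sims.tolist())]
```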