UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Creating the training dataset with long documents #226

Open paulomann opened 4 years ago

paulomann commented 4 years ago

Hi Nils, thanks for the fantastic work.

Considering an information retrieval system with the usual two-step approach of (1) BM25 retrieval and (2) re-ranking: do you have any thoughts on the best way to create a labeled training dataset for phase (2) when the documents are long (typically 2000+ words)?

I have seen several approaches in the literature: (a) given an anchor query document q, label the most similar document d; (b) given an anchor query document q, label both the most similar document d and the most dissimilar document n; (c) do the labeling as in (a) and use negative sampling to find the most dissimilar documents n.

Q1: Given that we can create the dataset in multiple ways, I am not sure whether we should choose option (a), (b), or (c). Do you have any thoughts on the best way to label a dataset when documents are long (2000+ words)?

Q2: How many times should an anchor document (query) appear in the dataset with different similar/dissimilar documents?

Q3: In case (a), we use the cross-entropy loss for training and use a subset of the documents (the batch) for the softmax function, right?

Q4: In cases (b) and (c), we use a triplet loss function, is that right?

Q5: We can also train with CosineSimilarityLoss, but then we need a label saying whether a given query is similar to the document it is compared against, right?
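To make the options concrete, this is roughly how I picture the data layouts as sentence-transformers InputExamples (all texts here are placeholders):

```python
from sentence_transformers import InputExample

# (a) query + most similar document, with a similarity label
#     (e.g. 1.0 = similar, 0.0 = dissimilar, as CosineSimilarityLoss expects)
pair = InputExample(texts=['anchor query document', 'similar document'],
                    label=1.0)

# (b) query + labeled similar document + labeled dissimilar document
#     (a triplet, e.g. for a triplet loss)
triplet = InputExample(texts=['anchor query document', 'similar document',
                              'dissimilar document'])

# (c) label pairs as in (a), then find the dissimilar documents by negative
#     sampling; the resulting training examples are triplet-shaped as in (b)
```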

nreimers commented 4 years ago

Hi @paulomann
You usually achieve the best performance when, besides your positive example, you also have a hard negative example. A hard negative example is one that is similar to your positive example but is an invalid result for the anchor.

Getting the hard negative example is sadly not always easy. A good option is to use BM25 to find an example similar to your positive example and then annotate it (is it actually a negative example, or is it maybe another positive example?).
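A rough sketch of this mining step, using the third-party rank_bm25 package (the corpus and texts are placeholders, and the final annotation step remains manual):

```python
from rank_bm25 import BM25Okapi

corpus = ["long document text ...", "another long document ..."]  # your collection
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

# Search with the labeled positive document to find lexically similar documents
positive = "the labeled positive document for some anchor query"
candidates = bm25.get_top_n(positive.lower().split(), corpus, n=10)

# Top hits that are not the positive itself are hard-negative *candidates*;
# they still need annotation, since a lexically similar document can turn
# out to be another positive.
hard_negative_candidates = [c for c in candidates if c != positive]
```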

Yes, you would use a triplet loss function for this.
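A minimal fine-tuning sketch with such (anchor, positive, hard negative) triplets; the texts and hyperparameters below are placeholders:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')

# One InputExample per mined triplet: [anchor, positive, hard negative]
train_examples = [
    InputExample(texts=['anchor query text',
                        'relevant (positive) document',
                        'hard negative document']),
    # ...
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.TripletLoss(model=model)

model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1, warmup_steps=100)
```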

Best Nils Reimers

paulomann commented 4 years ago

Thank you for the insightful answer.

iknoorjobs commented 4 years ago

Hi @nreimers @paulomann

For training SBERT on long documents, we need to increase the sequence length to the maximum (i.e. 512), but according to https://github.com/UKPLab/sentence-transformers/issues/99#issuecomment-574150439, the SBERT models are capped at 128 tokens. Is it possible to change this max sequence length parameter when fine-tuning the existing SBERT models (like distilbert-base-nli-stsb-mean-tokens) on our custom data?

Thanks Iknoor

nreimers commented 4 years ago

@iknoorjobs Have a look here: https://www.sbert.net/docs/usage/computing_sentence_embeddings.html#input-sequence-length
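In short, that page shows the cap is the max_seq_length attribute on the loaded model, which you can raise before fine-tuning (within the underlying transformer's own limit, 512 for BERT/DistilBERT; longer inputs also cost more memory and time):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')
print(model.max_seq_length)  # 128 by default for this model

# Raise the cap before encoding or fine-tuning
model.max_seq_length = 512
```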

iknoorjobs commented 4 years ago

@nreimers Great. Thanks for your prompt response.