UKPLab / sentence-transformers

Multilingual Sentence & Image Embeddings with BERT
https://www.SBERT.net
Apache License 2.0

How does it compare to S-BERT + Elastic vector field? #99

Closed pommedeterresautee closed 4 years ago

pommedeterresautee commented 4 years ago

Follow-up of https://github.com/koursaros-ai/nboost/issues/32 (since it's more about S-BERT than nboost).

The task I am working on is similarity search between documents of 100 to 10,000 characters (one to a few paragraphs, most of the time more than one sentence). I have pairs of semantically related documents; most of the time they have a large vocabulary gap, though not always. For each positive pair I generate a random negative pair. Positive pairs get a similarity score of 1 and negative pairs a score of 0. The dataset holds 30K positive examples with very little noise (manual random check).

I tried 2 strategies, S-BERT and simple TF-IDF (no search engine; I just built my own sparse matrix and do neighbourhood search on it). For S-BERT I use CosineSimilarityLoss. I tried triplet loss in the past but the results were disappointing: within the very first batches the triplet loss already reached perfect results, which makes sense as the data are easy to guess. CosineSimilarityLoss is harder since it is a kind of regression, but I still rapidly reach the maximum Pearson correlation (0.99); Spearman tops out at 0.86.
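
For reference, my S-BERT training setup looks roughly like this (the model name, data, and hyperparameters below are illustrative placeholders, not my exact configuration):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Placeholder for the ~30K labeled pairs described above:
# (doc_a, doc_b, label) with label 1.0 for positive, 0.0 for negative pairs.
labeled_pairs = [
    ("some document text", "a semantically related document", 1.0),
    ("some document text", "a random unrelated document", 0.0),
]

model = SentenceTransformer("bert-base-nli-mean-tokens")  # placeholder model
train_examples = [InputExample(texts=[a, b], label=score)
                  for a, b, score in labeled_pairs]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Regress the cosine similarity of the two embeddings onto the 0/1 labels.
train_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1, warmup_steps=100)
```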

My finding so far is that the TF-IDF approach provides much better results most of the time. S-BERT results are very bad when searching directly on the generated vectors (cosine distance).

I also tried TF-IDF retrieval followed by reranking the top 100 with the vectors and found that it didn't bring any improvement (qualitative appreciation, no measurement on this specific setup).
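
The TF-IDF side is roughly the following (a sketch with scikit-learn; corpus, query, and neighbour count are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

corpus = [
    "first document text ...",
    "second document text ...",
    "third document text ...",
]  # placeholder corpus

# Build the sparse TF-IDF matrix and search it directly by cosine distance
# (brute force, no search engine involved).
vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(corpus)
nn = NearestNeighbors(n_neighbors=3, metric="cosine").fit(doc_matrix)

query_vec = vectorizer.transform(["query document text"])
distances, indices = nn.kneighbors(query_vec)  # n_neighbors=100 for top-100 reranking
```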

Regarding measures... the TF-IDF scores on the test set are lower than the S-BERT ones (TF-IDF vs S-BERT):

Spearman: 0.7 vs 0.86
Pearson: 0.8 vs 0.99
Those measures come from scikit and match the ones from the similarity_evaluation_results.csv generated by S-BERT.
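
For anyone reproducing this, both correlations can be computed on (gold label, predicted cosine similarity) pairs, e.g. with scipy.stats (the numbers below are made-up placeholders); the S-BERT evaluator reports the same two quantities in that CSV:

```python
from scipy.stats import pearsonr, spearmanr

gold = [1.0, 0.0, 1.0, 0.0, 1.0]            # gold pair labels (placeholder)
predicted = [0.92, 0.15, 0.71, 0.38, 0.64]  # model cosine similarities (placeholder)

print("pearson: ", pearsonr(gold, predicted)[0])   # linear correlation
print("spearman:", spearmanr(gold, predicted)[0])  # rank correlation
```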

Any idea why? How is this possible? I am sure I am missing something obvious... That's why I am wondering whether nboost may be a good choice.

nreimers commented 4 years ago

Hi, just some comments:

1) S-BERT is (in my opinion) more suitable for short texts like sentences. If your docs get longer (paragraphs or multiple paragraphs), TF-IDF / BM25 makes more sense. BERT has a limit of 512 word-piece tokens (about 300-400 words). Further, fine-tuning BERT on longer texts to produce sentence representations is difficult (you need a lot of memory, GPU resources, and data). So the current SBERT models are capped at 128 word-piece tokens (maybe ~70 words).
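
For illustration, the cap is exposed on the model itself (a sketch; placeholder model name, and the exact attribute location can vary between library versions):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("bert-base-nli-mean-tokens")  # placeholder model
print(model.max_seq_length)  # e.g. 128: word pieces beyond this are truncated

# The cap can be raised (at a higher memory/compute cost) up to BERT's 512-token limit.
model.max_seq_length = 256
```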

2) For triplet loss, the selection of the negative example is quite important. It quickly happens that the negative examples are too easy and the learning of the model stalls. Maybe have a look at BatchHardTripletLoss and MultipleNegativesRankingLoss, which use not just one but many different negatives (see the sketch below). Further, larger batch sizes are quite favourable for these losses, but with large batch sizes you need a lot of GPU memory (32 GB or more).
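
A rough sketch with MultipleNegativesRankingLoss (placeholder model name and data); only positive pairs are needed, because the other examples in the batch serve as negatives:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Placeholder data: only (anchor, positive) pairs, no explicit negatives.
positive_pairs = [
    ("an anchor document", "a semantically related document"),
    ("another anchor", "its related counterpart"),
]

model = SentenceTransformer("bert-base-nli-mean-tokens")  # placeholder model
train_examples = [InputExample(texts=[a, b]) for a, b in positive_pairs]

# Larger batches give more (and harder) in-batch negatives, at the cost of GPU memory.
loader = DataLoader(train_examples, shuffle=True, batch_size=64)
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1)
```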

In your use case, I think reranking with BERT (as it is done in nboost) is the better option.

pommedeterresautee commented 4 years ago

To keep you informed: as advised, I tried to formulate the reranking task as a classification one and got quite good results. I think I will continue that way. Thanks a lot for your advice!

hockeybro12 commented 4 years ago

@pommedeterresautee can you elaborate a bit more on what you did and how it has worked for you? I am having the same issues (good triplet-loss accuracy, but bad results in my evaluation). Did you still use S-BERT with paragraphs? How did you set up your classification task?

pommedeterresautee commented 4 years ago

I don't use S-BERT anymore. I just run a classification task on pairs and sort the candidates by the score for class 1.
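
Roughly, the idea looks like this (a sketch with Hugging Face transformers, not my exact code; the checkpoint below is an untrained placeholder that would first need fine-tuning on your own labeled pairs):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Untrained placeholder: fine-tune this checkpoint on (related / not related)
# pairs before using it for reranking.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)
model.eval()

def rerank(query, candidates):
    """Score each (query, candidate) pair and sort by P(class 1), descending."""
    inputs = tokenizer([query] * len(candidates), candidates,
                       padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)[:, 1]
    return sorted(zip(candidates, probs.tolist()),
                  key=lambda pair: pair[1], reverse=True)
```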

hockeybro12 commented 4 years ago

@pommedeterresautee Are you saying you fine-tuned regular BERT on the classification task of: given a pair of texts, are they semantically related or not?

pommedeterresautee commented 4 years ago

Yes, that is what I meant. However, finding the right negative examples is important, as random examples are too easy and the model gets 100% accuracy after the first epoch.

hockeybro12 commented 4 years ago

@pommedeterresautee Ok, why did you not fine-tune SBERT on that task? And can you share any tips for finding the right negative examples?

pommedeterresautee commented 4 years ago

I trained S-BERT on the very same data, but precomputing representations was not useful in my case. As for how I generated negative examples, I can't share that; it's specific to our data.

hockeybro12 commented 4 years ago

@pommedeterresautee Did you still use paragraphs?