UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

CrossEncoders with TF-IDF #1843

Open hanshupe opened 1 year ago

hanshupe commented 1 year ago

I wonder which approach would be recommended in the following scenario and if there is a way to combine the advantages of TF-IDF and CrossEncoders in a hybrid model?

I want to detect similar text documents and have a small labeled dataset for this task (around 500 positive and 100,000 negative examples). A strong indicator of high similarity is the project codes or names that sometimes appear in the texts (e.g. ADBN, YADSASD, Prometheus project, Blue Bird, ...), which fall outside any default vocabulary. Of course, the content of the texts is relevant too.

I believe those project names could be picked up by TF-IDF, but cannot be leveraged by an SBERT approach. Is this assumption correct? And do you see any way to combine the advantages of both approaches?
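(A quick way to test this assumption is to look at how a typical SBERT tokenizer actually splits such codes; a minimal sketch, assuming the standard bert-base-uncased tokenizer:)

```python
from transformers import AutoTokenizer

# Assumes bert-base-uncased; other SBERT checkpoints use different
# vocabularies, so the exact splits will vary.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

for code in ["ADBN", "YADSASD", "Prometheus project", "Blue Bird"]:
    # Out-of-vocabulary codes are split into WordPiece subtokens rather
    # than dropped, so the model still "sees" them, just not as single units.
    print(code, "->", tokenizer.tokenize(code))
```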

bm777 commented 1 year ago

Keep in mind that applying cross-attention on top of TF-IDF will not be effective, because TF-IDF only works on exact lexical matches. For your case, I propose using a cross-encoder together with a bi-encoder. The bi-encoder supports semantic search (which takes care of synonyms).

The proper question to ask is whether TF-IDF can contribute to the reranking scores in cross-attention. I would recommend a semantic approach (bi-encoder) instead of TF-IDF or BM25.

I used a bi-encoder to retrieve the 100 most relevant passages for the query, then applied cross-attention to rerank them and keep the 25 final passages. I did not invent this; I followed a Facebook Research paper on arXiv.
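A minimal sketch of that two-stage pipeline with sentence-transformers (the model names and the toy corpus are placeholders, not taken from the paper):

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Illustrative checkpoints; swap in whatever fits your domain.
bi_encoder = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

corpus = [
    "Status update for the Blue Bird rollout.",
    "Quarterly report on Prometheus project milestones.",
    "Unrelated memo about office supplies.",
]
query = "Progress on the Prometheus project"

# Stage 1: the bi-encoder retrieves the top-k candidates cheaply.
corpus_emb = bi_encoder.encode(corpus, convert_to_tensor=True)
query_emb = bi_encoder.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_emb, corpus_emb, top_k=100)[0]

# Stage 2: the cross-encoder rescores each (query, passage) pair to rerank.
pairs = [(query, corpus[hit["corpus_id"]]) for hit in hits]
scores = cross_encoder.predict(pairs)
reranked = sorted(zip(scores, pairs), key=lambda x: x[0], reverse=True)[:25]
```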

I hope that I got your point.

hanshupe commented 1 year ago

My point is more about a drawback of bi-encoders and cross-encoders: they are typically based on a pretrained model with a fixed vocabulary. Training them from scratch on a domain-specific dataset would require large amounts of data and resources.

In my case, as explained, I have domain-specific project codes (e.g. ADBN, YADSASD, Prometheus project, Blue Bird, ...) that are highly relevant for detecting similarities between documents. TF-IDF can handle this because its vocabulary is built directly from the corpus, but on the other hand it doesn't understand synonyms or context-based semantics.

Therefore, my question is whether there is any approach to combine SBERT and TF-IDF to leverage the advantages of both.
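For example, I could imagine a simple late-fusion scheme that blends the two similarity matrices with a tunable weight; a rough sketch (the model name, toy documents, and weight are just placeholders):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

docs = [
    "Kickoff notes for the ADBN migration.",
    "Summary of the adbn migration kickoff.",
    "Lunch menu for next week.",
]

# TF-IDF builds its vocabulary from the corpus itself,
# so rare project codes like ADBN are kept as features.
tfidf = TfidfVectorizer().fit_transform(docs)
tfidf_sim = cosine_similarity(tfidf)

# SBERT captures synonyms and context; checkpoint is illustrative.
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(docs)
sbert_sim = cosine_similarity(emb)

# Weighted sum of the two pairwise-similarity matrices; alpha is a
# hyperparameter one could tune on the labeled positive/negative pairs.
alpha = 0.5
hybrid_sim = alpha * sbert_sim + (1 - alpha) * tfidf_sim
print(np.round(hybrid_sim, 2))
```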