Open hanshupe opened 1 year ago
Don't forget that if you want to apply cross-attention with TF-IDF, it will not be efficient, because TF-IDF works only at the lexical level. For your case, I propose using a cross-encoder together with a bi-encoder: the bi-encoder supports semantic search (which takes care of synonyms). The proper question to ask is whether TF-IDF can contribute to the reranking scores in cross-attention. I would recommend a semantic approach (bi-encoder) instead of TF-IDF or BM25.
I used a bi-encoder to retrieve the 100 most relevant passages for the query, then applied a cross-encoder (cross-attention) to rerank them down to the 25 final passages. I did not invent this; I followed a Facebook Research paper on arXiv.
I hope that I got your point.
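The retrieve-then-rerank pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the actual bi-encoder and cross-encoder (e.g. from sentence-transformers) are abstracted behind `encode` and `score_pairs`, which are my own illustrative names.

```python
import numpy as np

def retrieve_then_rerank(query, passages, encode, score_pairs,
                         retrieve_k=100, final_k=25):
    """Two-stage pipeline: a bi-encoder retrieves retrieve_k candidates
    by cosine similarity, then a cross-encoder reranks them to final_k.

    encode(texts)      -> (n, d) array of embeddings (bi-encoder)
    score_pairs(pairs) -> relevance score per (query, passage) pair (cross-encoder)
    Returns the indices of the final_k passages, best first.
    """
    # Stage 1: bi-encoder semantic search over the whole collection
    emb = encode([query] + list(passages))
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = emb[1:] @ emb[0]                     # cosine similarity to the query
    candidates = np.argsort(-sims)[:retrieve_k]

    # Stage 2: cross-encoder scores only the retrieved candidates
    # (cross-attention over each (query, passage) pair is expensive,
    # which is why it runs on 100 candidates rather than the full corpus)
    scores = np.asarray(score_pairs([(query, passages[i]) for i in candidates]))
    order = np.argsort(-scores)[:final_k]
    return [int(candidates[i]) for i in order]
```

The design point is the cost asymmetry: the bi-encoder embeds query and passages independently (cheap, cacheable), while the cross-encoder reads each pair jointly (accurate, but too slow to run over everything), so it is applied only to the shortlist.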
My point is more about the drawback of bi-encoders and cross-encoders: they are typically based on a pretrained model with a fixed vocabulary. Training them from scratch on a domain-specific dataset would require large amounts of data and resources.
In my case, as explained, I have some domain-specific project codes (e.g. ADBN, YADSASD, Prometheus project, Blue Bird, ...), which are highly relevant for detecting similarities between documents. TF-IDF can handle this, because the vocabulary can be built easily, but on the other hand it doesn't understand synonyms or context-based semantics.
Therefore, my question is whether there is any approach to combine SBERT and TF-IDF to leverage the advantages of both.
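One common way to combine the two is late fusion: compute a TF-IDF similarity and a dense (SBERT-style) similarity separately, then take a weighted sum. The sketch below is illustrative, not a canned SBERT feature; `encode` and `alpha` are my own names, with the embedding model abstracted behind `encode` so only scikit-learn is needed here.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def hybrid_similarity(query, docs, encode, alpha=0.5):
    """Weighted combination of lexical and semantic similarity.

    encode(texts) -> (n, d) embedding array (e.g. an SBERT model's encode)
    alpha         -> weight of the lexical (TF-IDF) channel, in [0, 1]
    Returns one combined score per document.
    """
    # Lexical channel: TF-IDF gives high weight to rare tokens,
    # e.g. project codes like ADBN that are outside a pretrained vocabulary
    vec = TfidfVectorizer().fit(list(docs) + [query])
    tfidf_sims = cosine_similarity(vec.transform([query]), vec.transform(docs))[0]

    # Semantic channel: dense embeddings capture synonyms and context
    emb = encode([query] + list(docs))
    dense_sims = cosine_similarity(emb[:1], emb[1:])[0]

    return alpha * tfidf_sims + (1 - alpha) * dense_sims
```

In practice the two score distributions live on different scales, so min-max normalizing each channel before the weighted sum is a common refinement.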
I wonder which approach would be recommended in the following scenario, and whether there is a way to combine the advantages of TF-IDF and cross-encoders in a hybrid model.
I want to detect similar text documents and have a small labeled dataset for this purpose (around 500 positive and 100,000 negative examples). A strong indication of high similarity is the presence of project codes or names in the texts (e.g. ADBN, YADSASD, Prometheus project, Blue Bird, ...), which are outside any default vocabulary. Of course, the content of the texts is relevant too.
I believe those project names could be identified by TF-IDF but cannot be leveraged by an SBERT approach. Is this assumption correct? And do you see any way to combine the advantages of both approaches?
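Given the labeled pairs described above, one option (my suggestion, not something from the thread) is to let the data choose the combination: treat each pair's TF-IDF score and SBERT score as two features and fit a simple classifier on the labels. A minimal scikit-learn sketch, with `class_weight="balanced"` compensating for the 500-vs-100,000 imbalance:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_score_combiner(tfidf_scores, dense_scores, labels):
    """Learn how to weight lexical vs. semantic similarity from labeled pairs.

    tfidf_scores, dense_scores -> one similarity score per labeled pair
    labels                     -> 1 = similar documents, 0 = not similar
    """
    X = np.column_stack([tfidf_scores, dense_scores])
    # 'balanced' reweights classes so 500 positives aren't drowned out
    return LogisticRegression(class_weight="balanced").fit(X, labels)

def combined_score(clf, tfidf_scores, dense_scores):
    """Probability that a pair is similar, given both channel scores."""
    X = np.column_stack([tfidf_scores, dense_scores])
    return clf.predict_proba(X)[:, 1]
```

Compared with a hand-picked fusion weight, the learned coefficients tell you directly how much the lexical channel (project codes) and the semantic channel each contribute on this dataset.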