UKPLab / sentence-transformers

Multilingual Sentence & Image Embeddings with BERT
https://www.SBERT.net
Apache License 2.0

semantic search suggestion finding relevant documents #126

Open manishiitg opened 4 years ago

manishiitg commented 4 years ago

I followed your conversation here, and your explanations really helped me a lot in understanding how to implement semantic search.

https://github.com/huggingface/transformers/issues/876

I wanted your suggestions on this further, if you could help me out.

The problem I am trying to solve is implementing semantic search on a database of candidate resumes. I have a lot of resumes from candidates who apply to job positions, and I want to be able to search them and find relevant candidates effectively.

The approach I am planning to implement is Elasticsearch (BM25), with sentence transformers for re-ranking on top of it.

My questions are: a) Should I train using word embeddings like word2vec (I have a word2vec model custom-trained on a large dataset for this), or use contextual embeddings like BERT or RoBERTa?

b) From what I can tell, sentence transformers are effective when applied to shorter sentences, but a resume is a full document. So should I encode the entire resume into a single vector, or maybe encode it section-wise, or something else?

c) Is BM25 + re-ranking via sentence transformers the best solution for this, or should I do semantic search using Faiss on the full data corpus?

Would be grateful for your help on this.

Thanks

nreimers commented 4 years ago

Hi @manishiitg Personally, I think BM25 is an extremely strong system if done right and with some tuning (things to tune: edge-n-gram indexing, bigram indexing, word normalization, stop-word lists, etc.). For larger documents, it is often far superior to dense vector approaches.
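As an illustration of that kind of tuning, here is a minimal sketch of an Elasticsearch index that adds an edge-n-gram sub-field next to the standard BM25 field. The index name `resumes`, the field names, and the gram sizes are illustrative assumptions, not values from this thread.

```python
# Sketch of BM25-side tuning: an Elasticsearch index with an edge-n-gram
# analyzer on a sub-field of the resume text. Index name, field names, and
# gram sizes are illustrative assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

index_body = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "edge_ngram_tokenizer": {
                    "type": "edge_ngram",
                    "min_gram": 3,
                    "max_gram": 10,
                    "token_chars": ["letter", "digit"],
                }
            },
            "analyzer": {
                "edge_ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "edge_ngram_tokenizer",
                    "filter": ["lowercase"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            # Index the resume text twice: the default analyzer for plain BM25,
            # plus an edge-n-gram sub-field for prefix/partial matches.
            "text": {
                "type": "text",
                "fields": {
                    "edge": {"type": "text", "analyzer": "edge_ngram_analyzer"}
                },
            }
        }
    },
}

es.indices.create(index="resumes", body=index_body)
```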

For dense vector approaches to work well, you often need a lot of training data that matches your use case. However, you seldom have that much training data.

I would recommend c): First try BM25 and get it to work as well as possible.

If you then have training data from your task, you can combine it with a re-ranking approach. For that, I would use a sentence-pair classifier like BERT and not a Sentence Transformer. With BM25 you retrieve, for example, 100 results; then you classify every (query, document) pair with BERT to decide whether it is relevant for your query, and you return the top 10 results to the user.
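A minimal sketch of this retrieve-then-re-rank setup, assuming an Elasticsearch index named `resumes` with a `text` field, and using the library's `CrossEncoder` class as the sentence-pair classifier (the checkpoint name is just one publicly available example):

```python
# Sketch of retrieve-then-re-rank: BM25 (Elasticsearch) retrieves ~100
# candidates, a BERT-based pair classifier scores each (query, document)
# pair, and the top 10 are returned. Index name "resumes", field "text",
# and the cross-encoder checkpoint are assumptions.
from elasticsearch import Elasticsearch
from sentence_transformers import CrossEncoder

es = Elasticsearch("http://localhost:9200")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def search(query: str, top_k: int = 10):
    # Step 1: lexical retrieval with BM25.
    hits = es.search(
        index="resumes",
        body={"query": {"match": {"text": query}}, "size": 100},
    )["hits"]["hits"]
    docs = [hit["_source"]["text"] for hit in hits]

    # Step 2: score every (query, document) pair with the pair classifier.
    scores = reranker.predict([(query, doc) for doc in docs])

    # Step 3: return the highest-scoring documents to the user.
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]
```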

A limitation of BERT is that it only works for documents up to 512 word pieces, which is roughly 400 words. This can be too short for a long document. In that case, you can use e.g. a window approach: break your long document down into shorter chunks (e.g. individual sentences or groups of sentences), then classify all chunks and take the max value as your document prediction.
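A sketch of that window approach, reusing the cross-encoder from above; the naive sentence splitter and the chunk size of 5 sentences are assumptions:

```python
# Sketch of the window approach: split a long resume into chunks of a few
# sentences, score each chunk against the query with the cross-encoder, and
# use the maximum chunk score as the document score. The naive sentence
# splitter and the chunk size of 5 are assumptions.
import re

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def score_long_document(query: str, document: str, sentences_per_chunk: int = 5) -> float:
    # Naive sentence split; any proper sentence tokenizer can be used instead.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    chunks = [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]
    # Each chunk stays well within BERT's 512 word-piece limit.
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    return float(max(scores))
```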

Best Nils Reimers

manishiitg commented 4 years ago

Thanks for your reply. I will definitely try the above, but can BM25, even with the tuning you mentioned, match similar words the way a word2vec model could?

For example, if I search for 'Backend Developer', can BM25 ever match it to 'Nodejs Developer'?

nreimers commented 4 years ago

If you search for 'backend developer', it would match 'nodejs developer' due to the shared word 'developer'. But it would not match people who just have 'nodejs' or 'flask' in their CV.

In theory this sounds bad, but in practice it is often not that bad. The longer your document, the less of an issue this gets.

Another technique that is quite powerful is query expansion. You have your query 'backend developer'. Then you have a model that knows about related terms and generates, for 'backend', the terms 'nodejs' and 'flask'.

You then send 3 queries to your Elasticsearch instance: 'backend developer', 'nodejs developer', 'flask developer'.

You merge the results and present them to the user. Even though the user did not enter 'nodejs', they will see candidates that have nodejs in their CV.

This query expansion is quite extensively done by Google.
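A minimal sketch of this query expansion pattern, assuming an index named `resumes` and a hard-coded expansion table standing in for the model of related terms:

```python
# Sketch of query expansion against Elasticsearch: expand the user query with
# related terms, send one query per expansion, and merge the results.
# The expansion table and the index name "resumes" are illustrative
# assumptions; in practice the related terms would come from a model or a
# curated synonym list.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

RELATED_TERMS = {"backend": ["nodejs", "flask"]}

def expand_query(query: str) -> list:
    queries = [query]
    for term, related in RELATED_TERMS.items():
        if term in query.lower():
            queries += [query.lower().replace(term, r) for r in related]
    # e.g. 'backend developer' -> also 'nodejs developer', 'flask developer'
    return queries

def expanded_search(query: str, size: int = 20):
    merged = {}
    for q in expand_query(query):
        hits = es.search(
            index="resumes",
            body={"query": {"match": {"text": q}}, "size": size},
        )["hits"]["hits"]
        for hit in hits:
            # Keep the best BM25 score per document when merging result lists.
            merged[hit["_id"]] = max(merged.get(hit["_id"], 0.0), hit["_score"])
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)
```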

================================

In general, it depends on what you want from your search. Do you want: 1) High precision and low recall, i.e., the matched results are relevant, but you might miss candidates

OR

2) Low precision but high recall: The matched results contain a lot of garbage, but you do not miss relevant candidates.

The first case is the more commonly desired property for information retrieval, and BM25 is really strong at this. If you need 2) and you can deal with the garbage in the results, then a dense vector representation might be better.

manishiitg commented 4 years ago

Makes sense. Will experiment with different things.

zhenliu2012 commented 4 years ago

Wanted to share an article that I came across: the paper discusses large-scale retrieval using an embedding-based method compared to BM25. https://arxiv.org/abs/2002.03932

aditya-malte commented 3 years ago

Really great insights on JD-CV retrieval @nreimers. However, with:

  1. the Longformer model available, and
  2. a large amount of data (>100k points) of relevance scores,

what would be your opinion today?

P.S. In our case, the problem with using cross-encoders (aka pair classifiers) is the deployment challenges and costs; also, re-ranking can only be run on (say) 100 candidates out of tens of thousands of results.

Thanks in advance