dorianbrown / rank_bm25

A Collection of BM25 Algorithms in Python
Apache License 2.0

Difference between get_batch_scores and get_scores method #10

Open soumya-ranjan-sahoo opened 4 years ago

soumya-ranjan-sahoo commented 4 years ago

Hi Team,

I need your help here. As a brief overview: I have about 500k documents in my corpus and a set of only 7k query-document pairs, and I want to calculate the BM25 score for each of these individual pairs. So far:

  1. I have indexed all 500k documents.
  2. I understand that I can use the get_scores method to score a query against all 500k documents, which returns a 500k-element vector, and then pick out the entry for the document in each query-document pair. For example, for a query paired with the document at index i, the pair's score is bm25score[i]. But this takes ages to compute, so I am looking for a way around it. Can the get_batch_scores method be of any help here? My guess is that it would only index the subset of documents passed to the method, not all 500k documents.

My objective is to index the 500k documents once and then, given a query-document pair, calculate the BM25 score for that pair.

Thanks in advance!

soumya-ranjan-sahoo commented 3 years ago

Could someone kindly help me with this? I want to know how get_batch_scores differs from get_scores.
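
The workflow described in the question (index the whole corpus once, then score only specific query-document pairs rather than all 500k documents per query) can be illustrated with a small standalone sketch. This is not rank_bm25's code: `TinyBM25`, `score_pair`, the toy corpus, and the `k1`/`b` defaults are hypothetical names and values chosen for illustration, and the idf smoothing is one common variant that may differ from what the library uses. It only shows that corpus-wide statistics (idf, average document length) can be computed once, while each pair is scored by touching a single document.

```python
import math
from collections import Counter


class TinyBM25:
    """Standalone Okapi BM25 sketch (hypothetical; not rank_bm25's implementation)."""

    def __init__(self, corpus, k1=1.5, b=0.75):
        # corpus: list of tokenized documents (each a list of str)
        self.k1, self.b = k1, b
        self.doc_len = [len(doc) for doc in corpus]
        self.avgdl = sum(self.doc_len) / len(corpus)
        # Per-document term frequencies, built once at indexing time
        self.doc_freqs = [Counter(doc) for doc in corpus]
        # Document frequency: number of documents containing each term
        n_docs = len(corpus)
        df = Counter(term for freqs in self.doc_freqs for term in freqs)
        # Smoothed idf (assumed variant); always positive
        self.idf = {
            term: math.log(1 + (n_docs - n + 0.5) / (n + 0.5))
            for term, n in df.items()
        }

    def score_pair(self, query, doc_id):
        """Score one (query, document) pair without visiting other documents."""
        freqs = self.doc_freqs[doc_id]
        norm = self.k1 * (1 - self.b + self.b * self.doc_len[doc_id] / self.avgdl)
        score = 0.0
        for term in query:
            tf = freqs.get(term, 0)
            if tf:
                score += self.idf[term] * tf * (self.k1 + 1) / (tf + norm)
        return score


# Index the full corpus once
corpus = [
    "hello there good man".split(),
    "it is quite windy in london".split(),
    "how is the weather today".split(),
]
index = TinyBM25(corpus)

# Each pair is (tokenized query, index of the paired document)
pairs = [
    ("windy london".split(), 1),
    ("weather today".split(), 2),
    ("windy london".split(), 0),
]
pair_scores = [index.score_pair(query, doc_id) for query, doc_id in pairs]
```

The cost per pair here depends on the query length, not on the corpus size, which is the property the question is after. Whether get_batch_scores behaves the same way against an already-built index should be checked against the rank_bm25 source.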