
AssertionError: Elastic-Search Window too large, Max-Size = 10000 #140

Open zuliani99 opened 1 year ago

zuliani99 commented 1 year ago

Using BM25 for the sparse embeddings on a pretty big dataset (e.g. FiQA), I get the following assertion error: AssertionError: Elastic-Search Window too large, Max-Size = 10000

The function that calls BM25 is the following:

from beir.retrieval.search.lexical import BM25Search as BM25

def sparse_embeddings_bm25(dataset_name, corpus, queries, qrels, k_primes):
  '''
  PURPOSE: compute the sparse embeddings using the BM25 implementation from beir and Elasticsearch
  ARGUMENTS:
    - dataset_name: string describing the dataset name
    - corpus: sequence of documents
    - queries: sequence of queries
    - qrels: ground truth of query-document relevance
    - k_primes: list of top-k' values, i.e. numbers of documents to return
  RETURN: see embeddings return values
  '''
  hostname = 'localhost'
  index_name = dataset_name
  initialize = True # Delete any existing index with the same name and reindex all documents

  print(f'{dataset_name} - BM25')
  model = BM25(index_name=index_name, hostname=hostname, initialize=initialize) # BM25 lexical retriever backed by Elasticsearch
  return embeddings('Sparse', model, corpus, queries, qrels, k_primes) # embeddings() is my own helper (not shown here)
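
For reference, this is roughly how I load the data and call the function above — a minimal sketch using beir's standard download and GenericDataLoader utilities; the output directory and the k' values are just placeholders:

from beir import util
from beir.datasets.data_loader import GenericDataLoader

dataset_name = 'fiqa'
url = f'https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset_name}.zip'
data_path = util.download_and_unzip(url, 'datasets') # download and extract the dataset

# corpus: {doc_id: {'title': ..., 'text': ...}}, queries: {query_id: text},
# qrels: {query_id: {doc_id: relevance}}
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split='test')

sparse_embeddings_bm25(dataset_name, corpus, queries, qrels, k_primes=[10, 100, 1000])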

I've already tried creating the index before running BM25 and setting initialize = False, but doing so I still need some way to get the corpus (and the queries) into that index myself.
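
Roughly what I have in mind is the sketch below: create and fill the index myself with the elasticsearch Python client, then hand BM25 the already-populated index with initialize=False. (As far as I understand, only the corpus has to be in the index; the queries are sent at search time.) The field names 'title' and 'txt' are my assumption about what BM25Search expects, so they may need adjusting:

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch('http://localhost:9200')
index_name = 'fiqa'

# Create the index up front (mappings/settings left at the defaults here)
es.indices.create(index=index_name)

# Bulk-index the corpus; 'title' and 'txt' are assumed to be the field names
# that beir's BM25Search reads, double-check against the beir source
actions = (
    {
        '_index': index_name,
        '_id': doc_id,
        'title': doc.get('title', ''),
        'txt': doc.get('text', ''),
    }
    for doc_id, doc in corpus.items()
)
helpers.bulk(es, actions)

# Reuse the already-populated index without re-indexing
model = BM25(index_name=index_name, hostname='localhost', initialize=False)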

Note that I'm running the whole application on Google Colab Pro; I don't know whether that is relevant or not.