Using BM25 for sparse embeddings on a fairly large dataset (e.g. FiQA), I get the following assertion error:
AssertionError: Elastic-Search Window too large, Max-Size = 10000
The function that calls BM25 is the following:
    # Assumed import: BM25 here is taken to be BEIR's BM25Search
    from beir.retrieval.search.lexical import BM25Search as BM25

    def sparse_embeddings_bm25(dataset_name, corpus, queries, qrels, k_primes):
        '''
        PURPOSE: compute the sparse embeddings using the BM25 implementation from BEIR and Elasticsearch
        ARGUMENTS:
            - dataset_name: string describing the dataset name
            - corpus: sequence of documents
            - queries: sequence of queries
            - qrels: ground truth of query-document relevance
            - k_primes: list of numbers of top-k prime documents to return
        RETURN: see the return values of embeddings()
        '''
        hostname = 'localhost'
        index_name = dataset_name
        initialize = True  # Delete existing index with same name and reindex all documents
        print(f'{dataset_name} - BM25')
        model = BM25(index_name=index_name, hostname=hostname, initialize=initialize)  # Defining the BM25
        # embeddings() is my own helper, defined elsewhere
        return embeddings('Sparse', model, corpus, queries, qrels, k_primes)
I've already tried creating the index before running BM25 and setting initialize = False, but then I somehow need to pass the corpus and the queries to the index.
Note that I'm running the whole application in Google Colab Pro; I don't know whether that matters.
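One thing that may be worth checking (a guess, not a confirmed cause): the message mentions a maximum size of 10000, which matches Elasticsearch's default `index.max_result_window`. If the largest value in k_primes is close to or above that, the requested window would exceed the limit. Below is a minimal sketch of capping the k values before the search. The helper name `clamp_k_values` is hypothetical, and the `extra_hits=1` default assumes the search layer asks for one hit more than k.

```python
# Elasticsearch's default value for index.max_result_window.
ES_MAX_RESULT_WINDOW = 10_000


def clamp_k_values(k_values, max_window=ES_MAX_RESULT_WINDOW, extra_hits=1):
    """Cap each k so that k + extra_hits fits inside the result window.

    extra_hits=1 is an assumption: it models a search layer that requests
    one additional hit per query (for example, to drop a self-match).
    Duplicates created by capping are removed; the result is sorted.
    """
    limit = max_window - extra_hits
    return sorted({min(k, limit) for k in k_values})


# Hypothetical k values, one of them larger than the window.
safe_k_primes = clamp_k_values([10, 100, 20_000])
print(safe_k_primes)
```

If the larger k values are really needed, capping is not an option and the limit itself would have to change on the Elasticsearch side, which this sketch does not attempt.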