beir-cellar / beir

A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.
http://beir.ai

Fix bug of retrieving more docs than specified: Use heapq to maintain top-K #117

Closed kwang2049 closed 1 year ago

kwang2049 commented 1 year ago

In the current version of BeIR, exact_search will store many more documents in the results than specified, since the code accumulates the top-k docs for each chunk: https://github.com/beir-cellar/beir/blob/c3334fd5b336dba03c5e3e605a82fcfb1bdf667d/beir/retrieval/search/dense/exact_search.py#L68-L78
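
To make the failure mode concrete, here is a toy illustration (hypothetical names, not the library code): each chunk's top-k hits are merged into the per-query result dict and nothing ever truncates it back down to top_k.

top_k = 2
chunk_topk_hits = [  # per-chunk (doc_id, score) top-k lists for a single query
    [("d1", 0.9), ("d2", 0.8)],
    [("d3", 0.7), ("d4", 0.6)],
    [("d5", 0.5), ("d6", 0.4)],
]

results = {"q1": {}}
for hits in chunk_topk_hits:
    for doc_id, score in hits:
        results["q1"][doc_id] = score  # accumulates; no entry is ever evicted

print(len(results["q1"]))  # 6 == num_chunks * top_k, not top_k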

This becomes a severe issue when the corpus is large, resulting in a huge memory burden, e.g. ~50GB for MS MARCO. One can simulate how much RAM it would take:

import numpy as np
import random
import psutil
import tqdm

msmarco_docs = [str(i) for i in range(8800000)]
corpus_chunk_size = 50000
top_k = 1000
queries = 100
# queries = 7000

ram_gb_before = psutil.virtual_memory().available / 1024 / 1024 / 1024

retrieval_results = {
    str(i): dict(
        zip(
            map(
                str,
                random.sample(
                    msmarco_docs, k=len(msmarco_docs) // corpus_chunk_size * top_k
                ),
            ),
            np.random.rand(len(msmarco_docs) // corpus_chunk_size * top_k).tolist(),
        )
    )
    for i in tqdm.trange(queries)
}

ram_gb_after = psutil.virtual_memory().available / 1024 / 1024 / 1024

print(f"{queries} queries takes {ram_gb_before - ram_gb_after:.4f}GB RAM")
# 100 queries takes 0.6912GB RAM
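# extrapolating linearly, the full ~7000 dev queries would take roughly 48GB RAM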

To fix this, a min-heap is maintained for each query (result), keeping only the top-k docs in an efficient manner.
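
As a self-contained toy sketch of the bookkeeping (toy data and names; the actual PR code iterates over query-chunk score matrices, and uses heapq.heappushpop as in the final version of the fix, see the discussion below), the heap keeps memory at O(top_k) per query:

import heapq

top_k = 3
# toy stream of (score, doc_id) candidates for a single query
candidates = [(0.2, "d1"), (0.9, "d2"), (0.5, "d3"), (0.7, "d4"), (0.1, "d5")]

heap = []  # min-heap of (score, doc_id); the smallest kept score sits at heap[0]
for score, doc_id in candidates:
    if len(heap) < top_k:
        heapq.heappush(heap, (score, doc_id))
    else:
        # push the new candidate, then pop the smallest, so only the
        # best top_k entries ever survive
        heapq.heappushpop(heap, (score, doc_id))

print(sorted(heap, reverse=True))  # [(0.9, 'd2'), (0.7, 'd4'), (0.5, 'd3')]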

The comparison and results can be found in this Colab notebook: https://colab.research.google.com/drive/1fr8cpJXi750FnPS0eYmPvyDq1VEdrgFy?usp=sharing

kwang2049 commented 1 year ago

@thakur-nandan

thakur-nandan commented 1 year ago

Thanks @kwang2049, for the heapq implementation!

Could we check whether we get identical scores on a larger dataset such as TREC-COVID or MS MARCO? I will have this pull request merged over the weekend.

Thanks!

kwang2049 commented 1 year ago

@thakur-nandan thanks for your quick response!

Sure, I have run the check with the script below on MS MARCO (Tesla V100 SXM3 32GB, CUDA 11.6, Python 3.6.13):

from beir import util, LoggingHandler
from beir.retrieval import models
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

import logging
import pathlib, os

#### Just some code to print debug information to stdout
logging.basicConfig(format='%(asctime)s - %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S',
                    level=logging.INFO,
                    handlers=[LoggingHandler()])
#### /print debug information to stdout

#### Download the msmarco.zip dataset and unzip it
dataset = "msmarco"
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset)
out_dir = "datasets"
data_path = util.download_and_unzip(url, out_dir)

#### Provide the data_path where msmarco has been downloaded and unzipped
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="dev")

#### Load the SBERT model and retrieve using cosine-similarity
model = DRES(models.SentenceBERT("msmarco-distilbert-base-tas-b"), batch_size=16)
retriever = EvaluateRetrieval(model, score_function="dot") # or "cos_sim" for cosine similarity
results = retriever.retrieve(corpus, queries)

#### Evaluate your model with NDCG@k, MAP@k, Recall@k and Precision@k where k = [1,3,5,10,100,1000]
print(retriever.evaluate(qrels, results, retriever.k_values))

print("Retrieved #docs per query: ", sum([len(topk) for qid, topk in results.items()]) / len(results))

The current version gives results:

2022-10-06 22:30:21 - Loading Corpus...
...
2022-10-07 02:00:02 - NDCG@1: 0.2205
2022-10-07 02:00:02 - NDCG@3: 0.3307
2022-10-07 02:00:02 - NDCG@5: 0.3694
2022-10-07 02:00:02 - NDCG@10: 0.4076
2022-10-07 02:00:02 - NDCG@100: 0.4650
2022-10-07 02:00:02 - NDCG@1000: 0.4773
2022-10-07 02:00:02 - 

2022-10-07 02:00:02 - MAP@1: 0.2151
2022-10-07 02:00:02 - MAP@3: 0.3011
2022-10-07 02:00:02 - MAP@5: 0.3229
2022-10-07 02:00:02 - MAP@10: 0.3390
2022-10-07 02:00:02 - MAP@100: 0.3509
2022-10-07 02:00:02 - MAP@1000: 0.3514
2022-10-07 02:00:02 - 

2022-10-07 02:00:02 - Recall@1: 0.2151
2022-10-07 02:00:02 - Recall@3: 0.4102
2022-10-07 02:00:02 - Recall@5: 0.5030
2022-10-07 02:00:02 - Recall@10: 0.6187
2022-10-07 02:00:02 - Recall@100: 0.8840
2022-10-07 02:00:02 - Recall@1000: 0.9771
2022-10-07 02:00:02 - 

2022-10-07 02:00:02 - P@1: 0.2205
2022-10-07 02:00:02 - P@3: 0.1415
2022-10-07 02:00:02 - P@5: 0.1045
2022-10-07 02:00:02 - P@10: 0.0645
2022-10-07 02:00:02 - P@100: 0.0093
2022-10-07 02:00:02 - P@1000: 0.0010
Retrieved #docs per query:  177176.97779369628

And the fix gives results:

2022-10-06 22:32:32 - Loading Corpus...
...
2022-10-07 01:36:02 - NDCG@1: 0.2205
2022-10-07 01:36:02 - NDCG@3: 0.3307
2022-10-07 01:36:02 - NDCG@5: 0.3694
2022-10-07 01:36:02 - NDCG@10: 0.4076
2022-10-07 01:36:02 - NDCG@100: 0.4650
2022-10-07 01:36:02 - NDCG@1000: 0.4773
2022-10-07 01:36:02 - 

2022-10-07 01:36:02 - MAP@1: 0.2151
2022-10-07 01:36:02 - MAP@3: 0.3011
2022-10-07 01:36:02 - MAP@5: 0.3229
2022-10-07 01:36:02 - MAP@10: 0.3390
2022-10-07 01:36:02 - MAP@100: 0.3509
2022-10-07 01:36:02 - MAP@1000: 0.3514
2022-10-07 01:36:02 - 

2022-10-07 01:36:02 - Recall@1: 0.2151
2022-10-07 01:36:02 - Recall@3: 0.4102
2022-10-07 01:36:02 - Recall@5: 0.5030
2022-10-07 01:36:02 - Recall@10: 0.6187
2022-10-07 01:36:02 - Recall@100: 0.8840
2022-10-07 01:36:02 - Recall@1000: 0.9771
2022-10-07 01:36:02 - 

2022-10-07 01:36:02 - P@1: 0.2205
2022-10-07 01:36:02 - P@3: 0.1415
2022-10-07 01:36:02 - P@5: 0.1045
2022-10-07 01:36:02 - P@10: 0.0645
2022-10-07 01:36:02 - P@100: 0.0093
2022-10-07 01:36:02 - P@1000: 0.0010
Retrieved #docs per query:  1000.0

thakur-nandan commented 1 year ago

Thanks for computing the scores. The changes look good to me! Thanks for the heapq implementation; I think it definitely helps a lot with memory.

I will integrate it over the weekend. Thanks!

nreimers commented 1 year ago

I raised some concerns about the implementation here: https://github.com/UKPLab/sentence-transformers/pull/1715, where it might lead to wrong results at the end of the heap.

kwang2049 commented 1 year ago

> I raised some concerns about the implementation here: UKPLab/sentence-transformers#1715, where it might lead to wrong results at the end of the heap.

Oh, I just saw a mismatch at the end of the rankings in https://colab.research.google.com/drive/1ibA6hjfXKsl97L1wA_FlT1HnVFhnRw0L?usp=sharing. Let me find out what happened there.

Current version: [{'corpus_id': 1833, 'score': 0.768330454826355}, {'corpus_id': 766, 'score': 0.7678950428962708}, {'corpus_id': 2736, 'score': 0.75567227602005}, {'corpus_id': 973, 'score': 0.7547798752784729}, {'corpus_id': 611, 'score': 0.7537626028060913}, {'corpus_id': 1922, 'score': 0.7512373328208923}, {'corpus_id': 2660, 'score': 0.7499988079071045}, {'corpus_id': 1101, 'score': 0.7471907734870911}, {'corpus_id': 5145, 'score': 0.7467103004455566}, {'corpus_id': 1067, 'score': 0.7465646862983704}]

Fix: [{'corpus_id': 1833, 'score': 0.768330454826355}, {'corpus_id': 766, 'score': 0.7678950428962708}, {'corpus_id': 2736, 'score': 0.75567227602005}, {'corpus_id': 973, 'score': 0.7547798752784729}, {'corpus_id': 611, 'score': 0.7537626028060913}, {'corpus_id': 1922, 'score': 0.7512373328208923}, {'corpus_id': 2660, 'score': 0.7499988079071045}, {'corpus_id': 1101, 'score': 0.7471907734870911}, {'corpus_id': 5145, 'score': 0.7467103004455566}, {'corpus_id': 5112, 'score': 0.7241520285606384}]

kwang2049 commented 1 year ago

Thanks to @nreimers's eagle eye 🦅! I have found the issue: one should use heapq.heappushpop instead of heapq.heapreplace once the heap is already full of top-k docs. The difference is that heapq.heapreplace is effectively a "heappop-then-push": it pops the smallest element first and then pushes the new one, so a new element enters the heap even when its score is below the current minimum, wrongly evicting a valid top-k doc. heapq.heappushpop, on the other hand, pushes the new element first and then pops the smallest, so the new element is kept only if it beats the current minimum. See the official heapq documentation.
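
A minimal illustration of the difference (toy scores, not the PR code), with a min-heap that is already full:

import heapq

heap = [1, 2, 3]  # min-heap already holding the current top-3 scores

h1 = heap[:]
heapq.heapreplace(h1, 0)  # pops the old minimum 1 first, then pushes 0
print(h1)                 # [0, 2, 3] -- 0 wrongly enters the top-k

h2 = heap[:]
heapq.heappushpop(h2, 0)  # pushes 0, then pops the smallest (0 itself)
print(h2)                 # [1, 2, 3] -- the top-k is preserved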

I have updated the code and also the Colab notebook (linked in the PR description). Please have a look: 2a3ec6c