beir-cellar / beir

A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.
http://beir.ai
Apache License 2.0

Different results using DenseRetrievalExactSearch & DenseRetrievalParallelExactSearch #104

Open Muennighoff opened 2 years ago

Muennighoff commented 2 years ago

I'm using DenseRetrievalExactSearch & DenseRetrievalParallelExactSearch with https://huggingface.co/sentence-transformers/sentence-t5-xxl on SciFact and getting different results. I'm using them in mteb, which just wraps them here, so I think this is an issue with BEIR. Did anyone confirm that they produce the same results? cc @NouamaneTazi
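For reference, this is roughly how the two searchers get invoked (a minimal sketch against the plain BEIR API rather than my actual mteb wrapper; the dataset path, batch size, and DRPES arguments are assumptions):

from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES
from beir.retrieval.search.dense import DenseRetrievalParallelExactSearch as DRPES

# Assumes SciFact has already been downloaded and unzipped to datasets/scifact
corpus, queries, qrels = GenericDataLoader("datasets/scifact").load(split="test")
model = models.SentenceBERT("sentence-transformers/sentence-t5-xxl")

for search_cls in (DRES, DRPES):
    retriever = EvaluateRetrieval(search_cls(model, batch_size=16), score_function="cos_sim")
    results = retriever.retrieve(corpus, queries)
    # NDCG / MAP / Recall / Precision at the usual k values
    print(search_cls.__name__, retriever.evaluate(qrels, results, retriever.k_values))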

DenseRetrievalExactSearch:

{
  "test": {
    "evaluation_time": 787.86,
    "map_at_1": 0.41606,
    "map_at_10": 0.50777,
    "map_at_100": 0.51611,
    "map_at_1000": 0.51655,
    "map_at_3": 0.47967,
    "map_at_5": 0.49714,
    "ndcg_at_1": 0.44,
    "ndcg_at_10": 0.5538,
    "ndcg_at_100": 0.59487,
    "ndcg_at_1000": 0.60719,
    "ndcg_at_3": 0.50475,
    "ndcg_at_5": 0.5294,
    "precision_at_1": 0.44,
    "precision_at_10": 0.078,
    "precision_at_100": 0.01003,
    "precision_at_1000": 0.00111,
    "precision_at_3": 0.20333,
    "precision_at_5": 0.13933,
    "recall_at_1": 0.41606,
    "recall_at_10": 0.68456,
    "recall_at_100": 0.881,
    "recall_at_1000": 0.98,
    "recall_at_3": 0.55044,
    "recall_at_5": 0.61194
  }
}

DenseRetrievalParallelExactSearch:

{
  "test": {
    "evaluation_time": 171.44,
    "map_at_1": 0.37106,
    "map_at_10": 0.45417,
    "map_at_100": 0.46099,
    "map_at_1000": 0.46138,
    "map_at_3": 0.43122,
    "map_at_5": 0.4442,
    "ndcg_at_1": 0.39333,
    "ndcg_at_10": 0.49823,
    "ndcg_at_100": 0.53358,
    "ndcg_at_1000": 0.54726,
    "ndcg_at_3": 0.45594,
    "ndcg_at_5": 0.47503,
    "precision_at_1": 0.39333,
    "precision_at_10": 0.06967,
    "precision_at_100": 0.00887,
    "precision_at_1000": 0.00101,
    "precision_at_3": 0.18444,
    "precision_at_5": 0.12333,
    "recall_at_1": 0.37106,
    "recall_at_10": 0.61889,
    "recall_at_100": 0.78967,
    "recall_at_1000": 0.90383,
    "recall_at_3": 0.50033,
    "recall_at_5": 0.54961
  }
}
NouamaneTazi commented 2 years ago

Yes, I did confirm their equivalence a while ago for some tasks only. It could be the difference in the number of final results we return:

cc @thakur-nandan

thakur-nandan commented 2 years ago

Hi @Muennighoff and @NouamaneTazi, thanks for pointing out the issue.

In theory, both are exact search methods, so the results should be identical. I'm surprised we are getting a difference between the results. @NouamaneTazi, for which datasets did we experiment and confirm equivalence?

Can we check the top-10 documents for a few sample queries and see the overlap between DRES and DRPES?
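Something along these lines would do (just a sketch; results_dres and results_drpes stand for the {query_id: {doc_id: score}} dicts returned by retrieve):

def top_k_ids(results, query_id, k=10):
    # Sort one query's doc->score mapping and keep the k best doc ids
    ranked = sorted(results[query_id].items(), key=lambda x: x[1], reverse=True)
    return {doc_id for doc_id, _ in ranked[:k]}

for query_id in list(results_dres)[:5]:  # a few sample queries
    overlap = top_k_ids(results_dres, query_id) & top_k_ids(results_drpes, query_id)
    print(query_id, "overlap@10:", len(overlap))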

Kind regards, Nandan Thakur

Muennighoff commented 2 years ago

> Yes, I did confirm their equivalence a while ago for some tasks only. It could be the difference in the number of final results we return:
>
> cc @thakur-nandan

I ran a test adding this code to DRES, which reduces the results to top_k + 1 after each batch, so at the end it's exactly top_k + 1 like in DRPES. It produced the same results as without the modification, but DRPES still produces different results. SciFact with komninos (average_word_embeddings_komninos) on 2 GPUs:
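(Roughly, the truncation looked like this; a reconstruction operating on DRES's accumulated results dict, not the exact diff:)

import heapq

def truncate_results(results, top_k):
    # Keep only the (top_k + 1) highest-scoring docs per query after each
    # corpus chunk, mirroring what DRPES ends up with per query.
    for query_id, doc_scores in results.items():
        if len(doc_scores) > top_k + 1:
            best = heapq.nlargest(top_k + 1, doc_scores.items(), key=lambda x: x[1])
            results[query_id] = dict(best)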

DRES, with or without the modification:

{
  "test": {
    "evaluation_time": 4.5,
    "map_at_1": 0.18667,
    "map_at_10": 0.25737,
    "map_at_100": 0.26597,
    "map_at_1000": 0.26716,
    "map_at_3": 0.23662,
    "map_at_5": 0.25047,
    "ndcg_at_1": 0.2,
    "ndcg_at_10": 0.29526,
    "ndcg_at_100": 0.34256,
    "ndcg_at_1000": 0.3779,
    "ndcg_at_3": 0.25674,
    "ndcg_at_5": 0.28073,
    "precision_at_1": 0.2,
    "precision_at_10": 0.04433,
    "precision_at_100": 0.00703,
    "precision_at_1000": 0.00102,
    "precision_at_3": 0.10778,
    "precision_at_5": 0.07867,
    "recall_at_1": 0.18667,
    "recall_at_10": 0.39944,
    "recall_at_100": 0.63361,
    "recall_at_1000": 0.91622,
    "recall_at_3": 0.30194,
    "recall_at_5": 0.35806
  }
}

DRPES:

{
  "test": {
    "evaluation_time": 15.18,
    "map_at_1": 0.18,
    "map_at_10": 0.24946,
    "map_at_100": 0.25776,
    "map_at_1000": 0.25898,
    "map_at_3": 0.22995,
    "map_at_5": 0.24255,
    "ndcg_at_1": 0.19333,
    "ndcg_at_10": 0.28628,
    "ndcg_at_100": 0.33218,
    "ndcg_at_1000": 0.36891,
    "ndcg_at_3": 0.25007,
    "ndcg_at_5": 0.27175,
    "precision_at_1": 0.19333,
    "precision_at_10": 0.043,
    "precision_at_100": 0.00683,
    "precision_at_1000": 0.00102,
    "precision_at_3": 0.10556,
    "precision_at_5": 0.076,
    "recall_at_1": 0.18,
    "recall_at_10": 0.38778,
    "recall_at_100": 0.61528,
    "recall_at_1000": 0.90956,
    "recall_at_3": 0.29528,
    "recall_at_5": 0.34639
  }
}
NouamaneTazi commented 2 years ago

Interesting. Can you share the code you used for testing, @Muennighoff, and your environment? On my side, I get matching results between DRES and DRPES when running this code:

import logging

from mteb import MTEB
from sentence_transformers import SentenceTransformer

logging.basicConfig(level=logging.INFO)

if __name__ == '__main__':
    # Small word-embedding model so the comparison runs quickly
    model_name = "average_word_embeddings_komninos"
    model = SentenceTransformer(model_name)
    evaluation = MTEB(tasks="SciFact")
    # corpus_chunk_size=50000 exceeds SciFact's corpus size, so the corpus is scored in a single chunk
    evaluation.run(model, output_folder=None, eval_splits=["test"], corpus_chunk_size=50000)
(Output using DRPES)
INFO:root:NDCG@1: 0.2000
INFO:root:NDCG@3: 0.2567
INFO:root:NDCG@5: 0.2807
INFO:root:NDCG@10: 0.2953
INFO:root:NDCG@100: 0.3426
INFO:root:NDCG@1000: 0.3779
INFO:root:

INFO:root:MAP@1: 0.1867
INFO:root:MAP@3: 0.2366
INFO:root:MAP@5: 0.2505
INFO:root:MAP@10: 0.2574
INFO:root:MAP@100: 0.2660
INFO:root:MAP@1000: 0.2672
INFO:root:

INFO:root:Recall@1: 0.1867
INFO:root:Recall@3: 0.3019
INFO:root:Recall@5: 0.3581
INFO:root:Recall@10: 0.3994
INFO:root:Recall@100: 0.6336
INFO:root:Recall@1000: 0.9162
INFO:root:

INFO:root:P@1: 0.2000
INFO:root:P@3: 0.1078
INFO:root:P@5: 0.0787
INFO:root:P@10: 0.0443
INFO:root:P@100: 0.0070
INFO:root:P@1000: 0.0010
INFO:mteb.evaluation.MTEB:Evaluation for SciFact on test took 44.32 seconds
INFO:mteb.evaluation.MTEB:Scores: {'ndcg_at_1': 0.2, 'ndcg_at_3': 0.25674, 'ndcg_at_5': 0.28073, 'ndcg_at_10': 0.29526, 'ndcg_at_100': 0.34256, 'ndcg_at_1000': 0.3779, 'map_at_1': 0.18667, 'map_at_3': 0.23662, 'map_at_5': 0.25047, 'map_at_10': 0.25737, 'map_at_100': 0.26597, 'map_at_1000': 0.26716, 'recall_at_1': 0.18667, 'recall_at_3': 0.30194, 'recall_at_5': 0.35806, 'recall_at_10': 0.39944, 'recall_at_100': 0.63361, 'recall_at_1000': 0.91622, 'precision_at_1': 0.2, 'precision_at_3': 0.10778, 'precision_at_5': 0.07867, 'precision_at_10': 0.04433, 'precision_at_100': 0.00703, 'precision_at_1000': 0.00102, 'evaluation_time': 44.32}
--DONE--
(Output using DRES)

Time taken to retrieve: 2.40 seconds
INFO:root:

INFO:root:NDCG@1: 0.2000
INFO:root:NDCG@3: 0.2567
INFO:root:NDCG@5: 0.2807
INFO:root:NDCG@10: 0.2953
INFO:root:NDCG@100: 0.3426
INFO:root:NDCG@1000: 0.3779
INFO:root:

INFO:root:MAP@1: 0.1867
INFO:root:MAP@3: 0.2366
INFO:root:MAP@5: 0.2505
INFO:root:MAP@10: 0.2574
INFO:root:MAP@100: 0.2660
INFO:root:MAP@1000: 0.2672
INFO:root:

INFO:root:Recall@1: 0.1867
INFO:root:Recall@3: 0.3019
INFO:root:Recall@5: 0.3581
INFO:root:Recall@10: 0.3994
INFO:root:Recall@100: 0.6336
INFO:root:Recall@1000: 0.9162
INFO:root:

INFO:root:P@1: 0.2000
INFO:root:P@3: 0.1078
INFO:root:P@5: 0.0787
INFO:root:P@10: 0.0443
INFO:root:P@100: 0.0070
INFO:root:P@1000: 0.0010
INFO:mteb.evaluation.MTEB:Evaluation for SciFact on test took 2.55 seconds
INFO:mteb.evaluation.MTEB:Scores: {'ndcg_at_1': 0.2, 'ndcg_at_3': 0.25674, 'ndcg_at_5': 0.28073, 'ndcg_at_10': 0.29526, 'ndcg_at_100': 0.34256, 'ndcg_at_1000': 0.3779, 'map_at_1': 0.18667, 'map_at_3': 0.23662, 'map_at_5': 0.25047, 'map_at_10': 0.25737, 'map_at_100': 0.26597, 'map_at_1000': 0.26716, 'recall_at_1': 0.18667, 'recall_at_3': 0.30194, 'recall_at_5': 0.35806, 'recall_at_10': 0.39944, 'recall_at_100': 0.63361, 'recall_at_1000': 0.91622, 'precision_at_1': 0.2, 'precision_at_3': 0.10778, 'precision_at_5': 0.07867, 'precision_at_10': 0.04433, 'precision_at_100': 0.00703, 'precision_at_1000': 0.00102, 'evaluation_time': 2.55}
--DONE--
NouamaneTazi commented 2 years ago

And after further investigation, it does seem that the DRPES results depend on corpus_chunk_size.

For example, using:

evaluation.run(model, output_folder=None, eval_splits=["test"])

would set corpus_chunk_size=260, which greatly influences the final results:

(Output using DRPES with the default corpus_chunk_size=None)
Time taken to retrieve: 11.89 seconds
INFO:root:

INFO:root:NDCG@1: 0.2000
INFO:root:NDCG@3: 0.2567
INFO:root:NDCG@5: 0.2807
INFO:root:NDCG@10: 0.2947
INFO:root:NDCG@100: 0.3419
INFO:root:NDCG@1000: 0.3773
INFO:root:

INFO:root:MAP@1: 0.1867
INFO:root:MAP@3: 0.2366
INFO:root:MAP@5: 0.2505
INFO:root:MAP@10: 0.2570
INFO:root:MAP@100: 0.2656
INFO:root:MAP@1000: 0.2668
INFO:root:

INFO:root:Recall@1: 0.1867
INFO:root:Recall@3: 0.3019
INFO:root:Recall@5: 0.3581
INFO:root:Recall@10: 0.3978
INFO:root:Recall@100: 0.6319
INFO:root:Recall@1000: 0.9146
INFO:root:

INFO:root:P@1: 0.2000
INFO:root:P@3: 0.1078
INFO:root:P@5: 0.0787
INFO:root:P@10: 0.0440
INFO:root:P@100: 0.0070
INFO:root:P@1000: 0.0010
INFO:mteb.evaluation.MTEB:Evaluation for SciFact on test took 12.02 seconds
INFO:mteb.evaluation.MTEB:Scores: {'ndcg_at_1': 0.2, 'ndcg_at_3': 0.25674, 'ndcg_at_5': 0.28073, 'ndcg_at_10': 0.29465, 'ndcg_at_100': 0.34194, 'ndcg_at_1000': 0.37729, 'map_at_1': 0.18667, 'map_at_3': 0.23662, 'map_at_5': 0.25047, 'map_at_10': 0.257, 'map_at_100': 0.2656, 'map_at_1000': 0.26679, 'recall_at_1': 0.18667, 'recall_at_3': 0.30194, 'recall_at_5': 0.35806, 'recall_at_10': 0.39778, 'recall_at_100': 0.63194, 'recall_at_1000': 0.91456, 'precision_at_1': 0.2, 'precision_at_3': 0.10778, 'precision_at_5': 0.07867, 'precision_at_10': 0.044, 'precision_at_100': 0.007, 'precision_at_1000': 0.00102, 'evaluation_time': 12.02}
--DONE--

This is what I meant, @thakur-nandan, when I said I got some small differences between DRES and DRPES. I thought it was because of how trec handled the number of documents we return per query, but apparently it's just because I was experimenting with a high corpus_chunk_size=50000.
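In principle, exact search should be invariant to corpus_chunk_size as long as at least top_k candidates are kept per chunk and then merged globally by score; here is the invariant I'd expect, as a toy sketch (illustrative only, not the DRPES code):

import numpy as np

def chunked_top_k(query_emb, corpus_emb, top_k, chunk_size):
    # Exact top-k over a corpus processed in chunks: keep top_k candidates per
    # chunk, then merge all candidates and re-rank globally by score.
    candidates = []  # (score, corpus_index)
    for start in range(0, len(corpus_emb), chunk_size):
        chunk = corpus_emb[start:start + chunk_size]
        scores = chunk @ query_emb  # dot product (cosine if embeddings are normalized)
        for i in np.argsort(-scores)[:top_k]:
            candidates.append((scores[i], start + i))
    candidates.sort(key=lambda x: x[0], reverse=True)
    return candidates[:top_k]

rng = np.random.default_rng(0)
corpus_emb = rng.normal(size=(1000, 16))
query_emb = rng.normal(size=16)
# The chunk size must not change which documents are retrieved
ids_a = [idx for _, idx in chunked_top_k(query_emb, corpus_emb, 10, 50)]
ids_b = [idx for _, idx in chunked_top_k(query_emb, corpus_emb, 10, 260)]
assert ids_a == ids_b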

NouamaneTazi commented 2 years ago

I'll try to debug this weird behaviour and make DRPES more stable. But for now, please refrain from using it for your experiments, @Muennighoff. Thank you for raising the issue.

Muennighoff commented 2 years ago

Yeah, I was leaving corpus_chunk_size as the default. The code is here.

Thanks for investigating it!

Muennighoff commented 2 years ago

> I'll try to debug this weird behaviour and make DRPES more stable. But for now, please refrain from using it for your experiments, @Muennighoff. Thank you for raising the issue.

Do you already have an update on this? 😇

jxmorris12 commented 6 months ago

@Muennighoff @NouamaneTazi was this fixed in #107?

Muennighoff commented 5 months ago

I think so; if you use mteb, this issue is definitely no longer present, as DenseRetrievalParallelExactSearch has been removed in favor of doing the parallelism on the modeling side, e.g. something like https://github.com/ContextualAI/gritlm/blob/b89fdefa18731f1aa1d6111c3849c1e4c811b9d6/gritlm/gritlm.py#L75
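For anyone landing here later: with the parallelism on the modeling side, a single DRES pass is enough. E.g. sentence-transformers can spread the encoding over several GPUs with its multi-process pool (a sketch; the device list, batch size, and example texts are placeholders):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/sentence-t5-xxl")
sentences = ["example document one", "example document two"]  # corpus or query texts

# Encode across multiple GPUs; retrieval/scoring itself stays single-process
pool = model.start_multi_process_pool(target_devices=["cuda:0", "cuda:1"])
embeddings = model.encode_multi_process(sentences, pool, batch_size=32)
model.stop_multi_process_pool(pool)
print(embeddings.shape)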