cvangysel / pytrec_eval

pytrec_eval is an Information Retrieval evaluation tool for Python, based on the popular trec_eval.
http://ilps.science.uva.nl/
MIT License

Pytrec scores are not consistent in colab #34

Closed sadakmed closed 3 years ago

sadakmed commented 3 years ago

The same file and the same code give different results on my laptop than on Google Colab (the metrics are P_1, P_3, recip_rank, map and ndcg). I checked the precision (running everything as np.float64), but that is not the issue.

Any suggestions?

seanmacavaney commented 3 years ago

This is concerning. It'd be helpful to know a bit more about the problem: in particular, which measures, qrels, and run you're using, and details about both environments.

seanmacavaney commented 3 years ago

@sadakmed the CI build indicates that:

1) For at least the particular combination of measures/qrels/runs tested, they match expected values at least up to 7 figures on Windows, MacOS, and Linux.
2) The results are consistent with the original trec_eval software at least up to 3 decimal places (the software only reports 4) on Windows, MacOS, and Linux.

So it seems like it may be something about your particular qrels/runs/measures? The additional information requested in the previous comment is essential for getting to the bottom of this.

sadakmed commented 3 years ago

Hi @seanmacavaney, the same evaluate script and the same scores file (shared in the gist linked below) give the following on Colab:

system                                 recip_rank      P_1      P_3      map     ndcg
initial_quora-distilbert-multilingual      0.8833   0.8000   0.6333   0.4417   0.6054

and on my laptop:

system                                 recip_rank      P_1      P_3      map     ndcg
initial_quora-distilbert-multilingual      0.7533   0.7000   0.5833   0.4693   0.5785

The laptop runs Ubuntu 18.04.1:

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  2
Core(s) per socket:  2
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               142
Model name:          Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz
Stepping:            9
CPU MHz:             2809.915
CPU max MHz:         3900.0000
CPU min MHz:         400.0000
BogoMIPS:            5799.77
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            4096K
NUMA node0 CPU(s):   0-3

Colab also runs Ubuntu 18.04.5.

The gist below contains the evaluate script, the ground-truth gold.tsv, and the prediction scores.tsv. The command to run is:

evaluate -g gold.tsv -s scores.tsv

https://gist.github.com/sadakmed/06d631cf6e25754738676d7ba1ea3ae8
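
For context, a minimal sketch of what such an evaluation could look like with pytrec_eval directly; the TSV column layout and the file names are assumptions for illustration, not taken from the gist:

import collections
import csv

import pytrec_eval


def read_tsv(path):
    # Assumed layout (not taken from the gist): query_id <tab> doc_id <tab> value,
    # where value is a relevance label in gold.tsv and a model score in scores.tsv.
    data = collections.defaultdict(dict)
    with open(path) as f:
        for query_id, doc_id, value in csv.reader(f, delimiter='\t'):
            data[query_id][doc_id] = float(value)
    return dict(data)


qrels = {q: {d: int(v) for d, v in docs.items()} for q, docs in read_tsv('gold.tsv').items()}
run = read_tsv('scores.tsv')

# recip_rank, map and ndcg are plain trec_eval measure names; the precision
# cutoffs used in this thread (P_1, P_3) need parameterized measures, whose
# exact spelling depends on the pytrec_eval version.
evaluator = pytrec_eval.RelevanceEvaluator(qrels, {'recip_rank', 'map', 'ndcg'})
per_query = evaluator.evaluate(run)

for measure in ('recip_rank', 'map', 'ndcg'):
    mean = sum(scores[measure] for scores in per_query.values()) / len(per_query)
    print(f'{measure}: {mean:.4f}')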

seanmacavaney commented 3 years ago

Thanks for the details -- they help a lot.

I wasn't able to reproduce the behavior you mentioned. In particular, the Colab version didn't give the results listed above: https://gist.github.com/seanmacavaney/511d16e2f39d212c7bff56b0068b8b72

When using the original trec_eval, I get the same results as you got on your laptop, with the exception of P_3; it looks like you're using a different formulation of that metric than trec_eval (by requiring at least 3 relevant docs for a query to be counted). When accounting for that, the numbers appear to be correct.
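
For reference, trec_eval's P_3 always divides by the cutoff (3), regardless of how many relevant documents the query has; a small illustrative helper (hypothetical, not taken from either codebase):

def p_at_3(ranked_doc_ids, relevant_doc_ids):
    # trec_eval-style P_3: relevant documents among the top 3 retrieved, divided by 3.
    top = ranked_doc_ids[:3]
    return sum(1 for doc_id in top if doc_id in relevant_doc_ids) / 3.0

# A formulation that only scores queries with at least 3 relevant documents
# (as described above) averages over a different set of queries, which can
# explain a mismatch on P_3 while the other measures agree.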

Can you try running the colab gist I sent above to see if you still get different results?

Edit: to be clear, I always got the following results, and they appear to line up with what trec_eval gives:

recip_rank      P_1      P_3      map     ndcg
    0.7533   0.7000   0.5833   0.4693   0.5785

sadakmed commented 3 years ago

After a lot of trials, I think the problem, strangely, was the notebook itself. I moved everything to a new notebook and the results were fine. I also tried other files, and that original notebook consistently gave different results.

seanmacavaney commented 3 years ago

Interesting, thanks for the update.

sadakmed commented 3 years ago

@seanmacavaney I know this is not related to pytrec_eval; however, I want to use trec_eval, but I couldn't find any resources on how to structure my ground-truth and scores files for it.

seanmacavaney commented 3 years ago

It's the TREC qrels format detailed here: https://trec.nist.gov/data/qrels_eng/.

Essentially:

[query_id] 0 [doc_id] [relevance_score]

with the fields separated by spaces or tabs (the second column is unused and is conventionally 0).
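
For illustration, a minimal sketch of reading a file in that format into the nested-dict structure pytrec_eval expects; the file name and helper are hypothetical:

import collections

import pytrec_eval


def load_qrels(path):
    # Each line: [query_id] 0 [doc_id] [relevance_score], whitespace-separated.
    qrels = collections.defaultdict(dict)
    with open(path) as f:
        for line in f:
            query_id, _iteration, doc_id, relevance = line.split()
            qrels[query_id][doc_id] = int(relevance)
    return dict(qrels)


qrels = load_qrels('qrels.txt')  # hypothetical path
evaluator = pytrec_eval.RelevanceEvaluator(qrels, {'map', 'ndcg'})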

sadakmed commented 3 years ago

@seanmacavaney this issue appeared randomly (I also couldn't reproduce it afterwards). In the same notebook that had the problem, trec_eval was giving the expected results, so my suggestion is that the problem is in the interaction between the Colab notebook and pytrec_eval.