This is concerning. It'd be helpful to know a bit more about the problem: does `trec_eval` on the particular platform have the same problem? (I believe the CI already makes sure the results are the same as `trec_eval`, but maybe not in every situation.)

@sadakmed the CI build indicates that:
1) For at least the particular combination of measures/qrels/runs tested, they match expected values at least up to 7 figures on Windows, MacOS, and Linux.
2) The results are consistent with the original `trec_eval` software at least up to 3 decimal places (the software only reports 4) on Windows, MacOS, and Linux.
So it seems like it may be something about your particular qrels/runs/measures? The additional information requested in the previous comment is essential for getting to the bottom of this.
Hi @seanmacavaney, with the same `evaluate` script and the same scores file (shared in the gist link below), on Colab it gave:

| system | recip_rank | P_1 | P_3 | map | ndcg |
|---|---|---|---|---|---|
| initial_quora-distilbert-multilingual | 0.8833 | 0.8000 | 0.6333 | 0.4417 | 0.6054 |
On my laptop:

| system | recip_rank | P_1 | P_3 | map | ndcg |
|---|---|---|---|---|---|
| initial_quora-distilbert-multilingual | 0.7533 | 0.7000 | 0.5833 | 0.4693 | 0.5785 |
The laptop runs Ubuntu 18.04.01:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 142
Model name: Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz
Stepping: 9
CPU MHz: 2809.915
CPU max MHz: 3900.0000
CPU min MHz: 400.0000
BogoMIPS: 5799.77
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 4096K
NUMA node0 CPU(s): 0-3
Colab also runs Ubuntu 18.04.05.
In this gist there are the `evaluate` script, the ground truth `gold.tsv`, and the predictions `scores.tsv`. The command to run is:

`evaluate -g gold.tsv -s scores.tsv`
https://gist.github.com/sadakmed/06d631cf6e25754738676d7ba1ea3ae8
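(For context, this is roughly what an `evaluate` script along these lines could look like with pytrec_eval. It is a minimal sketch, not the actual script from the gist: the TSV column layout and the `-g`/`-s` argument handling are assumptions, and only a few measures are requested for brevity.)

```python
# Hypothetical sketch of an evaluate script; the real script in the gist may differ.
# Assumed layout (tab-separated, no header):
#   gold.tsv   -> query_id, doc_id, relevance (int)
#   scores.tsv -> query_id, doc_id, score (float)
import argparse
import csv
from collections import defaultdict

import pytrec_eval


def read_tsv(path, cast):
    data = defaultdict(dict)
    with open(path) as f:
        for query_id, doc_id, value in csv.reader(f, delimiter='\t'):
            data[query_id][doc_id] = cast(value)
    return data


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('-g', '--gold', required=True)
    parser.add_argument('-s', '--scores', required=True)
    args = parser.parse_args()

    qrels = read_tsv(args.gold, int)    # query_id -> {doc_id: relevance}
    run = read_tsv(args.scores, float)  # query_id -> {doc_id: score}

    evaluator = pytrec_eval.RelevanceEvaluator(qrels, {'recip_rank', 'map', 'ndcg'})
    per_query = evaluator.evaluate(run)

    # Average each measure over all evaluated queries.
    for measure in sorted(next(iter(per_query.values()))):
        mean = sum(q[measure] for q in per_query.values()) / len(per_query)
        print(f'{measure}\t{mean:.4f}')


if __name__ == '__main__':
    main()
```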
Thanks for the details -- they help a lot.
I wasn't able to reproduce the behavior you mentioned. In particular, the Colab version didn't give the results listed above: https://gist.github.com/seanmacavaney/511d16e2f39d212c7bff56b0068b8b72
When using the original `trec_eval`, I get the same results as you got on your laptop, with the exception of `P_3`; but it looks like you're using a different formulation of that metric than `trec_eval` (by requiring at least 3 relevant docs to be counted). When accounting for that, the numbers appear to be correct.
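(To illustrate the difference, here is a rough sketch of the two `P_3` formulations; the "min-relevant" variant is only my reading of "requiring at least 3 relevant docs to be counted", not the exact code from the gist.)

```python
def p_at_3(ranked_doc_ids, relevant_ids):
    # trec_eval-style P_3: fraction of the top-3 retrieved docs that are relevant.
    return sum(doc in relevant_ids for doc in ranked_doc_ids[:3]) / 3


def p_at_3_min_rel(ranked_doc_ids, relevant_ids):
    # Hypothetical variant: only score queries that have at least 3 relevant docs;
    # the remaining queries are dropped from the average, which shifts the mean P_3.
    if len(relevant_ids) < 3:
        return None
    return sum(doc in relevant_ids for doc in ranked_doc_ids[:3]) / 3
```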
Can you try running the colab gist I sent above to see if you still get different results?
Edit: to be clear, I always got the following results, and they appear to line up with what `trec_eval` gives:

| recip_rank | P_1 | P_3 | map | ndcg |
|---|---|---|---|---|
| 0.7533 | 0.7000 | 0.5833 | 0.4693 | 0.5785 |
After a lot of trials, I think the problem, strangely, was the notebook itself. I moved everything to a new one and things were good. I also played with other files, and that notebook consistently gave different results.
Interesting -- thanks for the update.
@seanmacavaney I know this is not related to pytrec_eval; however, I want to use `trec_eval`, yet I couldn't find any resources on how to structure my ground-truth and scores files to use them with `trec_eval`.
It's the TREC qrels format detailed here: https://trec.nist.gov/data/qrels_eng/.
Essentially:
[query_id] 0 [doc_id] [relevance_score]
with fields separated by spaces or tabs.
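For example, with hypothetical query/doc IDs and filenames, a qrels file and a matching run file in the standard TREC formats look like this (the run format is `[query_id] Q0 [doc_id] [rank] [score] [run_name]`; the `#` lines below are just annotations, not part of the files):

```
# gold.qrels: query_id 0 doc_id relevance
q1 0 d1 1
q1 0 d7 0
q2 0 d3 2

# run.txt: query_id Q0 doc_id rank score run_name
q1 Q0 d1 1 12.5 my_system
q1 Q0 d7 2 7.1 my_system
q2 Q0 d3 1 9.8 my_system
```

You can then run, for example, `trec_eval -m map -m ndcg gold.qrels run.txt` (or `-m all_trec` for all measures).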
@seanmacavaney this issue occurred randomly (I also couldn't reproduce it). In the same notebook that has the problem, `trec_eval` was giving the expected results, so my suggestion is that the problem is in the interaction between the Colab notebook and pytrec_eval.

The same files and the same code give different results on my laptop than on Google Colab (metrics are P_1, P_3, recip_rank, map, and ndcg). I did check the precision (running everything with np.float64), yet that's not the issue.

Any suggestions?