This is concerning. It'd be helpful to know a bit more about the problem: does `trec_eval` on the particular platform have the same problem? (I believe the CI already makes sure the results are the same as `trec_eval`, but maybe not in every situation.)

@sadakmed the CI build indicates that:
1) For at least the particular combination of measures/qrels/runs tested, they match expected values at least up to 7 figures on Windows, MacOS, and Linux.
2) The results are consistent with the original `trec_eval` software at least up to 3 decimal places (the software only reports 4) on Windows, MacOS, and Linux.
So it seems like it may be something about your particular qrels/runs/measures? The additional information requested in the previous comment is essential for getting to the bottom of this.
Hi @seanmacavaney, with the same `evaluate` script and the same scores file (shared in the gist link below), on Colab it gave:

| system | recip_rank | P_1 | P_3 | map | ndcg |
|---|---|---|---|---|---|
| initial_quora-distilbert-multilingual | 0.8833 | 0.8000 | 0.6333 | 0.4417 | 0.6054 |
On my laptop:

| system | recip_rank | P_1 | P_3 | map | ndcg |
|---|---|---|---|---|---|
| initial_quora-distilbert-multilingual | 0.7533 | 0.7000 | 0.5833 | 0.4693 | 0.5785 |
The laptop runs Ubuntu 18.04.01:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 142
Model name: Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz
Stepping: 9
CPU MHz: 2809.915
CPU max MHz: 3900.0000
CPU min MHz: 400.0000
BogoMIPS: 5799.77
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 4096K
NUMA node0 CPU(s): 0-3
Colab also runs Ubuntu 18.04.05.
In this gist there are the `evaluate` script, the ground truth `gold.tsv`, and the predictions `scores.tsv`. The command to run is:

`evaluate -g gold.tsv -s scores.tsv`
https://gist.github.com/sadakmed/06d631cf6e25754738676d7ba1ea3ae8
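(For context, this is roughly what an `evaluate` script along these lines could look like with pytrec_eval. It is a minimal sketch, not the actual script from the gist: the TSV column layout and the `-g`/`-s` argument handling are assumptions, and only a few measures are requested for brevity.)

```python
# Hypothetical sketch of an evaluate script; the real script in the gist may differ.
# Assumed layout (tab-separated, no header):
#   gold.tsv   -> query_id, doc_id, relevance (int)
#   scores.tsv -> query_id, doc_id, score (float)
import argparse
import csv
from collections import defaultdict

import pytrec_eval


def read_tsv(path, cast):
    data = defaultdict(dict)
    with open(path) as f:
        for query_id, doc_id, value in csv.reader(f, delimiter='\t'):
            data[query_id][doc_id] = cast(value)
    return data


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('-g', '--gold', required=True)
    parser.add_argument('-s', '--scores', required=True)
    args = parser.parse_args()

    qrels = read_tsv(args.gold, int)    # query_id -> {doc_id: relevance}
    run = read_tsv(args.scores, float)  # query_id -> {doc_id: score}

    evaluator = pytrec_eval.RelevanceEvaluator(qrels, {'recip_rank', 'map', 'ndcg'})
    per_query = evaluator.evaluate(run)

    # Average each measure over all evaluated queries.
    for measure in sorted(next(iter(per_query.values()))):
        mean = sum(q[measure] for q in per_query.values()) / len(per_query)
        print(f'{measure}\t{mean:.4f}')


if __name__ == '__main__':
    main()
```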
Thanks for the details -- they help a lot.
I wasn't able to reproduce the behavior you mentioned. In particular, the Colab version didn't give the results listed above: https://gist.github.com/seanmacavaney/511d16e2f39d212c7bff56b0068b8b72
When using the original `trec_eval`, I get the same results as you got on your laptop, with the exception of `P_3`; but it looks like you're using a different formulation of that metric than `trec_eval` (by requiring at least 3 relevant docs to be counted). When accounting for that, the numbers appear to be correct.
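(To illustrate the difference, here is a rough sketch of the two `P_3` formulations; the "min-relevant" variant is only my reading of "requiring at least 3 relevant docs to be counted", not the exact code from the gist.)

```python
def p_at_3(ranked_doc_ids, relevant_ids):
    # trec_eval-style P_3: fraction of the top-3 retrieved docs that are relevant.
    return sum(doc in relevant_ids for doc in ranked_doc_ids[:3]) / 3


def p_at_3_min_rel(ranked_doc_ids, relevant_ids):
    # Hypothetical variant: only score queries that have at least 3 relevant docs;
    # the remaining queries are dropped from the average, which shifts the mean P_3.
    if len(relevant_ids) < 3:
        return None
    return sum(doc in relevant_ids for doc in ranked_doc_ids[:3]) / 3
```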
Can you try running the colab gist I sent above to see if you still get different results?
Edit: to be clear, I always got the following results, and they appear to line up with what `trec_eval` gives:

| recip_rank | P_1 | P_3 | map | ndcg |
|---|---|---|---|---|
| 0.7533 | 0.7000 | 0.5833 | 0.4693 | 0.5785 |
After a lot of trials, I think the problem, strangely, was the notebook itself. I moved everything to a new one and things were good. I also played with other files, and that notebook consistently gave different results.
Interesting -- thanks for the update.
@seanmacavaney I know this is not related to pytrec_eval; however, I want to use `trec_eval`, yet I couldn't find any resources on how to structure my ground-truth and scores files to use them with `trec_eval`.
It's the TREC qrels format detailed here: https://trec.nist.gov/data/qrels_eng/.
Essentially:
[query_id] 0 [doc_id] [relevance_score]
with fields separated by spaces or tabs.
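For example, with hypothetical query/doc IDs and filenames, a qrels file and a matching run file in the standard TREC formats look like this (the run format is `[query_id] Q0 [doc_id] [rank] [score] [run_name]`; the `#` lines below are just annotations, not part of the files):

```
# gold.qrels: query_id 0 doc_id relevance
q1 0 d1 1
q1 0 d7 0
q2 0 d3 2

# run.txt: query_id Q0 doc_id rank score run_name
q1 Q0 d1 1 12.5 my_system
q1 Q0 d7 2 7.1 my_system
q2 Q0 d3 1 9.8 my_system
```

You can then run, for example, `trec_eval -m map -m ndcg gold.qrels run.txt` (or `-m all_trec` for all measures).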
@seanmacavaney this issue occurred randomly (I also couldn't reproduce it). In the same notebook that has the problem, `trec_eval` was giving the expected results, so my suggestion is that the problem is in the interaction between the Colab notebook and pytrec_eval.

The same files and the same code give different results on my laptop than on Google Colab (metrics are P_1, P_3, recip_rank, map, and ndcg). I did check the precision (running everything with np.float64), yet that's not the issue.

Any suggestions?