Ok, this should help: The simple dict example produces it too:
from ranx import Qrels, Run, evaluate

qrels_dict = {
    "q_1": {"d_12": 5, "d_25": 3},
    "q_2": {"d_11": 6, "d_22": 1},
}
run_dict = {
    "q_1": {"d_12": 0.9, "d_23": 0.8, "d_25": 0.7,
            "d_36": 0.6, "d_32": 0.5, "d_35": 0.4},
    "q_2": {"d_12": 0.9, "d_11": 0.8, "d_25": 0.7,
            "d_36": 0.6, "d_22": 0.5, "d_35": 0.4},
}

qrels = Qrels(qrels_dict)
run = Run(run_dict)
print(evaluate(qrels, run, "bpref"))
Hi, this is not a bug. Bpref requires you to also provide known non-relevant docs in the Qrels; otherwise, it cannot be computed. You can find the definition here. To indicate that a doc is not relevant, you can add its ID to the Qrels with a relevance score <= 0. Since judged non-relevant docs are often not available, bpref is rarely used in practice. I should probably add an error message for this, btw.
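For example, here is a minimal sketch based on the dict example above. The two entries with relevance 0 (d_36 for q_1, d_35 for q_2) are hypothetical judged non-relevant documents added purely for illustration, assuming ranx accepts scores <= 0 as described above; they are not part of the original data:

from ranx import Qrels, Run, evaluate

# Same qrels as in the reproduction, plus one judged NON-relevant doc
# per query (relevance 0), so bpref has judged irrelevant docs to compare against.
qrels_dict = {
    "q_1": {"d_12": 5, "d_25": 3, "d_36": 0},  # d_36: hypothetical non-relevant judgment
    "q_2": {"d_11": 6, "d_22": 1, "d_35": 0},  # d_35: hypothetical non-relevant judgment
}
run_dict = {
    "q_1": {"d_12": 0.9, "d_23": 0.8, "d_25": 0.7,
            "d_36": 0.6, "d_32": 0.5, "d_35": 0.4},
    "q_2": {"d_12": 0.9, "d_11": 0.8, "d_25": 0.7,
            "d_36": 0.6, "d_22": 0.5, "d_35": 0.4},
}

# Per the comment above, bpref should now be computable instead of NaN.
print(evaluate(Qrels(qrels_dict), Run(run_dict), "bpref"))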
I think you've misinterpreted the description of bpref. The description notes that it is intended for use when qrels are incomplete, not when they are complete and have a -1 for non-relevant documents. See https://zipfslaw.org/2016/02/19/trec_eval-calculating-scores-for-evaluation-of-information-retrieval/, which quotes Ellen Voorhees, the TREC project manager: "There are some measures such as bpref that ignore unjudged (i.e., missing in qrels) documents." Also, the trec_eval tool itself produces results for the data above.
Not being critical - I'd like to use your library, as it has some nice features that are not present in, say, pytrec_eval. But we do have queries that match no documents, as well as queries that match irrelevant documents, so I am currently limited to using pytrec_eval or native trec_eval.
From the NIST documentation (https://trec.nist.gov/pubs/trec16/appendices/measures.pdf):

3.1 bpref. The bpref measure is designed for situations where relevance judgments are known to be far from complete. It was introduced in the TREC 2005 terabyte track. bpref computes a preference relation of whether judged relevant documents are retrieved ahead of judged irrelevant documents. Thus, it is based on the relative ranks of judged documents only. The bpref measure is defined as

$$\text{bpref} = \frac{1}{R} \sum_{r} \left( 1 - \frac{|\text{n ranked higher than } r|}{\min(R, N)} \right)$$

where R is the number of judged relevant documents, N is the number of judged irrelevant documents, r is a relevant retrieved document, and n is a member of the first R irrelevant retrieved documents. Note that this definition of bpref is different from that which is commonly cited, and follows the actual implementation in trec_eval version 8.0; see the file bpref_bug in the trec_eval distribution for details. Bpref can be thought of as the inverse of the fraction of judged irrelevant documents that are retrieved before relevant ones. Bpref and mean average precision are very highly correlated when used with complete judgments. But when judgments are incomplete, rankings of systems by bpref still correlate highly to the original ranking, whereas rankings of systems by MAP do not.
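To make the quoted definition concrete, here is a minimal per-query sketch that follows the formula literally. It is illustrative only, not ranx's or trec_eval's actual code, and returning NaN when there are no judged non-relevant documents (so min(R, N) = 0) is exactly the edge case being debated in this thread:

import math

def bpref(qrels, run):
    # qrels: doc_id -> graded relevance (<= 0 means judged non-relevant)
    # run:   doc_id -> retrieval score
    relevant = {d for d, rel in qrels.items() if rel > 0}
    nonrelevant = {d for d, rel in qrels.items() if rel <= 0}
    R, N = len(relevant), len(nonrelevant)
    if R == 0 or N == 0:
        # min(R, N) == 0: the formula's denominator vanishes.
        return math.nan
    ranking = sorted(run, key=run.get, reverse=True)
    total = 0.0
    nonrel_above = 0  # judged non-relevant docs seen so far in the ranking
    for doc in ranking:
        if doc in nonrelevant:
            nonrel_above += 1
        elif doc in relevant:
            # "n is a member of the first R irrelevant retrieved documents"
            total += 1.0 - min(nonrel_above, R) / min(R, N)
    return total / R

Run on q_1 from the dict example above, this sketch returns NaN, since that query has no judged non-relevant documents.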
@acoulson2000 sorry for the late reply.
I never said you need complete relevance judgements. From my understanding of the formula, the issue is that, if you have no known irrelevant documents, bpref is either not computable or always one, which makes its usage quite questionable. As I already suggested, you should take a look at the formula. I may be wrong, but if you want me to change something, you should provide examples and reasoning about what's wrong with the current implementation. NaN is raised because, with no known irrelevant documents, you are dividing by zero here: $\frac{|\text{n ranked higher than r}|}{\min(R, N)}$. N is zero, thus the min is zero.
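Concretely, in the dict example above, q_1 has judged relevant docs d_12 and d_25 and no judged non-relevant docs, so $R = 2$, $N = 0$, and $\min(R, N) = 0$; every per-document term $1 - \frac{|\text{n ranked higher than } r|}{\min(R, N)}$ is then a division by zero. The same holds for q_2, hence the NaN.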
My Bpref implementation should be the same as TREC Eval's (or at least equal to the TREC Eval version that was available when I coded it).
Digging into the TREC Eval code isn't something I have time for. I switched to pytrec_eval (the trec_eval wrapper), which works with my data as well as with the sample I posted above.
Not really sure what more information to provide.
I'm new to ranx (and trec_eval, in general).
I'm loading qrels and results trec files into a Qrels and a Run, then simply doing:

measurements = str(evaluate(
    qrels, run,
    ["mrr", "bpref", "precision@6", "precision@10", "precision@25",
     "ndcg", "ndcg@6", "recall@6", "recall@10", "recall@25", "f1@6"],
    make_comparable=True,
))
All other measurements seem to produce valid results: {'mrr': 1.0, 'bpref': nan, 'precision@6': 0.9197177726926011, 'precision@10': 0.9105263157894737, 'precision@25': 0.36421052631578954, 'ndcg': 0.9916407040006012, 'ndcg@6': 0.9795395744834079, 'recall@6': 0.6364284806218444, 'recall@10': 1.0, 'recall@25': 1.0, 'f1@6': 0.7101066299578885}
I'm attaching the files for reference. run-top1000-072324.zip