NTMC-Community / MatchZoo

Facilitating the design, comparison and sharing of deep text matching models.

How can NDCG be calculated for WikiQA? #109

Closed aneesh-joshi closed 6 years ago

aneesh-joshi commented 6 years ago

Hi, I'm going through the MatchZoo code for getting ndcg@k values for various models on WikiQA. I can't really understand how WikiQA can be used with NDCG at all!

For example, for a given query q1, there are several candidate documents d1, d2, d3. The y_true values are [1, 0, 0]. The model will provide a y_pred, which will be the respective dot products of the query with each of the documents: y_pred = [-0.96715796, 0.9667174, 0.97257483].

From just this data, how is it even possible to infer the ranks that our system has given? If we treat the y_pred values as relative ranking scores, then the ranking is d3, d2, d1.

dcg@3 would be rel_3 + rel_2/log2(3) + rel_1/log2(4), and the ideal dcg@3 would also be the same, i.e., rel_3 + rel_2/log2(3) + rel_1/log2(4),

and thus ndcg@3 would be (rel_3 + rel_2/log2(3) + rel_1/log2(4)) / (rel_3 + rel_2/log2(3) + rel_1/log2(4)) = 1, always.

Moreover, we never made use of the true relevance, i.e., y_true. Overall, this doesn't make any sense to me.

I would be really grateful if someone could help me make sense of this. @faneshion @yangliuy @bwanglzu

aneesh-joshi commented 6 years ago

Here is an actual instance from evaluating the WikiQA code:

ndcg@3
y_true[pre:suf] is [1 0 0]
y_pred is [-0.96715796, 0.9667174, 0.97257483]
result: ndcg@3 = 0.5

Here, our model has reached an extremely incorrect conclusion: it gives document 1 a similarity of -0.96, whereas y_true says it is the only correct answer. Yet the ndcg score is as high as 0.5.
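For reference, here is a minimal sketch of how I believe the 0.5 arises under the standard NDCG definition (gains taken from y_true, positions taken from the ranking induced by y_pred). This is just my reading of the metric, not necessarily MatchZoo's exact implementation, and the helper name ndcg_at_k is mine:

```python
import numpy as np

def ndcg_at_k(y_true, y_pred, k):
    """Standard NDCG@k: rank documents by y_pred, take gains from y_true."""
    order = np.argsort(y_pred)[::-1][:k]                 # predicted ranking
    gains = np.asarray(y_true)[order]
    dcg = np.sum(gains / np.log2(np.arange(2, gains.size + 2)))

    ideal = np.sort(y_true)[::-1][:k]                    # best possible ranking
    idcg = np.sum(ideal / np.log2(np.arange(2, ideal.size + 2)))
    return dcg / idcg if idcg > 0 else 0.0

# The only relevant document is ranked last of three, so
# DCG@3 = 1 / log2(4) = 0.5 while IDCG@3 = 1 / log2(2) = 1.
print(ndcg_at_k([1, 0, 0], [-0.96715796, 0.9667174, 0.97257483], k=3))  # 0.5
```

So the 0.5 is not arbitrary: it reflects the relevant document being pushed to the bottom of a three-document list.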

aneesh-joshi commented 6 years ago

Another example:

ndcg@5
y_true[pre:suf] is [1 0 0 0 0 0]
y_pred is [0.9705702, -0.9754324, 0.9737232, 0.8897103, 0.9523903, 0.9374584]
ndcg@5 is 0.6309297535714574
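Running the same sketch from my previous comment on this example reproduces the reported number as well: the relevant document lands at predicted rank 2 out of six, so DCG@5 = 1/log2(3) while IDCG@5 = 1.

```python
# Reuses the hypothetical ndcg_at_k helper sketched above.
print(ndcg_at_k([1, 0, 0, 0, 0, 0],
                [0.9705702, -0.9754324, 0.9737232, 0.8897103, 0.9523903, 0.9374584],
                k=5))  # 0.6309297535714574, i.e. 1 / log2(3)
```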
bwanglzu commented 6 years ago

@aneesh-joshi Basically, you're evaluating clickthrough data that is represented as 1 (relevant) or 0 (not relevant), based on the assumption that clicked = relevant. So Precision, Recall, AP, mAP, F1, MRR (mean reciprocal rank), and X@k are useful for this situation (because in this case we only know which documents are relevant, not the degree of relevance).

For non-binary relevance measurement, we use nDCG to evaluate the degree of relevance or usefulness (the cumulative gain) of each document. So it's not meaningful to use nDCG to evaluate binary labels (because the information gain always equals 1 or 0).
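As a rough illustration (standard textbook definitions, not MatchZoo's own implementation; the helper names are mine), binary metrics such as AP and reciprocal rank score the first example in this thread more harshly than nDCG does:

```python
import numpy as np

def average_precision(y_true, y_pred):
    """AP for binary labels: average of precision@rank over the ranks of relevant docs."""
    labels = np.asarray(y_true)[np.argsort(y_pred)[::-1]]
    ranks = np.where(labels == 1)[0] + 1          # 1-based ranks of relevant docs
    if ranks.size == 0:
        return 0.0
    return float(np.mean(np.cumsum(labels)[ranks - 1] / ranks))

def reciprocal_rank(y_true, y_pred):
    """RR for binary labels: 1 / rank of the first relevant document."""
    labels = np.asarray(y_true)[np.argsort(y_pred)[::-1]]
    hits = np.where(labels == 1)[0]
    return 1.0 / (hits[0] + 1) if hits.size else 0.0

# The relevant document sits at rank 3, so AP = RR = 1/3, versus nDCG@3 = 0.5.
print(average_precision([1, 0, 0], [-0.96715796, 0.9667174, 0.97257483]))  # 0.333...
print(reciprocal_rank([1, 0, 0], [-0.96715796, 0.9667174, 0.97257483]))    # 0.333...
```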

I'd say let's keep the evaluation framework unified at the moment, since the user of MatchZoo should clearly understand which evaluation metric makes sense for his/her specific situation.

Instead of a random blog post, I would suggest you read the chapter on evaluation metrics in the irbook written by Christopher Manning, since it's more formal and easy to understand.

aneesh-joshi commented 6 years ago

For the reference of others: @bwanglzu is replying to a comment I later deleted, which cited this site:

> MAP is a metric for binary feedback only, while NDCG can be used in any case where you can assign relevance score to a recommended item (binary, integer or real).

For clarification, @bwanglzu, are you agreeing that nDCG cannot be used for binary relevance labels like those of the WikiQA dataset?

Also, are MAP, precision, etc. better measures than nDCG?

bwanglzu commented 6 years ago

@aneesh-joshi I'd say it can be evaluated with nDCG, but it's not meaningful or representative.

For a binary retrieval problem (i.e., the labels are 0 or 1, indicating relevant or not), we use P, AP, MAP, etc.

For a non-binary retrieval problem (i.e., the labels are real numbers greater than or equal to 0, indicating the "relatedness"), we use nDCG.

In this situation, clearly, mAP@k is a better metric than nDCG@k.

In some cases, for a non-binary retrieval problem, we use both nDCG and mAP, as in the Microsoft Learning to Rank challenge (link).

aneesh-joshi commented 6 years ago

Thank you very much for that clarification! I've been racking my brain for days trying to figure out how nDCG is used here.

What about the QuoraQP dataset? Since it's either relevant or not, i.e., there are only 2 relevance levels, would nDCG be applicable there, or are MAP, P@k, etc. still better?

Also, I don't understand your reasoning for not removing it:

> since the user of MatchZoo should clearly understand which evaluation metric makes sense for his/her specific situation.

Doesn't it lead to more people getting confused, like I did?

bwanglzu commented 6 years ago

The previous metrics are all information retrieval metrics. The output should be a sorted list (for the non-binary retrieval situation), where each value represents how likely this document is to be related to this query compared to the rest of the documents (not independent). Input: query & list of documents to be ranked -> Output: sorted list of relevance scores -> Evaluate: nDCG.

For the Quora Question Pair duplicate detection problem, our output is a probability in [0, 1] indicating the degree of duplication. More importantly, the outputs have no internal relationship (they are independent). Input: list of question pairs -> Output: list of probabilities -> Evaluation: machine learning evaluation metrics such as cross-entropy loss. Also, you can turn QuoraQP into a binary classification problem using a softmax or logistic function, then use accuracy to evaluate.
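A minimal sketch of what that kind of evaluation could look like, using made-up probabilities and labels and standard scikit-learn metrics rather than anything MatchZoo-specific:

```python
import numpy as np
from sklearn.metrics import accuracy_score, log_loss

# Hypothetical model outputs: probability that each question pair is a duplicate.
y_true = np.array([1, 0, 1, 1, 0])
y_prob = np.array([0.91, 0.12, 0.47, 0.78, 0.33])

# Cross-entropy (log loss) scores the probabilities directly.
print("log loss:", log_loss(y_true, y_prob))

# Thresholding at 0.5 turns the output into a binary classification decision,
# which can then be scored with accuracy.
y_pred = (y_prob >= 0.5).astype(int)
print("accuracy:", accuracy_score(y_true, y_pred))
```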

To summarize, QuoraQP is, in my view, a completely different task (a machine learning task instead of an information retrieval task), so we use ML evaluation metrics.

I suggest you read the README.md in MatchZoo; these two tasks are in MatchZoo simply because they have a similar input structure, i.e., pair-wise inputs (query-document & question1-question2).