According to the code, evaluate() generates scores for the negative items, then ranks them and takes the top k items for later evaluation.
However, in map_mrr_ndcg() and precision_recall_ndcg_at_k(), the variable hits is computed by checking whether each of those ranked negative items appears in the test data. If the given negative item set does not contain a test item, hits will be 0 regardless of how the model ranks the candidates. This seems wrong and significantly distorts the evaluation results, especially when only 100 negative samples are drawn at random.
The measured performance of a model then depends heavily on whether it is lucky enough that its negative samples happen to include test items.
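
To make the problem concrete, here is a minimal sketch of the behavior described above. It is not the repository's actual code: hits_at_k, the item count, the test item id and the scores are all hypothetical, and I assume a single held-out test item that the model scores higher than everything else. The only point is to show that hits stays at 0 when the ranked candidates are just the sampled negatives, and becomes 1 as soon as the test item is added to the candidate list.

```python
import random


def hits_at_k(scores, candidates, test_items, k):
    """Count how many of the top-k ranked candidates are in test_items."""
    ranked = sorted(candidates, key=lambda item: scores[item], reverse=True)
    return sum(1 for item in ranked[:k] if item in test_items)


rng = random.Random(0)
n_items = 10_000   # hypothetical catalogue size
test_item = 42     # hypothetical held-out item for one user

# Hypothetical model scores: the test item gets the highest score of all,
# i.e. the model is "perfect" for this user.
scores = {i: rng.random() for i in range(n_items)}
scores[test_item] = 2.0

# 100 random negatives that happen not to contain the test item
# (assumption: this mirrors the unlucky case described above).
negatives = rng.sample([i for i in range(n_items) if i != test_item], 100)

# Ranking only the negatives: the test item can never be hit.
hits_negatives_only = hits_at_k(scores, negatives, {test_item}, k=10)

# Ranking the negatives plus the test item: the hit is counted.
hits_with_test_item = hits_at_k(scores, negatives + [test_item], {test_item}, k=10)

print(hits_negatives_only, hits_with_test_item)
```

If the 100 candidates were drawn uniformly without replacement from all N items, a given test item would land among them with probability 100/N, so for a large catalogue most users would get hits = 0 no matter how good the model is. Adding each user's test items to the candidate list before ranking, as in the second call, would be one possible way to avoid this.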