Open MartinLichtblau opened 5 years ago
Thanks for the feedback and interest in this work. I am familiar with both of the papers you cited.
For WebTrack, we report ERR@20 and nDCG@20. For Robust04, we report nDCG@20 and P@20. Although MAP is commonly used for evaluation, it makes some unrealistic assumptions about user behavior [1], so we decided to focus on these measures, which we feel adequately describe the performance of our approach (especially given the limited space in a short paper). That being said, others have asked to see MAP results [2], so we are considering including them in an extended version.
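For readers comparing the measures mentioned above, here is a minimal sketch of how P@k, nDCG@k, ERR@k, and average precision (the per-query value that MAP averages) are commonly computed for a single ranked list. It is an illustration only, not the evaluation tooling used for the paper. The function names are hypothetical, the nDCG gain is assumed to be the exponential form `2^rel - 1`, and the ideal ranking is assumed to be built from the labels in the list itself rather than from the full set of judged documents.

```python
import math

def precision_at_k(rels, k):
    # Fraction of the top-k results with a positive relevance label.
    return sum(1 for r in rels[:k] if r > 0) / k

def average_precision(rels, num_relevant=None):
    # Mean of precision values at the ranks of relevant documents.
    # num_relevant: total relevant documents for the query; if omitted,
    # we assume (for illustration) that all of them appear in rels.
    hits, total = 0, 0.0
    for rank, r in enumerate(rels, start=1):
        if r > 0:
            hits += 1
            total += hits / rank
    n = num_relevant if num_relevant is not None else hits
    return total / n if n else 0.0

def dcg_at_k(rels, k):
    # Exponential gain with a log2 rank discount (one common variant).
    return sum((2 ** r - 1) / math.log2(i + 1)
               for i, r in enumerate(rels[:k], start=1))

def ndcg_at_k(rels, k):
    # Assumption: the ideal ordering is taken from the given labels only.
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal else 0.0

def err_at_k(rels, k, max_grade):
    # Expected reciprocal rank under a cascade model: the user stops at
    # rank r with probability (2^rel - 1) / 2^max_grade.
    err, p_continue = 0.0, 1.0
    for rank, r in enumerate(rels[:k], start=1):
        p_stop = (2 ** r - 1) / (2 ** max_grade)
        err += p_continue * p_stop / rank
        p_continue *= 1.0 - p_stop
    return err

# Hypothetical ranked list with binary labels (1 = relevant).
ranking = [1, 0, 1, 0]
scores = {
    "P@2": precision_at_k(ranking, 2),
    "AP": average_precision(ranking),
    "nDCG@4": ndcg_at_k(ranking, 4),
    "ERR@4": err_at_k(ranking, 4, max_grade=1),
}
```

MAP is then the mean of `average_precision` over all queries. Note that P@k, nDCG@k, and ERR@k only look at the top of the ranking, whereas average precision depends on the rank of every relevant document.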
Let me know if you have any other questions!
[1] Norbert Fuhr. Some Common Mistakes In IR Evaluation, And How They Can Be Avoided. ACM SIGIR Forum, 2017.
[2] https://twitter.com/craig_macdonald/status/1118169091955621888
I can't find anything about the mean average precision (MAP) of your new system (CEDR). Am I missing something, or did you really not measure it? Since it's the most common evaluation metric in IR, I wonder why you didn't even mention it in the paper.
Furthermore, these resources could be relevant to you: