Explore Relevance Based Performance Benchmarks [LUCENE-8841]

asfimport commented 5 years ago

While discussing improvements to the relevance of fuzzy queries with @jimczi, the question came up of how to measure the impact of changes on the relevance of common queries. Although a nontrivial effort, such a benchmark would let us measure the impact of potential changes and catch regressions early.

This Jira tracks ideas and efforts in that direction.

Migrated from LUCENE-8841 by Atri Sharma (@atris), updated Jun 08 2019

asfimport commented 5 years ago

Doug Turnbull (@softwaredoug) (migrated from JIRA)

Big +1, though I suspect it would be very hard! This could be an Apache project in and of itself...

One challenge is that the use cases Lucene serves are tremendously diverse: job search, e-commerce, legal search, enterprise search, news search, Web search, and everything in between and outside the box. You wouldn't want a situation where, for example, you only have an e-commerce test set and end up harming enterprise search users because of decisions made while optimizing for that e-commerce set.

Another challenge is getting reliable relevance judgments. Teams go deep into developing their methodology for creating a golden set of judgments. This, of course, can be a very domain-specific and challenging problem. There's no obvious one-size-fits-all approach. Some teams use human judges, others crowdsource, and others rely heavily on analytics. Some have access to conversion data, others don't. You have all sorts of biases to contend with in every situation. And the judgments evolve over time (today's most relevant iPhone isn't the same as the one from 2 years ago). So getting it right takes a lot of energy and time, even from mature search orgs, and it isn't clear which judgments and data to choose if you want to cover a broad range of use cases.

I think the best case is to partner with organizations that are willing to open up this data alongside their corpus, where we could validate and feel good about the methodology they use to generate judgments. You'd need to update the relevance judgments and corpus over time. There are of course TREC and other academic datasets; that's one data point. Some folks I know at Wikipedia have talked about this. But you'd also want some more commercial datasets (corpus + judgments).

But partnering with orgs would also have limits, as this stuff has very high value to companies... But perhaps they'd be incentivized to open up their data if Lucene was going to make decisions with it that helped them?!?

asfimport commented 5 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

It used to be an Apache project. :) (A Lucene subproject, actually.) https://lucene.apache.org/openrelevance/

asfimport commented 5 years ago

Atri Sharma (@atris) (migrated from JIRA)

Could we consider resurrecting it then?

Regards,
Atri l'apprenant

asfimport commented 5 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Yes, that would be possible.
