elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Ranking Evaluation API: Allow multiple metrics per request #51680

Closed joshdevins closed 2 months ago

joshdevins commented 4 years ago

The ranking evaluation API currently supports calculating only a single metric per request. We would typically optimize for a single metric, so this interface makes sense; however, we often also want to show or plot a series of other metrics to better understand relevance. Instead of having to rerun the same queries with the same ratings, it would be nice to be able to specify one or more metrics to be calculated in a single request.

For example:

{
  "requests": [ ],
  "metrics": [
    {
      "precision": {
        "k": 20,
        "relevant_rating_threshold": 2,
        "ignore_unlabeled": true
      }
    },
    {
      "mean_reciprocal_rank": {
        "k": 20,
        "relevant_rating_threshold": 2
      }
    }
  ]
}
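
For contrast, getting those two metrics out of the current API means one rank eval call per metric, re-running the same queries and ratings each time. Below is a minimal sketch of that workaround in Python, assuming a local cluster at localhost:9200; the index, query, document IDs and ratings are illustrative only, and metric_score is the single overall score the API returns today.

# One _rank_eval call per metric: the same queries and ratings get
# re-executed for every metric we want to look at.
# Host, index name, document IDs and ratings are illustrative.
import requests

RANK_EVAL_URL = "http://localhost:9200/my-index/_rank_eval"

eval_requests = [
    {
        "id": "jfk_query",
        "request": {"query": {"match": {"title": "jfk"}}},
        "ratings": [{"_index": "my-index", "_id": "doc-1", "rating": 2}],
    }
]

metrics = {
    "precision": {"k": 20, "relevant_rating_threshold": 2, "ignore_unlabeled": True},
    "mean_reciprocal_rank": {"k": 20, "relevant_rating_threshold": 2},
}

scores = {}
for name, params in metrics.items():
    body = {"requests": eval_requests, "metric": {name: params}}
    resp = requests.post(RANK_EVAL_URL, json=body)
    resp.raise_for_status()
    scores[name] = resp.json()["metric_score"]  # one overall score per call

print(scores)  # e.g. {'precision': 0.6, 'mean_reciprocal_rank': 0.5}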
elasticmachine commented 4 years ago

Pinging @elastic/es-search (:Search/Ranking)

sbourke commented 4 years ago

Hi -

I'm taking a look at this. I'd appreciate some feedback on how best to expose this change. Adding support for multiple metrics changes the request/response specs.

Should I ensure that the current request/response specs remain the same while also making it possible to pass and return arrays of metrics? Or would it be OK to change the spec and thus cause a breaking change?

Looking at the underlying implementation of the rank-eval module, my preference is to make the breaking change: EvaluationMetric is passed around in many places, and supporting both single values and arrays for metrics doesn't feel consistent to me.

However, as that changes the API, I'm worried that it might be a faux pas in the Elasticsearch code base.

joshdevins commented 4 years ago

@sbourke My guess is that we would have to make a breaking change but I will ask @cbuescher to comment authoritatively.

cbuescher commented 4 years ago

Or would it be ok to change the spec and thus cause a breaking change

Changing something on the Java layer isn't the biggest concern here. The real issue will be whether we can support this on the REST request and response side in a backward-compatible way. The issue already mentions replacing the single metric object with an array in the request, but on the response side we would also have to return multiple quality_level scores, both at the top level and in the query details. The metric_details section would also need to become an array. We would need to think about whether we can do this in a bwc way, which would be preferred over breaking the API. Making the quality_level outputs in the response arrays would also raise the question of how to map input metrics to the output: should they be keyed, or e.g. appear in the same order as in the request? I think we should get a better understanding around the goals of having multiple metrics and the tradeoffs involved here.
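
For reference, here is roughly the shape of today's single-metric response, written as a Python dict (values are made up; field names follow the metric_score naming that comes up below). The scalar score at the top level and per query, plus the single metric_details object, are the parts that would have to become keyed or turn into arrays to report multiple metrics.

# Rough sketch of the current single-metric rank eval response; values are
# illustrative. The scalar scores and the single metric_details object are
# what a multi-metric response would need to replace or extend.
current_response = {
    "metric_score": 0.6,              # one overall score for the whole request
    "details": {
        "my_query_id1": {
            "metric_score": 0.6,      # one score per evaluated query
            "unrated_docs": [],
            "hits": [],
            "metric_details": {       # breakdown for the single requested metric
                "precision": {
                    "relevant_docs_retrieved": 6,
                    "docs_retrieved": 10,
                },
            },
        },
    },
    "failures": {},
}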

joshdevins commented 4 years ago

The real issue will be whether we can support this on the REST request and response side in a backward-compatible way.

Making the quality_level outputs in the response arrays would also raise the question of how to map input metrics to the output: should they be keyed, or e.g. appear in the same order as in the request?

I think we can go about this by using keyed fields instead of arrays for both the input and the output. We might deprecate/rename the output metric_score to metrics when multiple metrics are requested. Otherwise I think we can do this in a backwards-compatible way.

For example, given two metrics, precision@k and recall@k, we might have the following request/response.

Request:

{
    "requests": [
        {
            "id": "JFK query",
            "request": { "query": { "match_all": {} } },
            "ratings": []
        }
    ],
    "metric": {
        "precision": {
            "k": 20,
            "relevant_rating_threshold": 1,
            "ignore_unlabeled": false
        },
        "recall": {
            "k": 20,
            "relevant_rating_threshold": 1,
            "ignore_unlabeled": false
        }
    }
}

Response:

{
    "rank_eval": {
        "metrics": {
            "precision": 0.6,
            "recall": 0.75
        },
        "details": {
            "my_query_id1": {
                "metrics": {
                    "precision": 0.6,
                    "recall": 0.75
                },
                "unrated_docs": [...],
                "hits": [...],
                "metric_details": {
                    "precision": {
                        "relevant_docs_retrieved": 6,
                        "docs_retrieved": 10
                    },
                    "recall": {
                        "relevant_docs_retrieved": 6,
                        "relevant_docs": 8
                    }
                }
            },
            "my_query_id2": { [...] }
        },
        "failures": { [...] }
    }
}

I think we should get a better understanding around the goals of having multiple metrics and the tradeoffs involved here.

I agree, and I'm happy to discuss that here, in a meta issue, or elsewhere. The purpose is simply to get a more complete picture of query quality. A single metric is often chosen as the one to optimize, but you generally also want some understanding of the tradeoffs. In the precision/recall example, it's often useful to draw a precision/recall curve to decide how to tune based on the business objectives. This might not make sense for all metrics, but I'd argue that you want to see multiple metrics together (e.g. nDCG, MAP, and recall@k) to gain a fuller picture of search relevance. As with other statistics (e.g. the mean), you can't tell the whole story with a single number.
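
As a concrete illustration of that tradeoff, here is a small Python sketch that computes precision@k and recall@k at several cutoffs from a single ranked list of ratings; the ratings, threshold and relevant-document count are made up, and the metric definitions are the usual textbook ones rather than the exact rank eval implementations.

# Several metrics from one ranked result list, no re-querying needed.
# Ratings, threshold and the total relevant count are made up.
ratings_in_rank_order = [3, 0, 2, 2, 0, 1, 0, 3, 0, 0]  # rating of the hit at each rank
relevant_rating_threshold = 2
total_relevant_in_judgements = 6  # relevant docs known from the full judgement set

def precision_at_k(ratings, k, threshold):
    top_k = ratings[:k]
    return sum(r >= threshold for r in top_k) / len(top_k) if top_k else 0.0

def recall_at_k(ratings, k, threshold, total_relevant):
    top_k = ratings[:k]
    return sum(r >= threshold for r in top_k) / total_relevant

for k in (1, 3, 5, 10):
    p = precision_at_k(ratings_in_rank_order, k, relevant_rating_threshold)
    r = recall_at_k(ratings_in_rank_order, k, relevant_rating_threshold,
                    total_relevant_in_judgements)
    print(f"k={k:2d}  precision@k={p:.2f}  recall@k={r:.2f}")
# Plotting these (precision vs. recall as k grows) gives the kind of
# precision/recall curve described above.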

... better understanding around the goals ...

This does raise another concern (which we can move to another GH issue if you'd like) about the design of the API as a single call rather than as stages. Typically you would avoid the kind of problem described in this issue by persisting the search results and computing the metrics after the fact, on that persisted output. This lets you change your metric parameters (e.g. k, as long as k is less than or equal to a limit set in your search DSL) or even switch or add metrics without having to re-execute all the queries, which can be costly and time-consuming. Staging the evaluation in this way would also alleviate the problem of having to rerun all the queries whenever relevance judgements are modified, added, or removed. The only times you should really have to rerun the queries are when (a) the corpus or index changes (mappings, etc.), or (b) the queries change (DSL, query strings, etc.). Happy to discuss this further in another GH issue if we want. A minimal sketch of that staged workflow follows below.
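
This is a minimal sketch of the staged workflow in Python, under the assumptions that the ranked hit IDs from one evaluation query are persisted to a local JSON file and that relevance judgements are kept separately; the file path, IDs and judgements are illustrative.

# Stage 1: run the search once and persist the ranked hit IDs.
# Stage 2: recompute metrics later, with different k or updated judgements,
# without re-executing the query. Paths, IDs and judgements are illustrative.
import json

def persist_hits(query_id, hit_ids, path):
    with open(path, "w") as f:
        json.dump({"query_id": query_id, "hit_ids": hit_ids}, f)

def precision_at_k(hit_ids, relevant_ids, k):
    top_k = hit_ids[:k]
    return sum(h in relevant_ids for h in top_k) / len(top_k) if top_k else 0.0

# Stage 1: the hit IDs below stand in for real search output.
persist_hits("jfk_query", ["d3", "d9", "d1", "d7", "d2"], "/tmp/jfk_query_hits.json")

# Stage 2, possibly much later: judgements and k can change freely here,
# as long as k stays within the number of persisted hits.
with open("/tmp/jfk_query_hits.json") as f:
    staged = json.load(f)

relevant = {"d3", "d1", "d2", "d8"}
for k in (1, 3, 5):
    print(f"precision@{k} = {precision_at_k(staged['hit_ids'], relevant, k):.2f}")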

sbourke commented 4 years ago

Parroting @joshdevins's views on why you might want multiple metrics.

Allowing the module (RankEval) to stage the results of evaluation queries would be one way to avoid changing the current request/response specs, and it would give the end user more freedom.

Did you kick off another thread on what the API would look like?

joshdevins commented 4 years ago

@sbourke I think we need to discuss a bit what our next steps are for the rank eval API. It's been under discussion for a while, so let's not create a new ticket to discuss major changes. We can continue to discuss adding multiple metrics here, though; that should be in scope.

elasticsearchmachine commented 2 months ago

Pinging @elastic/es-search (Team:Search)

javanna commented 2 months ago

This has been open for quite a while and hasn't had much interest. For now I'm going to close it as something we aren't planning to implement. We can reopen it later if needed.