elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Better ways to combine relevance signals #23850

Closed jpountz closed 4 years ago

jpountz commented 7 years ago

Say you want to combine full-text scores with geo proximity. We offer function_score as a way to do that, but this is a very manual process, and it's hard to make sure those signals are combined in a way that actually makes sense. On the other hand, research has been conducted on that topic so we should look at how we could integrate better ways to combine relevance signals. I'm mentioning geo since it seems to be the more researched area but ideally we'd address that issue for time-based relevance and other common use-cases.

JnBrymn-EB commented 7 years ago

Is there any documentation about the research? For instance, given the score for some bool query A and the score for some other bool query B, I would very much like to combine these scores in arbitrary ways. Adding, multiplying, some higher function, etc. Will that be possible?

I'm interested in helping with this. Per my previous work with https://github.com/elastic/elasticsearch/pull/19710, I agree that it's time to step back and rethink the API here.

jpountz commented 7 years ago

Admittedly I know very little about the topic but Google scholar seems to find resources when searching for geo bm25 or geo ranking. @s1monw do you have pointers by any chance? I remember you mentioning this in a discussion a couple weeks ago.

erebus1 commented 7 years ago

The possibility of combining function scores in arbitrary ways, at least add and multiply, seems very useful. Looking forward to the implementation.

JnBrymn-EB commented 7 years ago

Here's a great example that I will need soon. We have a machine learning model that looks at items that have been purchased by users and applies topic modeling so that both our users and our items are associated with a handful of topics (cardinality ~=1000). At index time we add the topics associated with the events to a "topics" field. At query time we retrieve the topics for a user and incorporate the topics into a query.

Ideally, the topics would act as a multiplicative boost upon a more traditional search over content filtered by department. However, the only way to incorporate the topics at this time is in a should clause, which creates a score that is additively combined with the content query. This might work, but I would really prefer having a multiplicative boost in my search palette as well.

jillesvangurp commented 7 years ago

I'm currently working on a project related to metallurgical data, with nested properties of different types that have double value ranges at different temperatures. For example, we care about things like yield strength, density, magnetic properties, etc. measured at different temperatures. Our searches query and filter on these and other properties and rank against a given base specification of property ranges at particular temperature ranges.

This is a quite interesting ranking problem and I've had some success applying function scores, but I'm still struggling with getting more control over how things rank and figuring out how to best use function scores and how to combine them. What I'm faced with as a user is things that have either no, very large, or unexpected effects (e.g. a default score of 1 because a value is missing outranks a document with a nearly perfect matching value that is scored lower). The only usable workaround I've found is to use a script score that returns 0. This in turn rules out using multiply, because the 0 would wipe out anything else. Summing things up or averaging sounds nice, but it assumes things are loosely comparable. So normalization would typically be required, and custom logic for combining the scores of multiple sub queries (including function scores) would definitely be something I'd want.

In short, I'd like to do things like normalize values so they have predictable ranges and are guaranteed not to be 0, apply boosts so I can fine-tune which function scores matter more, combine things such that documents with some missing values are ranked lower but don't disappear entirely, take the best score for each multivalued property, etc.

rpedela commented 6 years ago

I think this paper from 2012 largely answers your question @jpountz. To be clear, I am not advocating implementing this specific paper. Rather I think it does a good job giving an overview and literature review of normalized linear combination for relevance ranking. Here are a few relevant quotes.

To produce such a multi-criteria ranking, we first need to score documents in each criterion and then combine these scores to generate a final ranking score for each document. Linear combination of different aspect scores has so far been the dominant approach due to its simplicity and effectiveness [1,2].

Linear combination model requires that the scores to be combined are “comparable” to each other, an assumption that generally does not hold due to the different ways of scoring each relevance criterion.

In previous studies, different methods have been proposed to normalize scores prior to combination [6–11] with the aim of making them more comparable.

Default Ranking Function

I think the default ranking function for function_score should be a normalized linear combination because it is simple, effective, and intuitive.

sum_of_weights = 1.0

default_weight = (sum_of_weights - sum_of_user_weights) / (num_queries + num_funcs)

0.0 <= (norm_query_score * query_weight) + (norm_func_score * func_weight) + ... <= 1.0

The beauty is that setting weights becomes intuitive. A query weight of 0.8 means the query score accounts for 80% of the final score. If function_score had the ability to combine multiple queries, it would also make setting query boosts intuitive. Setting good weights may still be difficult which is where the LTR plugin helps, but at least it is intuitive. Interestingly, the winners of the Yahoo! Learning to Rank Challenge used a linear combination of 12 ranking models so this could help LTR users as well.
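
As a rough sketch (Python, not an existing Elasticsearch API; the helper name, the example signals, and the choice to split the leftover weight evenly across signals without an explicit weight are all my own illustration):

def combined_score(norm_scores, user_weights, sum_of_weights=1.0):
    # norm_scores: signal name -> score already normalized to [0, 1]
    # user_weights: signal name -> explicit weight; the leftover weight is shared
    # evenly by the remaining signals so that all weights sum to sum_of_weights
    unweighted = [name for name in norm_scores if name not in user_weights]
    leftover = sum_of_weights - sum(user_weights.values())
    default_weight = leftover / len(unweighted) if unweighted else 0.0
    return sum(score * user_weights.get(name, default_weight)
               for name, score in norm_scores.items())

# A query weight of 0.8 means the query accounts for 80% of the final score:
combined_score({"query": 0.9, "votes": 0.5}, {"query": 0.8})  # 0.9*0.8 + 0.5*0.2 = 0.82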

Normalization

In function_score, random_score and the decay functions are already normalized to [0,1], and it is up to the user to normalize script_score. What is left is query and field_value_factor. While the paper discusses several normalization approaches, I think dividing by the max score and max value for query and field_value_factor respectively is the most straightforward. Maybe other normalization approaches could be implemented later.

Aren't normalized scores a bad idea?

This Lucene wiki page explains why normalized scores are a bad idea. If the goal is to use the normalized score as a measure of absolute relevance, then yes, I agree it is a bad idea. Ranking metrics such as ERR, MAP, etc. should be used for that purpose. However, in the context of a linear combination of textual relevance, recency, popularity, etc., it is necessary. The different relevance signals must be normalized in order to be comparable and combined.

Implementation

Maybe some defaults need to be changed or some options added, but function_score already has most of the implementation. Or, as suggested in #27588, replace function_score with a set of discrete queries, one of them being a normalized linear combination query. Regardless, the big feature that is missing is normalizing a score.

Admittedly, I know very little about ES internals. However I believe normalization could fit into the current map/reduce query execution structure. On each shard, the query is executed normally except the score is normalized using the shard's max score before being combined with the other relevance signals to give the top N per shard. In the reduce phase, the query is re-normalized using the max score across all shards and then re-combined with the other relevance signals. I believe that would work mathematically and it should only add a slight overhead compared to the existing workaround detailed below.
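
To make that concrete, here is a rough Python sketch of the two-phase idea (purely illustrative, not ES internals; the hit fields and weights are made up, and the static signal is assumed to be normalized already):

def shard_phase(hits, weights, n):
    # normalize the query score by this shard's max, combine, keep the shard's top N
    shard_max = max(h["bm25"] for h in hits)
    for h in hits:
        h["final"] = weights["query"] * (h["bm25"] / shard_max) + weights["votes"] * h["norm_votes"]
    top_n = sorted(hits, key=lambda h: h["final"], reverse=True)[:n]
    return shard_max, top_n

def reduce_phase(shard_results, weights, n):
    # re-normalize with the max score across all shards, then re-combine and re-sort
    global_max = max(shard_max for shard_max, _ in shard_results)
    merged = []
    for _, top_n in shard_results:
        for h in top_n:
            h["final"] = weights["query"] * (h["bm25"] / global_max) + weights["votes"] * h["norm_votes"]
            merged.append(h)
    return sorted(merged, key=lambda h: h["final"], reverse=True)[:n]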

Workaround

There is a workaround, but it is not performant. In order to normalize the query score, you need to get the max query score first which means executing the same query twice.

// first query
{
  "match": {
    "my_field": "with some fancy terms"
  }
}

// second query
{
  "function_score": {
    "boost_mode": "replace",
    "score_mode": "sum",
    "query": {
      "match": {
        "my_field": "with some fancy terms"
      }
    },
    "functions": [
      // normalized query score
      {
        "script_score": {
          "script": {
            "params": {
              "max_score": 34.14
            },
            "source": "_score / params.max_score"
          }
        },
        "weight": 0.8
      },
      // normalized log field_value_factor
      {
        "script_score": {
          "script": {
            "params": {
              "max_votes": 4135
            },
            "source": "Math.log(doc['votes'].value) / Math.log(params.max_votes)"
          }
        },
        "weight": 0.2
      }
    ]
  }
}

rpedela commented 6 years ago

One more thing about query boosts and normalization.

The bool query is, in fact, a linear combination of comparable values: BM25 scores. I think this is why it is so effective. Personally, I find it more effective than dis_max, even though that query certainly has its uses. However, query boosts are not intuitive.

Intuitively, I would expect a boost of 2 to mean that the boosted query is 2x more important than the other queries. The problem is that the bool query will return scores in the 10s, 100s, < 1.0, etc depending on the query and data. Therefore the meaning of boost = 2 changes between queries. Normalization can help here too.

If the scores were all normalized between 0 and 1 and the default boost = 1.0, then query boosts become intuitive. A boost of 2 actually means that the query is 2x more important. Normalization between 0 and 1 would make boosting bool queries intuitive.

seekingpeace commented 6 years ago

@rpedela While working on a solution for getting a weighted average of precomputed scores over filtered documents, the results were highly unnormalized and I thought of the same workaround you discussed. I'm still wondering whether the sorting based on scores would be done after any aggregation. More specifically, can I use a computed aggregation max (the max field value in the filtered documents) in the same query with the help of scripting somehow?

rpedela commented 6 years ago

@seekingpeace Are you asking how to combine an aggregation with the query score, or how to achieve a similar workaround for an aggregation? If the latter, I think a similar workaround for aggregations could be achieved with the bucket script aggregation, but I haven't tried it myself.

seekingpeace commented 6 years ago

@rpedela

How to combine an aggregation with the query score

Yes, is it possible somehow in a single query?

rpedela commented 6 years ago

@seekingpeace As far as I know, you can't combine an aggregation and function_score directly. But you could perform the aggregation in the first query and then input the result as a parameter to one of your functions in the function_score of the second query. Here is the above example with a max aggregation of votes and a filter.

// first query
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "my_field": "with some fancy terms"
        }
      },
      "filter": {
        "term": { "food": "bananas" }
      }
    }
  },
  "aggs": {
    "max_votes": {
      "max": { "field": "votes" }
    }
  }
}

// second query
{
  "function_score": {
    "boost_mode": "replace",
    "score_mode": "sum",
    "query": {
      "bool": {
        "must": {
          "match": {
            "my_field": "with some fancy terms"
          }
        },
        "filter": {
          "term": { "food": "bananas" }
        }
      }
    },
    "functions": [
      // normalized query score
      {
        "script_score": {
          "script": {
            "params": {
              "max_score": 34.14
            },
            "source": "_score / params.max_score"
          }
        },
        "weight": 0.8
      },
      // normalized log field_value_factor
      {
        "script_score": {
          "script": {
            "params": {
              "max_votes": AGGREGATED_MAX_VOTES
            },
            "source": "Math.log(doc['votes'].value) / Math.log(params.max_votes)"
          }
        },
        "weight": 0.2
      }
    ]
  }
}

talevy commented 6 years ago

Pinging @elastic/es-search-aggs

jpountz commented 6 years ago

Thanks for contributing to the discussion @rpedela !

Linear combination of different aspect scores has so far been the dominant approach due to its simplicity and effectiveness

I agree with that part. It turns out that Lucene is also getting improvements for computing the top-k hits when the score is a linear combination (block-max WAND). This is interesting because it means that more subtle scoring that takes several features into account could also run on the whole collection and not only as a rescorer.

This Lucene wiki page explains why normalized scores are a bad idea. If the goal is to use the normalized score as a measure of absolute relevance, then yes, I agree it is a bad idea. Ranking metrics such as ERR, MAP, etc. should be used for that purpose. However, in the context of a linear combination of textual relevance, recency, popularity, etc., it is necessary. The different relevance signals must be normalized in order to be comparable and combined.

It is not totally obvious to me that scores need to be normalized in order for the combination of static scoring factors with bm25 scores to be meaningful. For instance say you search for restaurants and compute final_score = bm25_score + alpha * log(1 + 1 / geodistance) (random example). Not normalizing bm25_score means that the bm25 score will have a greater weight relatively to the geo distance when searching for rare terms, ie. when the information need is specific (eg. senegalese restaurant), and that the geo distance will have a greater relative weight when searching for common terms (eg. pizza). I'm not saying it is the right thing to do, but it's not absurd either. I read papers that went both ways.

There is also a practical issue with computing the maximum score. It would be perfectly fine in a rescoring context, but if you wanted to do it on the entire document collection you would have to make approximations or run the query twice? Thinking out loud, this might mean that we should specialize how to combine relevance signals depending on whether you are rescoring or not.

mayya-sharipova commented 6 years ago

@jpountz Very informative comments and example about normalization. I agree with you that if we find a good way to combine relevance signals (in your example, the alpha parameter and log function), then normalization is not necessary. The challenge is to find these parameters. I have encountered research papers where sophisticated ML algorithms and genetic programming have been used to find optimized parameters and functions. So an argument for normalization can be that it allows us to skip this expensive step.

@rpedela About the arguments against normalization: normalized results may not follow the exact same patterns as their raw results. And also, as @jpountz has already pointed out, normalization can be an expensive thing to do on the ES side. Normalization algorithms require knowing not only the maximum score but also the minimum score, and for the whole document collection this could be an expensive operation.

I agree with @jpountz, it is worth exploring if we can do normalization only for the rescoring part.

JnBrymn-EB commented 6 years ago

Great discussion!

I am aware that most of the IR community uses linear combination of signals, but I've always been wary of the implications. Let's go back to a slightly modified version of @jpountz's example:

final_score = bm25_score + alpha * (1 + exp(- beta * geodistance))

Quoting @jpountz:

Not normalizing bm25_score means that the bm25 score will have a greater weight relatively to the geo distance when searching for rare terms, ie. when the information need is specific (eg. senegalese restaurant), and that the geo distance will have a greater relative weight when searching for common terms (eg. pizza)

That sounds reasonable, but consider this corner case that, for eventbrite.com, isn't a corner case at all: query=beer. This is a common term in our index because lots of events mention "beer". But if a user makes this query they want an event about beer (festivals, classes, etc.). Since "beer" is a common term, the final_score above will be dominated by the geodistance and rather than returning the beer festival 20 miles away, the results will be dominated by business luncheons, speed dating events, and music shows that offer "beer" but really aren't about beer at all.

The way that I've gotten around this is a bit non-standard, but it makes theoretical sense and it seems to work. I multiply the terms together:

final_score = bm25_score * (1 + alpha * exp(- beta * geodistance))

In this case, I think of the bm25_score as a measure of "how good this matches the user's needs content-wise" and the multiplied term as a boost on top of that. The multiplier's default value is 100% (e.g. bm25_score * 1.0), and the closer the event is, the bigger the bonus. So if alpha = 0.10 then the final score for a super close event is 110% of the base bm25_score.
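
For illustration only, with made-up alpha and beta values (a sketch, not our production settings):

import math

def final_score(bm25_score, geodistance, alpha=0.10, beta=0.05):
    # multiplicative boost: a nearby event gets up to a 10% bonus on its bm25 score
    return bm25_score * (1 + alpha * math.exp(-beta * geodistance))

final_score(10.0, 0)    # 11.0   -> a zero-distance event gets the full 10% bonus
final_score(10.0, 100)  # ~10.01 -> a far-away event keeps roughly its base score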

I'm not entirely comfortable with this. Linear combination of signals is certainly easier in Elasticsearch (this is why I worked with Brita a year ago to extend Elasticsearch's capabilities). So I'm interested in learning others' opinions.

What do you guys think about my approach (feel free to poke holes)? And can you refer me to papers that I should read regarding this topic?

mayya-sharipova commented 6 years ago

@JnBrymn-EB Thanks for outlining your search problem, it is very interesting indeed. Your scoring formula makes sense to me, and I think this was the intuition behind the original development of the ES Function Score Query. Hence we have boost_mode to boost the score with some other signals, and the default operation for boost_mode is multiplication. But I would also add that your formula seems to be based more on heuristics, and may not be the best for all use cases.

The papers on this topic that we encounter that argue for linear combination of score and other signals are:

JnBrymn-EB commented 6 years ago

Thanks @mayya-sharipova

jpountz commented 6 years ago

@JnBrymn-EB Your example makes sense to me. I also like it because it highlights the benefit of taking static scoring signals into account only in the rescoring phase: since you are only considering documents that have a good base bm25 score, there are fewer opportunities for making mistakes. I'm not saying nobody should score based on features on the entire collection, but if you are going that road, it's definitely important to be careful about the implications.

rpedela commented 6 years ago

Definition of "better"

I think it is useful to define what "better" means. To me, it is making the manual combination of several, distinct relevance signals intuitive. Intuitive is the key word because practically it means I can spend less time on this task and move on.

Implementation

I realize there are challenges to normalization, but I doubt they are insurmountable. However, let's table that discussion for the moment. First, I think we should figure out if it is even worth it. Will it help? Obviously I think so, but nonetheless I think implementation is a separate discussion.

Unnormalized Addition

It is not totally obvious to me that scores need to be normalized in order for the combination of static scoring factors with bm25 scores to be meaningful. For instance say you search for restaurants and compute final_score = bm25_score + alpha * log(1 + 1 / geodistance) (random example). Not normalizing bm25_score means that the bm25 score will have a greater weight relatively to the geo distance when searching for rare terms, ie. when the information need is specific (eg. senegalese restaurant), and that the geo distance will have a greater relative weight when searching for common terms (eg. pizza). I'm not saying it is the right thing to do, but it's not absurd either. I read papers that went both ways.

Yes that example is reasonable and certainly not absurd, but it is also simple. As complexity increases, it gets harder and harder to reason about how the different relevance signals combine to compute the final score. Even your example can be difficult to get right if the BM25 query is sufficiently complex. For a real-world example, I am working on SEC filings search and the BM25 query is essentially this:

{
  "multi_match": {
    "fields": [
      "company_name",
      "company_name.shingles",
      "company_name.synonyms",
      "title",
      "title.stemmed",
      "title.shingles",
      "body",
      "body.stemmed"
    ],
    "type": "most_fields"
  }
}

It is a simplified version, but nonetheless I think it shows the complexity of combining BM25 with geo distance or any other static signal. There are thousands of unique queries and depending on the keywords, a query's max BM25 score can range from 1 to 1,000. A common term that matches every field will tend to have a higher score than a rare term that only matches 1 or 2 fields. Using best_fields instead would dampen the scoring effect of matching every field, but it turns out that most_fields performs better on average for my use case. The BM25 score is then combined with popularity, recency, page rank, and a few other static signals. Let's assume a similar BM25 query with the same max score range is used in your restaurant example and let's do some math.

Top 3 documents for "senegalese restaurant" for BM25 only
1. Awesome Senegalese, BM25 1000, geo_distance 30
2. Bob's Senegalese, BM25 990, geo_distance 10
3. Senegalese Express, BM25 980, geo_distance 1

I would argue that all 3 restaurants are highly relevant and we want to order them by geo_distance. What happens when we apply your equation assuming alpha is 1.0?

Top 3 documents for "senegalese restaurant" for BM25 and geo_distance
1. Awesome Senegalese: 1000 + 1 * log(1 + 1 / 30) = 1000.014
2. Bob's Senegalese: 990 + 1 * log(1 + 1 / 10) = 990.041
3. Senegalese Express: 980 + 1 * log(1 + 1 / 1) = 980.301

Well that didn't work. What happens if we normalize the BM25 score?

Top 3 documents for "senegalese restaurant" for BM25 normalized and geo_distance
1. Senegalese Express: (980 / 1000) + 1 * log(1 + 1 / 1) = 1.281
2. Bob's Senegalese: (990 / 1000) + 1 * log(1 + 1 / 10) = 1.031
3. Awesome Senegalese: (1000 / 1000) + 1 * log(1 + 1 / 30) = 1.014

That did work, but why? Without normalization, BM25 overwhelms geo_distance. By normalizing the BM25 scores, they have the same scale as geo_distance and thus are comparable. But couldn't we have just set alpha appropriately instead?

Top 3 documents for "senegalese restaurant" for BM25, geo_distance, alpha = 1000
1. Senegalese Express: 980 + 1000 * log(1 + 1 / 1) = 1281.030
2. Bob's Senegalese: 990 + 1000 * log(1 + 1 / 10) = 1031.393
3. Awesome Senegalese: 1000 + 1000 * log(1 + 1 / 30) = 1014.240

The answer is yes. But can we use the same alpha when the max BM25 score changes significantly? Let's make max_bm25_score = 10.

Top 4 documents for "senegalese restaurant" for BM25 max_score = 10, geo_distance, alpha = 1000
1. Senegalese Express: 9.8 + 1000 * log(1 + 1 / 1) = 310.830
2. Pizza Restaurant: 1.0 + 1000 * log(1 + 1 / 1) = 302.029
3. Bob's Senegalese: 9.9 + 1000 * log(1 + 1 / 10) = 51.293
4. Awesome Senegalese: 10 + 1000 * log(1 + 1 / 30) = 24.240

That worked except we now have a pizza place with rank 2 because it is so close geographically even though bm25 = 1.0. Now geo_distance overwhelms BM25. What if we just normalize rather than setting alpha?

Top 3 documents for "senegalese restaurant" for BM25 normalized max_score = 10, geo_distance, alpha = 1
1. Senegalese Express: (9.8 / 10) + 1 * log(1 + 1 / 1) = 1.281
2. Bob's Senegalese: (9.9 / 10) + 1 * log(1 + 1 / 10) = 1.031
3. Awesome Senegalese: (10 / 10) + 1 * log(1 + 1 / 30) = 1.014
...
N. Pizza Restaurant: (1.0 / 10) + 1 * log(1 + 1 / 1) = 0.401

Again that works. Sure, we could have set alpha again and get the desired result. But you don't have to do that with normalization. Practically speaking, no one is going to manually analyze 1000s of unique queries to find the best weights for each query. They could use ML, but generating a good training dataset isn't always practical either. Normalization allows a user to manually set good weights once regardless of the max BM25 score range.
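
For what it's worth, the arithmetic above can be reproduced with a few lines of Python (base-10 logs, alpha = 1); the ranking comes out the same whether the raw BM25 scores top out at 1000 or at 10:

import math

def normalized_score(bm25, max_bm25, geo_distance, alpha=1.0):
    return bm25 / max_bm25 + alpha * math.log10(1 + 1 / geo_distance)

docs = [("Awesome Senegalese", 1000.0, 30), ("Bob's Senegalese", 990.0, 10), ("Senegalese Express", 980.0, 1)]
for max_bm25 in (1000.0, 10.0):
    scale = max_bm25 / 1000.0  # rescale the raw BM25 scores to the new range
    ranked = sorted(docs, key=lambda d: normalized_score(d[1] * scale, max_bm25, d[2]), reverse=True)
    print([name for name, _, _ in ranked])
# both runs print: ['Senegalese Express', "Bob's Senegalese", 'Awesome Senegalese']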

Unnormalized Multiplication

@JnBrymn-EB uses multiplication instead of addition. I believe multiplication is recommended by the Solr community for combining different relevance signals. He also uses exp but the primary difference is multiplication. Let's try the restaurant example with multiplication instead.

Top 3 documents for "senegalese restaurant" for BM25, geo_distance, alpha = 1
1. Senegalese Express: 980 * 1 * log(1 + 1 / 1) = 295.009
2. Bob's Senegalese: 990 * 1 * log(1 + 1 / 10) = 40.979
3. Awesome Senegalese: 1000 * 1 * log(1 + 1 / 30) = 14.240

Yes, that works! And it didn't require normalization. Why? In this case, geo_distance is the weight for the BM25 score and scales it accordingly. Okay great, but will this continue to work when we add additional static signals? What happens when we want to improve relevance by adding avg_review_score on a 5 point scale?

avg_review_score
Senegalese Express: 1
Bob's Senegalese: 5
Awesome Senegalese: 3

Top 3 documents for "senegalese restaurant" for BM25, geo_distance, alpha = 1, avg_review_score
1. Senegalese Express: 980 * 1 * log(1 + 1 / 1) * 1 = 295.009
2. Bob's Senegalese: 990 * 1 * log(1 + 1 / 10) * 5 = 204.894
3. Awesome Senegalese: 1000 * 1 * log(1 + 1 / 30) * 3 = 42.721

Hmm that doesn't seem to work. I am not sure where Awesome Senegalese should rank in this scenario, but Bob's Senegalese should definitely be ranked higher than Senegalese Express. No one wants a restaurant with an average rating of 1 as the top result. What if we apply normalization to both the BM25 score and review score, but still use multiplication?

Top 3 documents for "senegalese restaurant" for BM25, geo_distance, alpha = 1, avg_review_score
1. Senegalese Express: (980/1000) * 1 * log(1 + 1 / 1) * (1/5) = 0.059
2. Bob's Senegalese: (990/1000) * 1 * log(1 + 1 / 10) * (5/5) = 0.041
3. Awesome Senegalese: (1000/1000) * 1 * log(1 + 1 / 30) * (3/5) = 0.009

Still didn't work. What if we use addition and normalization?

Top 3 documents for "senegalese restaurant" for BM25, geo_distance, alpha = 1, avg_review_score
1. Bob's Senegalese: (990/1000) + 1 * log(1 + 1 / 10) + (5/5) = 2.031
2. Awesome Senegalese: (1000/1000) + 1 * log(1 + 1 / 30) + (3/5) = 1.614
3. Senegalese Express: (980/1000) + 1 * log(1 + 1 / 1) + (1/5) = 1.481

Yes, that looks a lot better. We could of course set the weights without normalization to get a similar result, but you would have to set the weights differently as the BM25 score's scale changes, as I showed previously.

Setting weights

I talked about this in my previous comments, but to reiterate normalized linear combination makes setting weights logical / intuitive. If the sum of all weights = 1, then any given weight represents the percentage that signal contributes to the final score. If the sum of all weights > 1, then the weights act like boosts. A weight of 2 means that the signal contributes twice as much to the final score relative to other signals.

Summary

I am sure you could find an example set of numbers where normalization and addition do not magically work, or where normalization is not needed; however, that isn't the point. I am attempting to illustrate that when the scales of the different signals differ significantly, or when the BM25 score scale changes significantly between queries, the weights have to be updated to prevent one signal from overwhelming the others. Practically speaking, having to change the weights for every query or as the data changes is not intuitive. It is not better. Normalization creates one scale no matter the query or data, meaning I only have to find one set of good weights. It is more intuitive. I think it is better.

rpedela commented 6 years ago

Normalization algorithms require knowing not only the maximum score but also the minimum score, and for the whole document collection this could be an expensive operation.

@mayya-sharipova Dividing by the max score is turning the score into a percentage. The min value then becomes zero. There may be other more advanced normalization algorithms that require finding the minimum score, but in my experience simply dividing by the max score works well enough.

mayya-sharipova commented 6 years ago

@rpedela thanks for such a good, illustrative write-up. It nicely illustrates the problem and the need for normalization. About the implementation: we want to implement something that 1) addresses a broad range of use cases, 2) doesn't put many constraints on the system (CPU, memory), and 3) is ideally backed by the literature from the Information Retrieval field.
We are still in the process of finding a good way to use normalization.

JnBrymn-EB commented 6 years ago

Wow. Lots of good stuff here @rpedela. Great examples that make me think.

Here's roughly where I keep landing though. I really want the type of normalized signals that you are talking about, but I'm not sure they are achievable. It's not that the technology isn't there - it's easy enough to divide by max score - but rather I don't know that an absolute score is even knowable. I feel we are only left with relative scores.

I'll try to explain why I feel this way. Let's define "absolute relevance score" (ARS... I like that) as a number between 0 and 1 where 0 is "of no relevance" and 1 is "of best possible relevance". And let's define "relative relevance score" (RRS) as the same thing as the bm25_score described above. Even at this point we can see some issues, in that "of best possible relevance" seems pretty ill defined. But let me give a concrete example showing that the ARS isn't knowable. Let's say that you're hungry and you make a query for "hamburger" in our search app. You get 3 results back whose RRS are a) 10.0, b) 20.0, and c) 40.0. In the conversation to this point we haven't really disputed bm25 as being a pretty good ordinal ranking algorithm. So we "know" that c) is better than b), which is better than a).

But now let's consider how we might find an ARS for this. We don't really know the absolute relationship between a), b), and c). But if an absolute score can be found, then we'd probably say that c) is twice as good as b), and b) is twice as good as a). But we still don't know how good they are compared to "the best possible relevance". If we divide by the max value in the list, then we make an implicit assumption that c) is the best in the world, ARS = 1.0 (and b) is half as good and a) is a quarter as good). And then if we want to incorporate a different signal into this, like the rating, it would be easy enough to just use a linear combination like @jpountz presented yesterday.

Here's the kicker in my silly example. You searched for "hamburger", but let's say that you accidentally searched for "hamburger" in the IMDB search app. "hamburger" is a rare term in an index of movies, so bm25 will score it highly (again proving that bm25 is a horrible measure of absolute relevance). And even though all docs matching "hamburger" get high scores compared to typical searches, they are all of low relevance. Document c) is not the most relevant document in the world as our assumption implied. A "truer" normalization would be something like "divide by 400" rather than "divide by 40", and we would have sorted the results mostly by rating rather than text relevance.

This is an extreme example, but it makes my point. How do we know the difference between results that score highly because they are relevant and results that score highly because the corpus just doesn't have much support for a particular term?

JnBrymn-EB commented 6 years ago

Hm... but you're making me think @rpedela. Maybe you don't want the absolute measure of relevance so much as you want the text portion of the score to fall between 0 and 1. Basically you say "I don't care how relevant these results are... they are at least the most relevant things we have to offer based upon text, and all other signals (distance, rating, etc.) are subordinate to text relevance." In some cases I actually think that's pretty reasonable! (Actually, I think I was getting at this with my example.)

So what we want is a way of taking the result of an aggregation (in this case max(bm25_score)) and using the output of the aggregation in the score of the documents. The tough part is that we would have to run through the result set twice, which means CPU intensive OR memory intensive. That's rough - though on the other hand maybe we don't care in some cases.

Is this basically what you're lobbying for @rpedela?

rpedela commented 6 years ago

Is this basically what you're lobbying for @rpedela?

@JnBrymn-EB Yes. I want to point out that the static signals don't necessarily have to be subordinate to text relevance. It depends on the use case. Here is an example equation if you want text relevance to account for 40% of the final score and popularity to account for 60%.

final_score = 0.4 * normalize(bm25_score) + 0.6 * normalize(log(popularity))
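
As a small Python sketch of that split (max_bm25_score and max_popularity are assumed to be known up front, e.g. from a first query or an aggregation as in the workaround above; the 1 + popularity inside the log is just to avoid log(0)):

import math

def final_score(bm25_score, max_bm25_score, popularity, max_popularity):
    # text relevance contributes 40% of the final score, popularity 60%
    return (0.4 * (bm25_score / max_bm25_score)
            + 0.6 * (math.log10(1 + popularity) / math.log10(1 + max_popularity)))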

The tough part is that we would have to run through the result set twice, which means CPU intensive OR memory intensive.

Yes, which is why I would love to see this implemented in the server. The workaround example I posted in my first comment executes the same query twice client-side. Server-side can avoid an extra network round-trip and other overhead.

Let's define "absolute relevance score" (ARS... I like that) as a number between 0 and 1 where 0 is "of no relevance" and 1 is "of best possible relevance".

If you are interested in measuring absolute relevance, then I suggest taking a look at the new rank evaluation API. There are three ways to get relevance judgments which can be used as input to the evaluation API: manual, crowd-source, or convert clicks. You typically need a large amount of click data, but if you have it then you can use PyClick to create a click model that can be used to generate relevance judgments.

rpedela commented 6 years ago

@jpountz I think exposing a _max_score variable through rescoring would be amazing. However I have two concerns. First, there are features which do not support rescoring such as collapsing. Second, window_size is an absolute value. I would like to see a relative_window_size (or whatever) added so one could apply _max_score to all results from the main query. There are definitely cases where a particular document may have a low BM25 score but a high popularity, recency, or geo distance score which would put the document in the top 10 depending on the weights.

prateeksachan commented 5 years ago

I was going through LTR models and I stumbled on this LTR module from Apache Solr (https://lucene.apache.org/solr/guide/6_6/learning-to-rank.html) which seems to give a workaround for normalizing scores.

It mentions a Normalizer class which could be used to normalize values (like query scores). Does something like this exist for Elasticsearch?

Normalizer | Class              | Example parameters
MinMax     | MinMaxNormalizer   | {"min":"0", "max":"50"}
Standard   | StandardNormalizer | {"avg":"42", "std":"6"}
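
For reference, this is roughly what those two normalizers compute, using the example parameters above (a Python sketch, not Solr or ES code):

def min_max_normalize(score, min_value=0.0, max_value=50.0):
    # maps scores in [min_value, max_value] onto [0, 1]
    return (score - min_value) / (max_value - min_value)

def standard_normalize(score, avg=42.0, std=6.0):
    # z-score: how many standard deviations the score is from the average
    return (score - avg) / std
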
jpountz commented 5 years ago

@rpedela Sorry for the late reply. Agreed, rescoring has limits. This is exactly why we have been working on things like the feature and feature_vector fields, which still allow computing top hits efficiently (i.e. without visiting all matches) when combining BM25 scores with recency, geo-distance, pagerank or popularity.

@prateeksachan Normalization may be appealing, but you need to know about all values in advance to be able to normalize, which is costly to compute. So we are sticking to ways to tune scoring that don't require collecting all values.

frankcarey commented 5 years ago

Yes, we'd also like to combine scores in a linear fashion, and the wide range of BM25 scores that are returned makes things a challenge, even when I have b=0 and k=0 and multiple words are used in the query.

jpountz commented 4 years ago

Obviously this is something that can always be improved, but I'll consider this done thanks to the various features that have been added over the past months.

These features have been designed based on Craswell, Nick, et al. "Relevance weighting for query independent evidence", which addresses the normalization issue by giving features a contribution that behaves like the contribution of the term frequency in BM25 scores.
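
Roughly, the contribution such a feature gets has the same saturating shape as BM25's term-frequency component (a sketch; the pivot is the feature value at which the contribution reaches half of its maximum):

def feature_contribution(feature_value, pivot):
    # grows with the feature value but stays below 1; equals 0.5 when value == pivot
    return feature_value / (feature_value + pivot)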