Better cross-field support

jimczi commented 5 years ago

Lucene has a new query in the sandbox called BM25FQuery. It is similar to the BlendedTermQuery that we use for cross-fields search, but the new query also merges the document statistics (freq + norm) in a way that preserves the benefits of the BM25 formula (term frequency saturates quickly). There is some work to be done on the Lucene side to improve the integration of this query: https://issues.apache.org/jira/browse/LUCENE-8710 and https://issues.apache.org/jira/browse/LUCENE-8711. However, we could already replace the BlendedTermQuery with BM25F, since the main logic is in place and should already improve the ranking of documents when cross-fields mode is used in a query.
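To illustrate the saturation point, here is a minimal Python sketch comparing two ways of scoring one term that occurs in several fields: scoring each field separately and adding the results, versus merging freq and length first and saturating once. The constants, the way the average length is combined, and all function names are assumptions made for illustration; this is not Lucene's implementation.

```python
import math

K1 = 1.2   # assumed BM25 defaults
B = 0.75

def idf(doc_count, doc_freq):
    """BM25-style inverse document frequency."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def saturated_tf(freq, length, avg_length):
    """Term-frequency component of BM25; always stays below 1."""
    norm = K1 * (1 - B + B * length / avg_length)
    return freq / (freq + norm)

def per_field_sum(fields, doc_count, doc_freq):
    """Score every field on its own and add the scores (no shared saturation)."""
    return sum(
        idf(doc_count, doc_freq) * saturated_tf(f["freq"], f["len"], f["avg_len"])
        for f in fields
    )

def bm25f(fields, doc_count, doc_freq):
    """Merge freq and length across fields first, then saturate once."""
    freq = sum(f["freq"] for f in fields)
    length = sum(f["len"] for f in fields)
    avg_length = sum(f["avg_len"] for f in fields)  # assumption: averages simply add up
    return idf(doc_count, doc_freq) * saturated_tf(freq, length, avg_length)
```

With per-field summing, each additional matching field adds a fresh, unsaturated contribution, so the total can exceed the IDF. The merged variant keeps the term's total contribution below the IDF no matter how many fields match.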

elasticmachine commented 5 years ago

Pinging @elastic/es-search

markharwood commented 5 years ago

Would this preserve the small bias cross-fields gives towards the "right" field for a term? Rather than completely flattening DF across all fields, cross-fields gives the most popular field the blended DF but gives all other fields the blended DF + 1 to ensure they rank lower. I think that's a subtle but important feature to maintain.
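The bias described here can be sketched in a few lines of Python. The sketch assumes the "blended DF" is the maximum DF across the fields and uses made-up counts; the function names are hypothetical and do not mirror BlendedTermQuery's actual code.

```python
import math

def idf(doc_count, doc_freq):
    """BM25-style inverse document frequency."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def blended_doc_freqs(doc_freqs):
    """Map field name -> DF used for scoring.

    Assumption: the blended DF is the largest per-field DF. The most popular
    field keeps it; every other field gets blended DF + 1, so it scores
    slightly lower.
    """
    top_field = max(doc_freqs, key=doc_freqs.get)
    blended = doc_freqs[top_field]
    return {
        field: blended if field == top_field else blended + 1
        for field in doc_freqs
    }

# Made-up counts: "mark" is common as a first name and rare as a last name.
raw = {"firstname": 900, "lastname": 12}
blended = blended_doc_freqs(raw)
scores = {field: idf(10_000, df) for field, df in blended.items()}
```

Because IDF decreases as DF grows, the extra +1 gives the less popular field a marginally smaller IDF, which is the small bias in question.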

jimczi commented 5 years ago

Would this preserve the small bias cross-fields gives towards the "right" field for a term?

No, the BM25FQuery can deboost any field but there is no extra bias based on the inner frequency of each field. I am not sure why this would be an important feature to maintain; it seems arbitrary to me to pick the "right" field as the most popular one. What is the benefit? I think that picking the boost per field should be done based on prior knowledge (title is more important than body) and/or automatic learning. We could also automatically deboost all fields except the most popular one, but I fail to see why this would be better.

markharwood commented 5 years ago

it seems arbitrary to me to pick the "right" field as the most-popular one. What is the benefit ?

Searching for Mark Harwood across firstname and lastname fields should certainly favour any firstname:Mark over a lastname:Mark. Cross-fields was originally created because in these sorts of scenarios IDF would (annoyingly) ensure exactly the wrong field for a term was ranked highest.
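A small Python sketch of the problem, using made-up document frequencies: when each field is scored with its own IDF, the field where the term is rare wins.

```python
import math

def idf(doc_count, doc_freq):
    """BM25-style inverse document frequency."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

DOC_COUNT = 10_000
# Hypothetical counts for the term "mark" in each field.
doc_freqs = {"firstname": 900, "lastname": 12}

# Per-field IDF, with no blending at all.
per_field_idf = {field: idf(DOC_COUNT, df) for field, df in doc_freqs.items()}

# Fields ordered from highest to lowest IDF.
ranked = sorted(per_field_idf, key=per_field_idf.get, reverse=True)
```

Here `ranked` puts lastname first, because "mark" is rare there, even though firstname is the intended context.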

jimczi commented 5 years ago

I agree that in this case lastname should not be preferred just because Mark is rare in that context, but I don't think we should automatically boost the firstname. With default BM25F, a document that contains the term in the firstname field would have the same score as a document that contains the same term in the lastname field. If we start applying an automatic boost on top of the default, we'll have trouble mixing it with the explicit boost that the user can set on individual fields.

markharwood commented 5 years ago

a document that contains the term in the firstname field would have the same score than a document that contains the same term in the lastname field.

That was the problem I was hoping to avoid.

we'll have issues to mix them with the explicit boost that the user can set on individual fields.

We very nearly didn't provide the facility for user-supplied boosts with cross-fields. My preference was to return an error if they were supplied. We need to think about the reasons for having different fields in the first place:

1) Summary/detail - this is the title vs body scenario. Terms do not change their meaning when moved from one field to the other.
2) Semantic context - this is the firstname vs lastname scenario. Terms do change their meaning when moved from one field to the other.

In scenario 1) it is appropriate for an elastic admin to say "boost field X over Y for all queries". In scenario 2 it is more appropriate for the terms in each user query to be considered by the cross-fields scoring on a case-by-case basis, determining what is the correct field context and biasing matches towards that. The only reason we kept support for any admin-supplied field boosts in cross-fields was where the admin was (ab)using cross-fields' flattening of IDF while tackling scenario 1). This muddied the waters in my view. For me cross-fields was always about scenario 2.

jimczi commented 5 years ago

Ok, thanks for explaining. IMO BM25F is really about 1) and I wonder if 2) could be tackled differently. For instance, if you have 2 fields, street and city, a query for oxford street would boost oxford in the city field because it seems more popular there. There are also counter-examples like yours, but I think we should decorrelate these use cases. The semantic context case could use BM25F to match documents, but I feel like it needs to do more than just analyze the idf of individual terms, while for 1) BM25F can be used directly.

markharwood commented 5 years ago

I think "cross-fields" could perhaps have been better named as "correct field" - meaning it tries to determine which is the appropriate semantic context for each term and boosts accordingly.

Maybe this "correct field" technique should only apply to single-term fields for now. In your street vs city example the term oxford is further qualified by other terms in the field e.g. road/rd/street/st. If the field only has one term there's no additional context to consider and I suspect the existing approach works well. For multi-term fields I guess we could get into more complex attempts at query-understanding using shingles or phrase queries but that's perhaps for another day.

jtibshirani commented 3 years ago

Here's an idea for a path forward:

1) Generalize BM25FQuery to handle most similarities. This would allow us to handle all the built-in similarities, including our default LegacyBM25Similarity. (LUCENE-9725)
2) Introduce a new option combined_fields that uses this query. If there's a built-in similarity as default, we can use it directly. If there are custom similarities or per-field similarities, fall back to the default LegacyBM25Similarity.
3) Deprecate cross_fields in favor of the other options.

This would already provide an alternative to cross_fields in most situations. Users can switch onto it to avoid the broken cross_fields scoring.

An alternative would be to avoid a new option, and just modify cross_fields to use the new query whenever possible. This seemed like a worse direction, since we'd change the scoring strategy without alerting users. It also feels confusing that the same mode could use two different scoring models.

jpountz commented 3 years ago

This sounds like a plan @jtibshirani :+1:.

jtibshirani commented 3 years ago

Great, I'll work on this. We can consider LUCENE-8710 and LUCENE-8711 as a separate follow-up. Note I edited the plan above to remove a step that didn't make sense.

jtibshirani commented 3 years ago

After working through an initial version and more discussion, I think this addition would fit better as a new query type combined_field instead of a new multi_match mode.

The reasoning is that multi_match is designed to provide a good, lenient default when searching over many fields. Like the match query it handles most field types, even non-text ones, and is fine if text fields do not share the same analyzer. It also helps power our query_string logic, which accepts queries and fields directly from the user and needs to be very flexible.

In contrast, BM25F is designed to combine scores across text fields. The field boosts have a very specific meaning: they must be at least 1.0f and represent a multiplier on term frequency in the combined synthetic field. As @jimczi noted in this comment, there is not a clear way to incorporate scoring contributions from non-text types. There are also several options on multi_match that don’t apply. By introducing BM25F as part of multi_match we may be trying to solve too many cases at once: principled text-field scoring, plus providing a lenient default for searching over all fields.
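As a rough illustration of what "a multiplier on term frequency in the combined synthetic field" could look like, here is a small Python sketch. The helper name is hypothetical, and weighting the field length by the same boost is an assumption (it treats a boost of 2 as if the field's text appeared twice); it is not taken from Lucene's code.

```python
def combine_term_stats(fields, boosts):
    """Combine per-field stats for one term into synthetic-field stats.

    fields: field name -> {"freq": term frequency, "len": field length}
    boosts: field name -> boost, which must be at least 1.0
    Returns (combined_freq, combined_length).
    """
    for name, boost in boosts.items():
        if boost < 1.0:
            raise ValueError(f"boost for {name!r} must be at least 1.0, got {boost}")

    # The boost multiplies the term frequency of its field.
    freq = sum(boosts[name] * stats["freq"] for name, stats in fields.items())
    # Assumption: the field length is weighted the same way.
    length = sum(boosts[name] * stats["len"] for name, stats in fields.items())
    return freq, length


stats = {"title": {"freq": 1, "len": 5}, "body": {"freq": 2, "len": 100}}
combined = combine_term_stats(stats, {"title": 2.0, "body": 1.0})
```

Under this reading, a boost below 1.0 has no natural interpretation, which is consistent with the requirement that boosts be at least 1.0f.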

The new combined_field query would have this behavior:

1) It would only accept text fields. Because of this requirement, the query would not fall back to index.query.default_field.
2) It would omit options that don’t apply like tie_breaker, phrase_slop, fuzziness.
3) For now, we would require all fields to have the same analyzer. We wouldn’t adopt the current strategy in cross_fields where we group fields by analyzer and combine their scores through dis_max. (Maybe we could improve this later by recognizing when analyzers produce the same positions and allowing those fields to be searched together.)
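The three rules above could be checked roughly as follows. This is a hypothetical Python sketch of the validation logic only: the function name, the shape of `mappings`, and the assumed default analyzer name are all illustrative and are not Elasticsearch internals.

```python
# Options named in the proposal as not applicable to the new query.
UNSUPPORTED_OPTIONS = {"tie_breaker", "phrase_slop", "fuzziness"}

def validate_combined_field_request(fields, mappings, options):
    """Raise ValueError if the request breaks one of the proposed rules.

    fields:   list of field names to search
    mappings: field name -> {"type": ..., "analyzer": ...} (hypothetical shape)
    options:  dict of query options supplied by the user
    """
    # Rule 1: explicit text fields only, no fallback to a default field.
    if not fields:
        raise ValueError("fields must be given explicitly")
    for name in fields:
        if mappings[name]["type"] != "text":
            raise ValueError(f"field {name!r} is not a text field")

    # Rule 2: options that do not apply are rejected.
    unsupported = UNSUPPORTED_OPTIONS & options.keys()
    if unsupported:
        raise ValueError(f"unsupported options: {sorted(unsupported)}")

    # Rule 3: all fields must share one analyzer (no grouping by analyzer).
    analyzers = {mappings[name].get("analyzer", "standard") for name in fields}
    if len(analyzers) > 1:
        raise ValueError(f"fields use different analyzers: {sorted(analyzers)}")
```

A lenient variant, as mentioned for point 3, would replace the last check with a grouping step instead of raising an error.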

I believe this would still cover many use cases where users rely on cross_fields. But unfortunately, it makes it less of a 'direct replacement'. There might be additional research/work needed to understand what it would take to fix cross_fields or provide the right alternatives.

It’d be great to get your feedback on the above. Also a note that if point 3 proves too restrictive, we could consider a lenient option that allows text fields with different analyzers and performs a simple grouping.

jtibshirani commented 3 years ago

I merged #71213 which adds a new combined_fields query. I'll close out this issue, since we've completed the main integration of BM25F.
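For readers who want to try it, a request body for the combined_fields query might be built like this in Python. The field names and query text are made up for illustration, and the helper function is hypothetical; only the shape of the query body is meant to match the query DSL.

```python
import json

def combined_fields_query(text, fields, operator="or"):
    """Build a search request body using the combined_fields query.

    fields may carry boosts with the ^ syntax, e.g. "title^2".
    """
    return {
        "query": {
            "combined_fields": {
                "query": text,
                "fields": list(fields),
                "operator": operator,
            }
        }
    }

# Hypothetical fields on a hypothetical index of articles.
body = combined_fields_query(
    "database systems", ["title^2", "abstract", "body"], operator="and"
)
print(json.dumps(body, indent=2))
```

The resulting JSON can be sent to the _search endpoint of an index whose listed fields are all text fields sharing the same analyzer.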

The plan for next steps:

Gather user feedback about combined_fields. Identify any difficulties switching or functionality that's missing from combined_fields.
Based on this feedback, decide whether to deprecate and remove the cross_fields option.

elastic / elasticsearch

Better cross-field support #41106