Closed jimczi closed 3 years ago
Pinging @elastic/es-search
Would this preserve the small bias cross-fields gives towards the "right" field for a term? Rather than a complete flattening of DF across all fields, cross-fields gives the most-popular field the `blended DF` but all other fields `blended DF +1` to ensure they rank lower. I think that's a subtle but important feature to maintain.
> Would this preserve the small bias cross-fields gives towards the "right" field for a term?

No, the BM25FQuery can deboost any field, but there is no extra bias based on the inner frequency of each field. I am not sure why this would be an important feature to maintain; it seems arbitrary to me to pick the "right" field as the most-popular one. What is the benefit? I think that picking the boost per field should be done based on prior knowledge (title is more important than body) and/or automatic learning. We could also automatically deboost all fields except the most popular one, but I fail to see why this would be better.
> it seems arbitrary to me to pick the "right" field as the most-popular one. What is the benefit?

Searching for `Mark Harwood` across `firstname` and `lastname` fields should certainly favour any `firstname:Mark` over a `lastname:Mark`. Cross-fields was originally created because in these sorts of scenarios IDF would (annoyingly) ensure exactly the wrong field for a term was ranked highest.
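
To make the `firstname`/`lastname` example concrete, here is a minimal sketch in Python. The document frequencies are invented, the IDF is the usual BM25 form, and taking the maximum DF as the "blended" value is an assumption about how the blend is computed. It only illustrates the behaviour described in this thread and is not the actual cross-fields code.

```python
import math

def bm25_idf(df: int, doc_count: int) -> float:
    """BM25-style inverse document frequency: rarer terms score higher."""
    return math.log(1 + (doc_count - df + 0.5) / (df + 0.5))

def blended_dfs(field_dfs: dict) -> dict:
    """Give the most popular field the blended DF, and all others blended DF + 1.

    Assumption: the blended DF is the maximum DF across the fields.
    """
    blended = max(field_dfs.values())
    top_field = max(field_dfs, key=field_dfs.get)
    return {f: blended if f == top_field else blended + 1 for f in field_dfs}

# Hypothetical statistics for the term "mark" in an index of 10,000 people.
DOC_COUNT = 10_000
mark_dfs = {"firstname": 300, "lastname": 5}

# Plain per-field IDF: the rare (and semantically wrong) lastname match wins.
per_field_idf = {f: bm25_idf(df, DOC_COUNT) for f, df in mark_dfs.items()}

# Blended DF with the +1 penalty: the popular firstname match wins by a small margin.
blended_idf = {f: bm25_idf(df, DOC_COUNT) for f, df in blended_dfs(mark_dfs).items()}

print(per_field_idf)
print(blended_idf)
```

The margin between the two blended IDF values is deliberately tiny: it acts as a tie-breaker towards the popular field rather than as a real boost.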
I agree that in this case `lastname` should not be preferred because `Mark` is rare in this context, but I don't think we should automatically boost the `firstname`. With default BM25F, a document that contains the term in the `firstname` field would have the same score as a document that contains the same term in the `lastname` field. If we start applying an automatic boost on top of the default, we'll have trouble mixing it with the explicit boost that the user can set on individual fields.
> a document that contains the term in the firstname field would have the same score as a document that contains the same term in the lastname field.

That was the problem I was hoping to avoid.
> we'll have trouble mixing it with the explicit boost that the user can set on individual fields.

We very nearly didn't provide the facility for user-supplied boosts with cross-fields. My preference was to error if supplied. We need to think about the reasons for having different fields in the first place:
1) Summary/detail - this is the `title` vs `body` scenario. Terms do not change their meaning when moved from one field to the other.
2) Semantic context - this is the `firstname` vs `lastname` scenario. Terms do change their meaning when moved from one field to the other.
In scenario 1) it is appropriate for an elastic admin to say "boost field X over Y for all queries". In scenario 2 it is more appropriate for the terms in each user query to be considered by the cross-fields scoring on a case-by-case basis, determining what is the correct field context and biasing matches towards that. The only reason we kept support for any admin-supplied field boosts in cross-fields was where the admin was (ab)using cross-fields' flattening of IDF while tackling scenario 1). This muddied the waters in my view. For me cross-fields was always about scenario 2.
Ok, thanks for explaining. IMO BM25F is really about 1) and I wonder if 2) could be tackled differently. For instance, if you have 2 fields, `street` and `city`, a query for `oxford street` would boost `oxford` in the `city` field because it seems more popular. There are also counter-examples like yours, but I think we should decorrelate these use cases. The semantic context could use BM25F to match documents, but I feel like it needs to do more than just analyze the IDF of individual terms, while for 1) BM25F can be used directly.
I think "cross-fields" could perhaps have been better named as "correct field" - meaning it tries to determine which is the appropriate semantic context for each term and boosts accordingly.
Maybe this "correct field" technique should only apply to single-term fields for now. In your `street` vs `city` example the term `oxford` is further qualified by other terms in the field e.g. `road/rd/street/st`. If the field only has one term there's no additional context to consider and I suspect the existing approach works well. For multi-term fields I guess we could get into more complex attempts at query-understanding using shingles or phrase queries but that's perhaps for another day.
Here's an idea for a path forward:
1. [Extend] `BM25FQuery` to handle most similarities. This would allow us to handle all the built-in similarities, including our default `LegacyBM25Similarity`. (LUCENE-9725)
2. [Add a new `multi_match` mode] `combined_fields` that uses this query. If there's a built-in similarity as default, we can use it directly. If there are custom similarities or per-field similarities, fall back to the default `LegacyBM25Similarity`.
3. [Deprecate] `cross_fields` in favor of the other options.

This would already provide an alternative to `cross_fields` in most situations. Users can switch to it to avoid the broken `cross_fields` scoring.
An alternative would be to avoid a new option, and just modify `cross_fields` to use the new query whenever possible. This seemed like a worse direction, since we'd change the scoring strategy without alerting users. It also feels confusing that the same mode could use two different scoring models.
This sounds like a plan @jtibshirani :+1:.
Great, I'll work on this. We can consider LUCENE-8710 and LUCENE-8711 as a separate follow-up. Note I edited the plan above to remove a step that didn't make sense.
After working through an initial version and more discussion, I think this addition would fit better as a new query type `combined_field` instead of a new `multi_match` mode.
The reasoning is that `multi_match` is designed to provide a good, lenient default when searching over many fields. Like the `match` query it handles most field types, even non-text ones, and is fine if text fields do not share the same analyzer. It also helps power our `query_string` logic, which accepts queries and fields directly from the user and needs to be very flexible.
In contrast, BM25F is designed to combine scores across text fields. The field boosts have a very specific meaning: they must be at least 1.0f and represent a multiplier on term frequency in the combined synthetic field. As @jimczi noted in this comment, there is not a clear way to incorporate scoring contributions from non-text types. There are also several options on `multi_match` that don’t apply. By introducing BM25F as part of `multi_match` we may be trying to solve too many cases at once: principled text-field scoring, plus providing a lenient default for searching over all fields.
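
As a rough illustration of what "a multiplier on term frequency in the combined synthetic field" means, here is a small Python sketch. The function name, field names, and numbers are made up for the example, and scaling the field length by the same weight is an assumption about how the synthetic field is built. It is a simplification, not the Lucene implementation.

```python
def combine_fields(field_stats: dict, weights: dict) -> tuple:
    """Merge per-field (term_freq, field_length) pairs into one synthetic field.

    A weight of 2 behaves as if the field's text appeared twice in the
    combined field. Weights below 1.0 are rejected, matching the constraint
    stated above.
    """
    combined_tf = 0.0
    combined_len = 0.0
    for field, (tf, length) in field_stats.items():
        weight = weights.get(field, 1.0)  # unweighted fields count once
        if weight < 1.0:
            raise ValueError(f"weight for {field!r} must be at least 1.0, got {weight}")
        combined_tf += weight * tf
        combined_len += weight * length
    return combined_tf, combined_len

# Hypothetical document: the term occurs once in a 5-token title
# and 3 times in a 200-token body; the title is weighted 2x.
stats = {"title": (1, 5), "body": (3, 200)}
tf, length = combine_fields(stats, {"title": 2.0})
print(tf, length)
```

Because the weight acts on frequencies before saturation is applied, it is not interchangeable with a boost that multiplies a field's final score.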
The new `combined_field` query would have this behavior:
1. [Only accepts] `text` fields. Because of this requirement, the query would not fall back to `index.query.default_field`.
2. [Omits options that don't apply, such as] `tie_breaker`, `phrase_slop`, `fuzziness`.
3. [Requires all fields to share the same analyzer, unlike] `cross_fields` where we group fields by analyzer and combine their scores through `dis_max`. (Maybe we could improve this later, by recognizing when analyzers produce the same positions and allowing those fields to be searched together.)

I believe this would still cover many use cases where users rely on `cross_fields`. But unfortunately, it makes it less of a 'direct replacement'. There might be additional research/work to understand what it would take to fix `cross_fields` or provide the right alternatives.
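
The analyzer grouping that `cross_fields` performs can be sketched as follows. This is a simplified Python model with hypothetical field names and made-up group scores. The `dis_max` combination shown is the standard one: best score plus `tie_breaker` times the sum of the others.

```python
from collections import defaultdict

def group_by_analyzer(field_analyzers: dict) -> dict:
    """Bucket fields so that only fields sharing an analyzer are blended together."""
    groups = defaultdict(list)
    for field, analyzer in field_analyzers.items():
        groups[analyzer].append(field)
    return dict(groups)

def dis_max(scores: list, tie_breaker: float = 0.0) -> float:
    """Best group score plus tie_breaker times the sum of the remaining scores."""
    if not scores:
        return 0.0
    best = max(scores)
    return best + tie_breaker * (sum(scores) - best)

# Hypothetical mapping: two fields share the "english" analyzer, one does not.
fields = {"title": "english", "body": "english", "title.raw": "keyword"}
groups = group_by_analyzer(fields)

# Made-up scores for each analyzer group; a real query would compute these.
group_scores = {"english": 2.0, "keyword": 0.5}
final_score = dis_max(list(group_scores.values()), tie_breaker=0.25)
print(groups, final_score)
```

A query that requires one shared analyzer avoids the second step entirely, which is the restriction point 3 describes.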
It’d be great to get your feedback on the above. Also a note that if point 3 proves too restrictive, we could consider a `lenient` option that allows text fields with different analyzers and performs a simple grouping.
I merged #71213 which adds a new `combined_fields` query. I'll close out this issue, since we've completed the main integration of BM25F.
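
For readers who want to try the new query, here is a minimal sketch of a request body, built in Python. The field names and search text are hypothetical, and the helper function is just for illustration. Check the Elasticsearch reference for the parameters your version supports.

```python
import json

def combined_fields_query(text: str, fields: list, operator: str = "or") -> dict:
    """Build a search body using the combined_fields query.

    Field entries may carry a weight with the caret syntax, e.g. "title^2".
    """
    return {
        "query": {
            "combined_fields": {
                "query": text,
                "fields": fields,
                "operator": operator,
            }
        }
    }

# Hypothetical index with title, abstract and body text fields.
body = combined_fields_query(
    "database systems", ["title^2", "abstract", "body"], operator="and"
)
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a `_search` request.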
The plan for next steps:
1. [Encourage users to try] `combined_fields`. Identify any difficulties switching or functionality that's missing from `combined_fields`.
2. [Deprecate the] `cross_fields` option.

Lucene has a new query in the sandbox called BM25FQuery. It is similar to the BlendedTermQuery that we use for cross-fields search, but the new query also merges the document statistics (freq + norm) in a way that preserves the benefits of the BM25 formula (term frequency saturates quickly). There is some work to be done on the Lucene side to improve the integration of this query:
https://issues.apache.org/jira/browse/LUCENE-8710
https://issues.apache.org/jira/browse/LUCENE-8711
However, we could already replace the BlendedTermQuery with BM25F, since the main logic is in place and should already improve the ranking of documents when cross-fields mode is used in a query.
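
The saturation point can be seen in a small Python sketch. It compares summing independent per-field BM25 term-frequency components against merging the statistics first, BM25F style. The constants are the common BM25 defaults and the document statistics are invented. IDF is left out because it is the same on both sides, and summing the per-field average lengths is an assumption made to keep the example short.

```python
K1, B = 1.2, 0.75  # common BM25 defaults

def bm25_tf(tf: float, length: float, avg_length: float) -> float:
    """Saturating term-frequency component of BM25 (always below 1)."""
    norm = K1 * (1 - B + B * length / avg_length)
    return tf / (tf + norm)

def summed_per_field(field_stats: dict, avg_lengths: dict) -> float:
    """Score each field on its own and add the results: saturation is per field."""
    return sum(
        bm25_tf(tf, length, avg_lengths[field])
        for field, (tf, length) in field_stats.items()
    )

def bm25f(field_stats: dict, avg_lengths: dict) -> float:
    """Merge freq and length across fields first, then saturate once."""
    tf = sum(tf for tf, _ in field_stats.values())
    length = sum(length for _, length in field_stats.values())
    avg_length = sum(avg_lengths.values())
    return bm25_tf(tf, length, avg_length)

# Hypothetical document: the term occurs twice in the title and twice in the body.
avg = {"title": 10, "body": 100}
doc = {"title": (2, 10), "body": (2, 100)}
print(summed_per_field(doc, avg))  # can exceed 1: each field saturates separately
print(bm25f(doc, avg))             # stays below 1: one saturation curve overall

# With equal weights, it does not matter which field holds the term.
name_avg = {"firstname": 1, "lastname": 1}
in_first = bm25f({"firstname": (1, 1), "lastname": (0, 1)}, name_avg)
in_last = bm25f({"firstname": (0, 1), "lastname": (1, 1)}, name_avg)
```

The last two values are equal, which is the "same score" behaviour of default BM25F discussed earlier in the thread.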