Term centric minimum should match on multiple fields with different analyzers

gossu commented 2 years ago

Description

Currently the minimum should match parameter is interpreted differently by different query clauses:

on compound query clauses (e.g. bool) it constrains the number of child clauses that must be satisfied for a document to be returned (clause-centric)
on text query clauses (e.g. match) it constrains the number of terms that must match for a document to be returned (term-centric)
on multi-field query clauses (e.g. multi_match) it is either applied on a per-field basis or it can be term-centric (combined_fields), but only as long as the same search analyzer is used for all the fields queried

We are maintaining a search solution with well structured documents indexed. Each document consists of multiple searchable fields (e.g. category, shopName, productCode, brandName ...) and each field, due to the nature of data being stored there, is being treated with different index/search analysers. For relevance reasons we are also giving different weights/boosts to each field when querying them.

Our users perform multi-word searches quite frequently. We would like to increase search precision by establishing consistent term-centric minimum should match rules on our queries (e.g. every time users type 3 words, at least 2 of them must match).

It seems that Elasticsearch currently offers no way to apply a term-centric minimum should match across multiple fields that use different analysers (as in the title).

So far we have resorted to having a separate "catchAll" field that contains data from all other fields. This field is set up to use the most recall-friendly index/search analysers. We filter all our queries on this "catchAll" field and apply the minimum should match rules in that filter. Only documents that pass this filter are then handled by our "main" query (each field queried separately with different analysers and boosts).
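A minimal sketch of the request body this workaround produces, written as a Python helper so it is easy to experiment with. The field names are the ones mentioned above; the function name, the boost values and the minimum should match value are placeholders chosen for illustration, not our real settings.

```python
import json


def build_catch_all_query(user_input, field_boosts, minimum_should_match):
    """Build a bool query: catchAll filter plus boosted per-field clauses.

    build_catch_all_query is a hypothetical helper, not part of any client.
    """
    # Term-centric constraint, enforced only on the recall-friendly catchAll field.
    catch_all_filter = {
        "match": {
            "catchAll": {
                "query": user_input,
                "minimum_should_match": minimum_should_match,
            }
        }
    }
    # "Main" query: every field is queried separately, so each one keeps
    # its own search analyser and gets its own boost.
    per_field_clauses = [
        {"match": {field: {"query": user_input, "boost": boost}}}
        for field, boost in field_boosts.items()
    ]
    return {
        "query": {
            "bool": {
                "filter": [catch_all_filter],
                "must": [{"bool": {"should": per_field_clauses}}],
            }
        }
    }


# Illustrative boosts only.
example_boosts = {"category": 1.0, "shopName": 2.0, "productCode": 5.0, "brandName": 3.0}
body = build_catch_all_query("red running shoes", example_boosts, "2")
print(json.dumps(body, indent=2))
```

The filter clause does not contribute to scoring, so relevance is still driven entirely by the per-field clauses.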

This workaround, however, is not perfect. We are continuously adding new searchable fields or improving the way the existing ones are analysed. We must therefore sync these changes with the catchAll field analysers, to make sure they are never stricter than the union of all other analysers... in all cases. Ensuring this is probably impossible, so we must accept that our solution is "incorrect" for some queries.

I would like to see a query clause that is capable of

querying multiple fields
applying different boosts to each field
applying different search analysers to each field
applying term-centric minimum should match rules on the clause level (not per-field)

I understand this is a complex task, as each search analyser can emit a different number of tokens, so there is no universal way of telling how many tokens the query consists of. However, I think it should be possible to redefine minimum should match constraints in this case. Maybe they could be declared at the character level instead (e.g. 80% of user input characters should match)?

Example vision of how this could work:

Document:
FieldA: ["A", "B", "D", "X"]
FieldB: ["CD", "XY"]

User query: "ABCDE"
Minimum should match declared as "80%"

FieldA search tokenizer splits the query into ["A","B","C","D","E"], therefore ["A", "B", "D"] give a match.
FieldB search tokenizer splits the query into ["AB","BC","CD","DE"], therefore ["CD"] gives a match.

Only 3/5 FieldA tokens gave a match and only 1/4 FieldB tokens gave a match, but if we look at the union of the matched parts of the query string we see that the "ABCD" part produced matches in at least one of the fields queried. This is exactly 80% of the user query length ("ABCDE"), therefore we can declare the document as matched by the query.
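To make the idea concrete, here is a rough sketch of the character-coverage check in plain Python. It is not how Elasticsearch works today; the function names are made up, and it assumes each field's search analyser can be modelled as a character n-gram tokenizer that reports token offsets (n=1 for FieldA, n=2 for FieldB, as in the example above).

```python
def char_ngrams(text, n):
    """Return (token, start, end) for every n-character window of text."""
    return [(text[i:i + n], i, i + n) for i in range(len(text) - n + 1)]


def covered_positions(query, n, indexed_terms):
    """Character positions of the query covered by tokens found in one field."""
    covered = set()
    for token, start, end in char_ngrams(query, n):
        if token in indexed_terms:
            covered.update(range(start, end))
    return covered


def coverage_ratio(query, fields):
    """Share of query characters matched in at least one field.

    fields is a list of (ngram_size, indexed_terms) pairs, one per field.
    """
    if not query:
        return 0.0
    covered = set()
    for n, indexed_terms in fields:
        covered |= covered_positions(query, n, indexed_terms)
    return len(covered) / len(query)


def is_match(query, fields, min_ratio):
    """Apply the character-level minimum should match rule."""
    return coverage_ratio(query, fields) >= min_ratio


field_a = (1, {"A", "B", "D", "X"})  # single-character tokens
field_b = (2, {"CD", "XY"})          # two-character tokens

print(coverage_ratio("ABCDE", [field_a, field_b]))  # 0.8
print(is_match("ABCDE", [field_a, field_b], 0.8))   # True
print(is_match("ABCDE", [field_a], 0.8))            # False, FieldA alone covers 3 of 5 characters
```

The key point is that coverage is computed on the union of matched character positions across fields, so the differing token counts per analyser no longer matter.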

Does this sound like a reasonable idea? Is it something that could be added to Elasticsearch (as a new query clause? as an alternative minimum should match mode?)?