Open gossu opened 2 years ago
Pinging @elastic/es-ql (Team:QL)
I see this was flagged as QL/EQL issue, but it seems more a Search related issue. I'm changing the labels accordingly
Pinging @elastic/es-search (Team:Search)
Any updates on this? This is a challenge I struggle with quite often.
Is this being worked on or what kind of priority have you given the issue? Thanks
Pinging @elastic/es-search-relevance (Team:Search Relevance)
Description
Currently the minimum should match parameter is interpreted differently by different query clauses:
- `bool`: it applies constraints on the number of child clauses that need to be satisfied in order for a document to be returned (clause-centric)
- `match`: it applies constraints on the number of terms required to match in order for a document to be returned (term-centric)
- `multi_match`: the minimum should match is either applied on a per-field basis OR it can be term-centric, but only as long as we use the same search analyzer for all the fields queried (`combined_fields`)

We are maintaining a search solution with well-structured documents indexed. Each document consists of multiple searchable fields (e.g. `category`, `shopName`, `productCode`, `brandName`...) and each field, due to the nature of the data stored there, is treated with different index/search analysers. For relevance reasons we also give different weights/boosts to each field when querying them.
Our users perform multi-word searches quite frequently. We would like to be able to increase search precision by establishing consistent term-centric minimum should match rules on our queries (e.g. every time users type 3 words, at least 2 of them must match).
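
To make the difference concrete, here is a rough sketch of the interpretations listed above, written as Python dicts that mirror the request bodies. The field names come from our mapping; the query text, the boost, and the threshold of `2` are made up for illustration.

```python
# Hypothetical user input and threshold, for illustration only.
user_query = "red running shoes"
min_match = 2

# bool: minimum_should_match counts matching `should` clauses (clause-centric).
bool_query = {
    "bool": {
        "should": [
            {"match": {"category": user_query}},
            {"match": {"shopName": user_query}},
            {"match": {"brandName": user_query}},
        ],
        "minimum_should_match": min_match,
    }
}

# match: minimum_should_match counts matching terms produced by the
# search analyser of a single field (term-centric).
match_query = {
    "match": {
        "category": {"query": user_query, "minimum_should_match": min_match}
    }
}

# multi_match (default best_fields type): the constraint is applied
# to each field separately.
multi_match_query = {
    "multi_match": {
        "query": user_query,
        "fields": ["category^2", "shopName", "brandName"],
        "minimum_should_match": min_match,
    }
}

# combined_fields: term-centric across fields, but all fields must
# share the same search analyser.
combined_fields_query = {
    "combined_fields": {
        "query": user_query,
        "fields": ["category", "shopName", "brandName"],
        "minimum_should_match": min_match,
    }
}
```

None of these gives a term-centric count across fields that use different analysers, which is the case we have.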
It seems that it is currently not possible in Elasticsearch to get term-centric minimum should match behaviour across multiple fields with different analysers (as in the title).
So far we have resorted to having a separate "catchAll" field that contains data from all other fields. This field is then set up to use the most recall-friendly index/search analysers. We filter all our queries on this "catchAll" field and apply the minimum should match rules in that filter. Only documents that pass this filter are then handled by our "main" query (each field queried separately with different analysers and boosts).
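
A minimal sketch of that workaround, again as a Python dict mirroring the request body. The `catchAll` and per-field names are ours; the helper name `build_search_body`, the boost values, and the default threshold are hypothetical and only there to show the shape of the query.

```python
def build_search_body(user_query, min_should_match="2"):
    """Build a request body for the catchAll workaround.

    The filter enforces a term-centric minimum should match on the
    catchAll field; the should clauses query each field with its own
    analyser and a (hypothetical) boost.
    """
    return {
        "query": {
            "bool": {
                # Filter clauses do not affect scoring, they only
                # decide which documents are candidates.
                "filter": [
                    {
                        "match": {
                            "catchAll": {
                                "query": user_query,
                                "minimum_should_match": min_should_match,
                            }
                        }
                    }
                ],
                # The "main" query. Because a filter is present, these
                # clauses are optional by default and only contribute
                # to the score.
                "should": [
                    {"match": {"category": {"query": user_query, "boost": 3.0}}},
                    {"match": {"shopName": {"query": user_query, "boost": 2.0}}},
                    {"match": {"productCode": {"query": user_query, "boost": 5.0}}},
                    {"match": {"brandName": {"query": user_query, "boost": 2.0}}},
                ],
            }
        }
    }


body = build_search_body("red running shoes")
```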
This workaround, however, is not perfect. We are continuously adding new searchable fields or improving the way the existing ones are analysed. We must therefore keep the catchAll field analysers in sync with these changes, to make sure they are never stricter than the union of all other analysers. Ensuring this in all cases is probably impossible, so we must accept that our solution is "incorrect" for some queries.
I would like to see a query clause that is capable of
I understand this is a complex task: each search analyser can emit a different number of tokens, so there is no universal way of telling how many tokens the query consists of. However, I think it should be possible to redefine the minimum should match constraint for this case. Maybe it could be declared at the character level instead (e.g. 80% of the user input characters should match)?
Example vision of how this could work:
Document:
FieldA: ["A", "B", "D", "X"]
FieldB: ["CD", "XY"]
User query: "ABCDE"
Minimum should match declared as "80%"
FieldA search tokenizer splits the query into ["A","B","C","D","E"], therefore ["A", "B", "D"] give a match
FieldB search tokenizer splits the query into ["AB","BC","CD","DE"], therefore ["CD"] gives a match
Only 3/5 FieldA tokens gave a match and only 1/4 FieldB tokens gave a match, but if we look at the overlapping parts of the string that matched, we see that the "ABCD" part of the query produced matches in at least one of the fields queried. This is exactly 80% of the user query length ("ABCDE"), therefore we can declare the document as matched by the query.
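
The example above can be sketched in a few lines of Python. This is not an existing Elasticsearch feature, just an illustration of the idea: the function names (`ngrams`, `covered_chars`, `matches`) are hypothetical, and the two tokenizers are modelled as plain character 1-grams and 2-grams to match the token lists above.

```python
def ngrams(text, n):
    """Return (token, start, end) for every character n-gram of text."""
    return [(text[i:i + n], i, i + n) for i in range(len(text) - n + 1)]


def covered_chars(query, fields):
    """Count query characters covered by a matching token in any field.

    `fields` is a list of (tokenize, indexed_tokens) pairs, where
    tokenize(query) returns (token, start, end) triples.
    """
    covered = [False] * len(query)
    for tokenize, indexed_tokens in fields:
        for token, start, end in tokenize(query):
            if token in indexed_tokens:
                for i in range(start, end):
                    covered[i] = True
    return sum(covered)


def matches(query, fields, min_percent):
    """True if at least min_percent of the query characters are covered."""
    # Integer arithmetic avoids floating point edge cases at the threshold.
    return covered_chars(query, fields) * 100 >= min_percent * len(query)


# The document and tokenizers from the example above.
fields = [
    (lambda q: ngrams(q, 1), {"A", "B", "D", "X"}),  # FieldA
    (lambda q: ngrams(q, 2), {"CD", "XY"}),          # FieldB
]

print(covered_chars("ABCDE", fields))  # 4 ("A", "B", "C", "D")
print(matches("ABCDE", fields, 80))    # True
```

Character offsets are used here only because they give a unit that is comparable across analysers; whether that is practical inside a real query implementation is an open question.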
Does this sound like a reasonable idea? Is it something that could be added to Elasticsearch (as a new query clause? as an alternative minimum should match mode?)?