elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Term centric minimum should match on multiple fields with different analyzers #90168

Open gossu opened 2 years ago

gossu commented 2 years ago

Description

Currently the minimum should match parameter is interpreted differently by different query clauses:

We are maintaining a search solution with well structured documents indexed. Each document consists of multiple searchable fields (e.g. category, shopName, productCode, brandName ...) and each field, due to the nature of data being stored there, is being treated with different index/search analysers. For relevance reasons we are also giving different weights/boosts to each field when querying them.

Our users perform multi word searches quite frequently. We would like to be able to increase search precision by establishing consistent term-centric minimum should match rules on our queries (e.g. every time users type 3 words at least 2 of them must match)

It seems that Elasticsearch currently offers no way to apply a term-centric minimum should match across multiple fields that use different analysers (the behaviour described in the title).

So far we have resorted to having a separate "catchAll" field that contains data from all other fields. This field is then set up to use the most recall-friendly index/search analysers. We filter all our queries on this "catchAll" field and apply the minimum should match rules to that filter. Only documents that pass this filter are then handled by our "main" query (each field queried separately with different analysers and boosts).
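A minimal sketch of what such a workaround query could look like, written as a Python function that builds the request body. The `catchAll` field name comes from the description above; the function name `build_query`, the boost values and the `minimum_should_match` value of "2" are assumptions for illustration only, not the actual production query.

```python
from typing import Any, Dict


def build_query(user_input: str,
                field_boosts: Dict[str, float],
                min_should_match: str = "2") -> Dict[str, Any]:
    """Build a bool query: filter on catchAll, score on the individual fields.

    The filter clause decides which documents match at all (term-centric
    minimum should match on the single catchAll field). The should clauses
    only contribute to scoring, each against its own field and analyser.
    """
    return {
        "query": {
            "bool": {
                "filter": [
                    {
                        "match": {
                            "catchAll": {
                                "query": user_input,
                                # Illustrative value: at least 2 query terms must match.
                                "minimum_should_match": min_should_match,
                            }
                        }
                    }
                ],
                "should": [
                    {"match": {field: {"query": user_input, "boost": boost}}}
                    for field, boost in field_boosts.items()
                ],
            }
        }
    }


# Hypothetical boosts for some of the fields mentioned above.
body = build_query(
    "red running shoes",
    {"category": 1.0, "shopName": 1.5, "productCode": 4.0, "brandName": 3.0},
)
```

The weak point is visible here: the match decision depends only on how `catchAll` is analysed, so it has to stay at least as permissive as every field listed in the should clauses.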

This workaround, however, is not perfect. We are continuously adding new searchable fields or improving the way the existing ones are analysed. We must therefore sync these changes with the catchAll field analysers, to make sure they are never stricter than the union of all other analysers. Ensuring this in all cases is probably impossible, so we must accept that our solution is "incorrect" for some queries.

I would like to see a query clause that is capable of

I understand this is a complex task, as each search analyser can emit a different number of tokens, so there is no universal way of telling how many tokens the query consists of. However, I think it should be possible to redefine minimum should match constraints in this case. Maybe they could be declared at the character level instead (e.g. 80% of user input characters should match)?

Example vision of how this could work:

Document:
FieldA: ["A", "B", "D", "X"]
FieldB: ["CD", "XY"]

User query: "ABCDE"
Minimum should match declared as "80%"

FieldA search tokenizer splits the query into ["A","B","C","D","E"], therefore ["A", "B", "D"] give a match.
FieldB search tokenizer splits the query into ["AB","BC","CD","DE"], therefore ["CD"] gives a match.

Only 3/5 FieldA tokens gave a match and only 1/4 FieldB tokens gave a match, but if we look at the overlapping part of the string that matched, we see that the "ABCD" part of the query produced matches in at least one of the fields queried. This is exactly 80% of the user query length ("ABCDE"), therefore we can declare the document as matched by the query.
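To make the idea concrete, here is a rough sketch of the character-coverage check in plain Python. It is not an Elasticsearch API: the helper names (`ngrams`, `covered_fraction`, `matches`) are hypothetical, and the two tokenizers are simplified to unigram and bigram splitting that keep character offsets, which is an assumption standing in for real analysers.

```python
from typing import Callable, List, Set, Tuple

# A query token with its character offsets: (text, start, end_exclusive).
Token = Tuple[str, int, int]
# A field is modelled as (search tokenizer, set of indexed terms).
Field = Tuple[Callable[[str], List[Token]], Set[str]]


def ngrams(text: str, n: int) -> List[Token]:
    """Split text into overlapping character n-grams, keeping offsets."""
    return [(text[i:i + n], i, i + n) for i in range(len(text) - n + 1)]


def covered_fraction(query: str, fields: List[Field]) -> float:
    """Fraction of query characters covered by a matching token in any field."""
    covered: Set[int] = set()
    for tokenize, indexed_terms in fields:
        for term, start, end in tokenize(query):
            if term in indexed_terms:
                covered.update(range(start, end))
    return len(covered) / len(query) if query else 0.0


def matches(query: str, fields: List[Field], minimum: float = 0.8) -> bool:
    """Character-level minimum should match across all fields."""
    return covered_fraction(query, fields) >= minimum


# The document from the example above.
fields: List[Field] = [
    (lambda q: ngrams(q, 1), {"A", "B", "D", "X"}),  # FieldA
    (lambda q: ngrams(q, 2), {"CD", "XY"}),          # FieldB
]

print(covered_fraction("ABCDE", fields))  # 0.8 ("ABCD" is covered, "E" is not)
print(matches("ABCDE", fields))           # True
```

FieldA covers the characters at positions 0, 1 and 3, FieldB covers positions 2 and 3, so together they cover 4 of 5 characters even though neither field reaches 80% on its own.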

Does this sound like a reasonable idea? Is it something that could be added to Elasticsearch (as a new query clause? as an alternative minimum should match mode?)?

elasticsearchmachine commented 2 years ago

Pinging @elastic/es-ql (Team:QL)

luigidellaquila commented 2 years ago

I see this was flagged as a QL/EQL issue, but it seems more like a Search-related issue. I'm changing the labels accordingly.

elasticsearchmachine commented 2 years ago

Pinging @elastic/es-search (Team:Search)

foxstarius commented 1 year ago

Any updates on this? This is a challenge I struggle with quite often.

KennyLindahl commented 1 year ago

Is this being worked on or what kind of priority have you given the issue? Thanks

elasticsearchmachine commented 3 months ago

Pinging @elastic/es-search-relevance (Team:Search Relevance)