elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Fuzziness works incorrectly when using boolean similarity #75652

Open yassenb opened 3 years ago

yassenb commented 3 years ago

Elasticsearch version (bin/elasticsearch --version): 7.13.2

Plugins installed: []

JVM version (java -version): "16" 2021-03-16

OS version (uname -a if on a Unix-like system): Linux 7cf7d004f550 5.11.0-22-generic #23-Ubuntu SMP Thu Jun 17 00:34:23 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

Fuzziness breaks boolean similarity scoring. Without fuzziness, a document scores 1 when exactly one of its terms matches the query term. With fuzziness enabled, an exact match of one term still scores 1 and a fuzzy match of one term scores below 1, which is as expected. However, when a document contains two terms and both match after the fuzzy query is expanded, their scores are summed, so the document ranks above a document with a single term that matches exactly.

The exact match (no typo corrected by fuzziness) should always rank higher. Under boolean similarity, the score should be the best score among the rewritten terms, not the sum of the scores of all rewritten terms. In the example below, when querying for "euston", the exact match "euston" should score above "boston selston".
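The difference between summing and taking the best score can be sketched with a toy model. This is not Elasticsearch's or Lucene's code: the helper names (`edit_distance`, `term_weight`, `score`) are hypothetical, the weight formula `1 - distance / min(length)` is an assumption chosen to mimic an edit-distance-based boost, and plain Levenshtein distance is used without transpositions.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def term_weight(query_term: str, doc_term: str, max_edits: int = 2) -> float:
    """Assumed weight: 1 for an exact match, lower for each edit, 0 if too far."""
    distance = edit_distance(query_term, doc_term)
    if distance > max_edits:
        return 0.0
    return 1.0 - distance / min(len(query_term), len(doc_term))


def score(query_term: str, doc_text: str, combine) -> float:
    """Combine the weights of all matching document terms with `combine`."""
    weights = [term_weight(query_term, term) for term in doc_text.split()]
    matched = [w for w in weights if w > 0]
    return combine(matched) if matched else 0.0


for doc in ("euston", "boston selston"):
    # sum models the reported behavior, max models the requested behavior
    print(doc, score("euston", doc, sum), score("euston", doc, max))
```

In this model, summing puts "boston selston" above 1 (two fuzzy matches at distance 2 each), while taking the maximum keeps it below the exact match, which is the ordering requested above.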

Steps to reproduce:

PUT /locations
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "similarity": "boolean",
        "norms": false,
        "index_options": "docs"
      }
    }
  }
}

POST /locations/_doc
{
  "name": "euston"
}

POST /locations/_doc
{
  "name": "boston selston"
}

GET /locations/_search
{
  "query": {
    "match": {
      "name": {
        "query": "euston",
        "operator": "and",
        "fuzziness": 2,
        "max_expansions": 10000
      }
    }
  }
}

elasticmachine commented 3 years ago

Pinging @elastic/es-search (Team:Search)

benwtrent commented 1 month ago

I am not sure that a single exact match should score higher than a fuzzy match on two terms.

The relevance ordering described here is a matter of opinion, and I think the current behavior is appropriate, especially since one can boost an exact match by combining the fuzzy query with a bool query that includes an exact match (a should clause that boosts significantly on an exact match).
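A minimal sketch of that workaround, written as a Python helper that builds the request body for `GET /locations/_search`. The function name `fuzzy_with_exact_boost` and the default boost of 10 are assumptions for illustration; the fuzzy clause reuses the parameters from the reproduction steps, and the exact clause is the same match query without `fuzziness`.

```python
import json


def fuzzy_with_exact_boost(field: str, text: str, exact_boost: float = 10.0) -> dict:
    """Build a bool query: fuzzy match is required, exact match adds a boost."""
    return {
        "query": {
            "bool": {
                # Every hit must satisfy the fuzzy match from the original report.
                "must": [
                    {
                        "match": {
                            field: {
                                "query": text,
                                "operator": "and",
                                "fuzziness": 2,
                                "max_expansions": 10000,
                            }
                        }
                    }
                ],
                # Optional clause: documents that also match exactly score higher.
                "should": [
                    {
                        "match": {
                            field: {
                                "query": text,
                                "operator": "and",
                                "boost": exact_boost,  # assumed value, tune as needed
                            }
                        }
                    }
                ],
            }
        }
    }


print(json.dumps(fuzzy_with_exact_boost("name", "euston"), indent=2))
```

With a boost larger than the combined fuzzy scores, the document "euston" should rank above "boston selston", because only the former satisfies the should clause.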

elasticsearchmachine commented 1 month ago

Pinging @elastic/es-search-relevance (Team:Search Relevance)