elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

multi_match query different scoring results 1.7 - 2.1 #16535

Closed mfussenegger closed 8 years ago

mfussenegger commented 8 years ago

We're in the process of upgrading to ES 2.1 and have noticed that some queries now return different results. I'm trying to figure out the root cause. So far my guess is that it is an analyzer change within Lucene.

Here is the mapping I'm using:

{
    "settings": {
        "number_of_shards": 2
    },
    "mappings": {
        "default": {
            "dynamic": "true",
            "_all": {
                "enabled": false
            },
            "properties": {
                "description": {
                    "type": "string",
                    "index": "not_analyzed",
                    "doc_values": true,
                    "copy_to": [
                        "name_description_ft"
                    ]
                },
                "id": {
                    "type": "string",
                    "index": "not_analyzed",
                    "doc_values": true
                },
                "kind": {
                    "type": "string",
                    "index": "not_analyzed",
                    "doc_values": true
                },
                "name": {
                    "type": "string",
                    "index": "not_analyzed",
                    "doc_values": true,
                    "copy_to": [
                        "name_description_ft"
                    ]
                },
                "name_description_ft": {
                    "type": "string",
                    "analyzer": "english"
                }
            }
        }
    }
}

Here are the records:

{"index": {"_id": "1"}}
{"id":"1","name":"North West Ripple","kind":"Galaxy","description":"Relative to life on NowWhat, living on an affluent world in the North West ripple of the Galaxy is said to be easier by a factor of about seventeen million."}
{"index": {"_id": "2"}}
{"id":"2","name":"Outer Eastern Rim","kind":"Galaxy","description":"The Outer Eastern Rim of the Galaxy where the Guide has supplanted the Encyclopedia Galactica among its more relaxed civilisations."}
{"index": {"_id": "3"}}
{"id":"3","name":"Galactic Sector QQ7 Active J Gamma","kind":"Galaxy","description":"Galactic Sector QQ7 Active J Gamma contains the Sun Zarss, the planet Preliumtarn of the famed Sevorbeupstry and Quentulus Quazgar Mountains."}
{"index": {"_id": "4"}}
{"id":"4","name":"Aldebaran","kind":"Star System","description":"Max Quordlepleen claims that the only thing left after the end of the Universe will be the sweets trolley and a fine selection of Aldebaran liqueurs."}
{"index": {"_id": "5"}}
{"id":"5","name":"Algol","kind":"Star System","description":"Algol is the home of the Algolian Suntiger, the tooth of which is one of the ingredients of the Pan Galactic Gargle Blaster."}
{"index": {"_id": "6"}}
{"id":"6","name":"Alpha Centauri","kind":"Star System","description":"4.1 light-years northwest of earth"}
{"index": {"_id": "7"}}
{"id":"7","name":"Altair","kind":"Star System","description":"The Altairian dollar is one of three freely convertible currencies in the galaxy, though by the time of the novels it had apparently recently collapsed."}
{"index": {"_id": "8"}}
{"id":"8","name":"Allosimanius Syneca","kind":"Planet","description":"Allosimanius Syneca is a planet noted for ice, snow, mind-hurtling beauty and stunning cold."}
{"index": {"_id": "9"}}
{"id":"9","name":"Argabuthon","kind":"Planet","description":"It is also the home of Prak, a man placed into solitary confinement after an overdose of truth drug caused him to tell the Truth in its absolute and final form, causing anyone to hear it to go insane."}
{"index": {"_id": "10"}}
{"id":"10","name":"Arkintoofle Minor","kind":"Planet","description":"Motivated by the fact that the only thing in the Universe that travels faster than light is bad news, the Hingefreel people native to Arkintoofle Minor constructed a starship driven by bad news."}
{"index": {"_id": "11"}}
{"id":"11","name":"Bartledan","kind":"Planet","description":"An Earthlike planet on which Arthur Dent lived for a short time, Bartledan is inhabited by Bartledanians, a race that appears human but only physically."}
{"index": {"_id": "12"}}
{"id":"12","name":"","kind":"Planet","description":"This Planet doesn't really exist"}
{"index": {"_id": "13"}}
{"id":"13","name":null,"kind":"Galaxy","description":"The end of the Galaxy.%"}

And this is the query:

{
    "explain": true,
    "query": {
        "multi_match": {
            "fields": ["kind^0.8", "name_description_ft^0.6"],
            "query": "planet earth"
        }
    }
}

In 1.7 the top 2 hits are

            {
                "_id": "6",
                "_score": 0.22184466,
                "_source": {
                    "id": "6",
                    "name": "Alpha Centauri",
                    "kind": "Star System",
                    "description": "4.1 light-years northwest of earth"
                }
            },
            {
                "_id": "12",
                "_score": 0.21719791,
                "_source": {
                    "id": "12",
                    "name": "",
                    "kind": "Planet",
                    "description": "This Planet doesn't really exist"
                }
            },

and in 2.1 they are:

             {
                "_id": "6",
                "_score": 0.2600391,
                "_source": {
                    "id": "6",
                    "name": "Alpha Centauri",
                    "kind": "Star System",
                    "description": "4.1 light-years northwest of earth"
                }
            },
            {
                "_id": "11",
                "_score": 0.15168947,
                "_source": {
                    "id": "11",
                    "name": "Bartledan",
                    "kind": "Planet",
                    "description": "An Earthlike planet on which Arthur Dent lived for a short time, Bartledan is inhabited by Bartledanians, a race that appears human but only physically."
                }
            },

I've also tried taking a snapshot on 1.7 and restoring it in 2.1. After the restore, 2.1 produces the same results as 1.7, which is why I assume that something is now handled differently at indexing time.

I've also checked whether the _analyze API returns a different result somewhere, but all descriptions are tokenized the same in 1.7 and 2.1. The only difference is that positions start from 0 in one version and from 1 in the other.

Maybe I'm missing something obvious here.

rmuir commented 8 years ago

"settings": { "number_of_shards": 2 },

This is enough to do it. If you want consistent scoring, you need to enable distributed term statistics, otherwise the IDF values used are based on local information, not global.

See https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-search-type.html
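
To illustrate the point about local versus global IDF, here is a minimal sketch. It assumes the classic Lucene TF-IDF formula, idf = 1 + ln(numDocs / (docFreq + 1)), which as far as I know is the default similarity in the Lucene versions behind ES 1.7 and 2.1. The per-shard document counts and document frequencies are made up for illustration and are not taken from the index in this issue.

```python
import math


def idf(num_docs, doc_freq):
    # Classic Lucene TF-IDF style IDF (assumed here): 1 + ln(numDocs / (docFreq + 1))
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical split of a 13-document index over 2 shards.
# doc_freq is the number of documents on that shard containing the term.
shards = {
    "shard0": {"num_docs": 7, "doc_freq": 1},
    "shard1": {"num_docs": 6, "doc_freq": 3},
}

# Local IDF: each shard only sees its own statistics.
local_idf = {name: idf(s["num_docs"], s["doc_freq"]) for name, s in shards.items()}

# Global IDF: statistics are summed over all shards first.
total_docs = sum(s["num_docs"] for s in shards.values())
total_doc_freq = sum(s["doc_freq"] for s in shards.values())
global_idf = idf(total_docs, total_doc_freq)

for name, value in local_idf.items():
    print(name, round(value, 4))
print("global", round(global_idf, 4))
```

With local statistics, the same term gets a different weight depending on which shard a document lives on. Running the search with `search_type=dfs_query_then_fetch`, described on the page linked above, makes the shards exchange term statistics before scoring, so every shard uses the same values.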

mfussenegger commented 8 years ago

But shouldn't the local IDF values be deterministic if _id isn't randomly generated? The shard allocation should be deterministic which should result in the shards/lucene-indices having the same documents.

You're right in that if I change number_of_shards to 1 I get the same results in both 1.7 and 2.1 - could it be that the routing allocation algorithm changed?

mfussenegger commented 8 years ago

Okay, thanks for the pointer in the right direction. Murmur3 is now used as the routing hash function; it's even documented in the mapping changes. Because of that, the documents are distributed across the shards differently than before, which changes the local term statistics and therefore the scoring.
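
For anyone landing here later, a rough sketch of why a different routing hash changes the shard layout even with fixed ids. It assumes routing of the form `hash(_id) % number_of_shards`. The `djb2` function is a textbook djb2 written from memory, and `stand_in_hash` uses CRC32 purely as a placeholder for a second hash function; it is not Murmur3, and neither function is guaranteed to reproduce what Elasticsearch actually computes.

```python
import zlib


def djb2(value):
    # Textbook djb2 string hash, kept within 32 bits (assumption, not ES source code).
    h = 5381
    for ch in value:
        h = (h * 33 + ord(ch)) & 0xFFFFFFFF
    return h


def stand_in_hash(value):
    # Placeholder for "some other hash function"; CRC32, NOT Murmur3.
    return zlib.crc32(value.encode("utf-8"))


def assign(ids, hash_fn, num_shards):
    # Route each id to a shard with hash(id) % num_shards.
    layout = {n: [] for n in range(num_shards)}
    for doc_id in ids:
        layout[hash_fn(doc_id) % num_shards].append(doc_id)
    return layout


ids = [str(i) for i in range(1, 14)]  # the 13 document ids from the bulk request
old_layout = assign(ids, djb2, 2)
new_layout = assign(ids, stand_in_hash, 2)

print("djb2:    ", old_layout)
print("stand-in:", new_layout)
```

Both layouts are fully deterministic, but nothing forces them to agree with each other. Each shard computes its IDF from its own documents, so a different layout is enough to change the scores.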