elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Results returned in wrong order #6639

Closed qris closed 10 years ago

qris commented 10 years ago

I noticed while writing tests that ordering by most fields works properly, but ordering by the grantee field does not. It returns the results in the same order regardless of whether "ascending" or "descending" is selected.

I've attached a script which reproduces the issue:

chris@lap-x201:~/aptivate/2014/indigodata/360giving-demos/src/scripts$ ./elasticsearch_bug.sh 
{"acknowledged":true}
{"took":57,"errors":false,"items":[{"index":{"_index":"test_360giving","_type":"modelresult","_id":"data.activity.5","_version":1,"status":201}}]}
{"took":7,"errors":false,"items":[{"index":{"_index":"test_360giving","_type":"modelresult","_id":"data.activity.6","_version":1,"status":201}}]}
data.activity.5 data.activity.6 
data.activity.5 data.activity.6 

The very last line should say "data.activity.6 data.activity.5", because it requested the results in the opposite order compared to the previous line.

If I make the data sufficiently different (e.g. change the grantee of the second record to "H" or "F") then it works as expected.

Here is my test script:

#!/bin/bash

server=http://localhost:9200
index="test_$RANDOM"
type=modelresult

set -e

do_curl() {
    method=$1
    shift
    curl -s -X$method $server/"$@"
    echo ''
}

do_curl DELETE "$index" || true
do_curl PUT "$index"
do_curl PUT "$index/$type/mapping" --data-binary @- <<EOF
{"$type": {"_boost": {"name": "boost", "null_value": 1.0}, "properties": {"grantee": {"index": "not_analyzed", "term_vector": "with_positions_offsets", "type": "string", "analyzer": "snowball", "boost": 1.0, "store": "yes"}}}}
EOF

do_curl POST "_bulk?refresh=true" --data-binary @- <<EOF
{"index": {"_type": "$type", "_id": "data.activity.5", "_index": "$index"}}
{"django_ct": "data.activity", "grantee": "Grantee 1"}
EOF

do_curl POST "_bulk?refresh=true" --data-binary @- <<EOF
{"index": {"_type": "$type", "_id": "data.activity.6", "_index": "$index"}}
{"django_ct": "data.activity", "grantee": "Grantee 2"}
EOF

# Note: the bug is that you get [data.activity.5, data.activity.6]
# regardless of the specified sort order, as shown below. If you make
# the records sufficiently different (e.g. change the grantee of the
# second record to "H" or "F") then it works.

do_curl GET "$index/$type/_search" --data-binary '{"sort": [{"grantee": {"order": "asc"}}], "query": {"filtered": {"filter": {"fquery": {"query": {"query_string": {"query": "*"}}}}}}}' | perl -ne 'while (s/"_id":"([^"]+)"//) { print "$1 " }'; echo
do_curl GET "$index/$type/_search" --data-binary '{"sort": [{"grantee": {"order": "desc"}}], "query": {"filtered": {"filter": {"fquery": {"query": {"query_string": {"query": "*"}}}}}}}' | perl -ne 'while (s/"_id":"([^"]+)"//) { print "$1 " }'; echo
do_curl DELETE "$index"

dadoonet commented 10 years ago

The problem I can see here is that your field grantee is analyzed by default.

I don't think your test case describes an actual issue. You should try the same script but with a mapping which sets your grantee field as not_analyzed.

See also: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-sort.html#_sort_mode_option
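
To illustrate why an analyzed field can come back in the same order for both "asc" and "desc", here is a rough sketch in plain bash, with no Elasticsearch involved. It assumes the default analyzer lowercases the value and splits it into tokens, and that sorting on a field with several tokens uses the smallest token for ascending order and the largest for descending order, as described on the sort mode page linked above. The helper names tokens, asc_key and desc_key are made up for this illustration.

```shell
#!/bin/bash
# Rough stand-in for the default analyzer (an assumption, not the real thing):
# lowercase the value and split it on anything that is not a letter or digit.
tokens() {
    printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | tr -cs '[:alnum:]' '\n' | sed '/^$/d'
}

# Key for an ascending sort: the smallest token of the value.
asc_key() { tokens "$1" | LC_ALL=C sort | head -n 1; }

# Key for a descending sort: the largest token of the value.
desc_key() { tokens "$1" | LC_ALL=C sort | tail -n 1; }

for value in "Grantee 1" "Grantee 2"; do
    echo "$value: asc key=$(asc_key "$value") desc key=$(desc_key "$value")"
done
```

Under these assumptions both values get the descending key "grantee", so a descending sort sees a tie and the order of the two documents is not determined by the grantee field at all.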

Closing. Feel free to reopen if you think it's an issue.

qris commented 10 years ago

OK, I modified the script to create a new index each time and to set grantee to not_analyzed as shown above. Can I reopen the issue?

It returns the second result in the wrong order about 50% of the time, so I sometimes need to rerun the script several times to demonstrate the bug.

qris commented 10 years ago

Also, the mapping documentation says:

By default, there isn’t a need to define an explicit mapping, since one is automatically created and registered when a new type or new field is introduced (with no performance overhead) and have sensible defaults. Only when the defaults need to be overridden must a mapping definition be provided.

Does "sensible defaults" really include "not reliably sortable"? That would be an interesting definition of "sensible" :)

s1monw commented 10 years ago

I added a test above but I can't reproduce the issue. Can you tell us which version you are using?

qris commented 10 years ago

I reproduced it by installing Ubuntu 12.04.3 (i386) from the live CD, followed by these commands:

sudo apt-get install curl openjdk-7-jre
sudo dpkg -i elasticsearch-1.2.1.deb
sudo /etc/init.d/elasticsearch start
cat > ./elasticsearch_bug.sh <<EOF ... (pasted in the script above)
chmod a+x ./elasticsearch_bug.sh
./elasticsearch_bug.sh
./elasticsearch_bug.sh

The second time I ran the script, I got the behaviour described above:

data.activity.5 data.activity.6 
data.activity.5 data.activity.6 

Please could you try to reproduce it this way?

qris commented 10 years ago

Please could someone reopen this issue? I think there is a real bug here, as I've been able to reproduce it on a clean system.

dakrone commented 10 years ago

@qris there is a typo in your script:

do_curl PUT "$index/$type/mapping" --data-binary @- <<EOF

should be:

do_curl PUT "$index/$type/_mapping" --data-binary @- <<EOF

As a result, your mapping is not being applied (instead you're indexing a document with the id of "mapping"). If I correct this the sorting works correctly.
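
A quick way to catch this kind of mistake is to read the mapping back after the PUT and confirm that the not_analyzed setting is actually there. The following is a minimal sketch, assuming the index mapping can be fetched with GET on "$index/_mapping"; the function names has_not_analyzed and check_mapping are hypothetical helpers, not part of the original script.

```shell
#!/bin/bash
# Succeeds if the mapping JSON on stdin declares a not_analyzed field.
# This is a plain text match, not a JSON parse, so treat it as a smoke test.
has_not_analyzed() {
    grep -Eq '"index"[[:space:]]*:[[:space:]]*"not_analyzed"'
}

# Fetch the index mapping and report whether the custom mapping took effect.
# $1 = server URL, $2 = index name (both supplied by the caller).
check_mapping() {
    if curl -s -XGET "$1/$2/_mapping" | has_not_analyzed; then
        echo "mapping applied"
    else
        echo "mapping NOT applied"
        return 1
    fi
}
```

In the test script this could be called as check_mapping "$server" "$index" right after the mapping PUT. With set -e in effect, a missing mapping would then stop the script before any documents are indexed.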

qris commented 10 years ago

Thanks @dakrone, you were right, fixing that made it work and helped me to find the problem in my code!

I would note that this behaviour is really unintuitive. I think it would be better to fail to sort on an analyzed field rather than pretend to do so, and return incorrect results.

But even better would be to make it work as expected. Since the original field value is usually stored, why can't we sort on it? Is it not indexed? Could it be?