CottageLabs / idfind

An identifier identifier

elasticsearch returning less results if sorting is applied #31

Open emanuil-tolev opened 12 years ago

emanuil-tolev commented 12 years ago
tests = idfind.dao.Test.query(sort=[{"name":"asc"}]) # get all the tests, sorted by name
tests = idfind.dao.Test.query() # get all the tests, unsorted

The tests in the index are named "numbas", "worse numbas" and "nepra". The first line returns "nepra", "numbas". The second line returns "worse numbas", "numbas", "nepra".

It seems to be skipping the name with the space in it if sorting is applied.

Any suggestions? I'm not familiar enough with ES internals, so if it's something to do with how the stuff is mapped in the index to aid searching, it'd be easier if I just learned it from a comment instead of hunting for a possible cause.

emanuil-tolev commented 12 years ago

Um, OK, elasticsearch can't sort by name with the default configuration. A name containing a space is recorded as having 2 values/tokens (so that a search for one of the words in the name will return that doc from the index), and that is exactly what stops ES from sorting on the field.
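To illustrate the point about tokens, here is a rough sketch in Python. The `whitespace_tokens` helper is hypothetical: it only lowercases and splits on whitespace, so it is a simplified stand-in and not Elasticsearch's actual analyzer. It is just enough to show why one of the three names ends up with more than one token.

```python
def whitespace_tokens(value):
    """Simplified stand-in for an analyzer: lowercase, then split on whitespace."""
    return value.lower().split()


# The three test names mentioned in this issue.
names = ["numbas", "worse numbas", "nepra"]

# Names that produce more than one token cannot be sorted on as a single string.
unsortable = [name for name in names if len(whitespace_tokens(name)) > 1]

for name in names:
    print(name, "->", whitespace_tokens(name))
print("more than one token:", unsortable)
```

Only the name with a space in it produces two tokens, which matches the document that went missing from the sorted results.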

The important bit from the exception below (or at least the bit I think is important) is: Caused by: java.io.IOException: Can't sort on string types with more than one value per doc, or more than one token per field

This is the full exception ES throws up:

[2012-05-07 21:33:43,099][DEBUG][action.search.type ] [Norns] [idfind][1], node[PBJKZ6Q_RNmtxa3SrCr06w], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@224002]
org.elasticsearch.search.query.QueryPhaseExecutionException: [idfind][1]: query[ConstantScore(NotDeleted(cache(_type:test)))],from[0],size[10],sort[<custom:"name": org.elasticsearch.index.field.data.strings.StringFieldDataType$1@448b7f>]: Query Failed [Failed to execute main query]
    at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:198)
    at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:234)
    at org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteQuery(SearchServiceTransportAction.java:140)
    at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.sendExecuteFirstPhase(TransportSearchQueryThenFetchAction.java:80)
    at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.performFirstPhase(TransportSearchTypeAction.java:204)
    at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.performFirstPhase(TransportSearchTypeAction.java:191)
    at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$2.run(TransportSearchTypeAction.java:177)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.io.IOException: Can't sort on string types with more than one value per doc, or more than one token per field
    at org.elasticsearch.index.field.data.strings.StringOrdValFieldDataComparator.setNextReader(StringOrdValFieldDataComparator.java:123)
    at org.apache.lucene.search.TopFieldCollector$OneComparatorNonScoringCollector.setNextReader(TopFieldCollector.java:95)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:576)
    at org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:195)
    at org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:149)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:487)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:400)
    at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:192)
    ... 9 more
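One workaround that might apply here, sketched under assumptions and not confirmed anywhere in this thread: keep the analyzed `name` field for searching and add a second, unanalyzed copy to sort on. The mapping below assumes the string-field syntax of Elasticsearch releases from around the time of this issue (`multi_field`, `not_analyzed`); newer releases use different mapping types. The sub-field name `untouched` is a hypothetical choice, and existing documents would most likely need re-indexing before the new sub-field is populated.

```python
import json

# Assumption: old-style string mapping syntax; "untouched" is a hypothetical
# sub-field name chosen for this sketch.
name_mapping = {
    "test": {
        "properties": {
            "name": {
                "type": "multi_field",
                "fields": {
                    # Analyzed copy: still searchable by individual words.
                    "name": {"type": "string", "index": "analyzed"},
                    # Unanalyzed copy: one value per doc, so it can be sorted on.
                    "untouched": {"type": "string", "index": "not_analyzed"},
                },
            }
        }
    }
}

# The sort would then target the unanalyzed sub-field instead of "name".
sorted_query = {
    "sort": [{"name.untouched": "asc"}],
    "query": {"match_all": {}},
}

print(json.dumps(name_mapping, indent=2))
print(json.dumps(sorted_query, indent=2))
```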

emanuil-tolev commented 12 years ago

And here is the actual query JSON, so you can test manually without using the python code at all. HTTP POST it to your elasticsearch instance, putting the JSON below in the body of the request:

{
    "sort" : [
        { "name" : "asc" }
    ],
    "query" : {
        "match_all" : { }
    }
}
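For anyone who would rather script the manual test, here is a minimal sketch using only the Python standard library. It assumes Elasticsearch is listening on `localhost:9200` and that the `idfind` index and `test` type (taken from the results in this issue) are addressed in the URL path, as older releases allow. `build_search_request` and `run_search` are hypothetical helper names, not part of idfind.

```python
import json
import urllib.request

# Assumption: Elasticsearch is reachable on its default port on this machine.
ES_URL = "http://localhost:9200"

# The same query as above, as a Python dict.
query = {
    "sort": [{"name": "asc"}],
    "query": {"match_all": {}},
}


def build_search_request(index, doc_type, body, base_url=ES_URL):
    """Build (but do not send) a POST request for the _search endpoint."""
    url = "{0}/{1}/{2}/_search".format(base_url, index, doc_type)
    data = json.dumps(body).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(url, data=data, headers=headers, method="POST")


def run_search(request):
    """Send the request and decode the JSON response (needs a running ES)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_search_request("idfind", "test", query)
print(request.get_method(), request.full_url)
```

Calling `run_search(request)` against a live instance should return the response as a dict, including the `_shards` section that reports any failures.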

And the result that query gave me (explaining why 1 of my tests was missing from the results):

{ "_shards" : { "failed" : 1,
      "failures" : [ { "index" : "idfind",
            "reason" : "QueryPhaseExecutionException[[idfind][1]: query[ConstantScore(NotDeleted(cache(_type:test)))],from[0],size[10],sort[<custom:\"name\": org.elasticsearch.index.field.data.strings.StringFieldDataType$1@1d88d65>]: Query Failed [Failed to execute main query]]; nested: IOException[Can't sort on string types with more than one value per doc, or more than one token per field]; ",
            "shard" : 1,
            "status" : 500
          } ],
      "successful" : 4,
      "total" : 5
    },
  "hits" : { "hits" : [ { "_id" : "0220c3d595da40ca861daec793a2ac3a",
            "_index" : "idfind",
            "_score" : null,
            "_source" : { "auto_succeeded" : 1,
                "created" : "2012-05-07T17:16:05.807000",
                "description" : "",
                "id" : "0220c3d595da40ca861daec793a2ac3a",
                "modified" : "2012-05-07T17:21:03.637000",
                "name" : "nepra",
                "owner" : "emanuil_tolev",
                "ratings" : [ { "comment" : "bit useless, just identifies 1 single nonsensical word!",
                      "created" : "2012-05-07T17:18:04.410000",
                      "identifier" : "",
                      "modified" : "2012-05-07T17:18:04.410000",
                      "owner" : "emanuil_tolev",
                      "test_worked" : true
                    },
                    { "comment" : "comment",
                      "created" : "2012-05-07T17:21:03.629000",
                      "identifier" : "nepra",
                      "modified" : "2012-05-07T17:21:03.629000",
                      "owner" : "emanuil_tolev",
                      "test_worked" : true
                    }
                  ],
                "regex" : "nepra",
                "resptest" : "",
                "resptest_cond" : "",
                "resptest_type" : "",
                "score_feedback" : 2,
                "tags" : [  ],
                "url_prefix" : "",
                "url_suffix" : "",
                "useful_links" : [ "" ],
                "votes_feedback" : 2
              },
            "_type" : "test",
            "sort" : [ "nepra" ]
          },
          { "_id" : "4d485c92f62446b2859294e521e20ebc",
            "_index" : "idfind",
            "_score" : null,
            "_source" : { "auto_succeeded" : 0,
                "created" : "2012-05-07T17:20:04.305000",
                "description" : "",
                "id" : "4d485c92f62446b2859294e521e20ebc",
                "modified" : "2012-05-07T17:36:52.852000",
                "name" : "numbas",
                "owner" : "emanuil_tolev",
                "ratings" : [ { "comment" : "",
                      "created" : "2012-05-07T17:36:52.850000",
                      "identifier" : "",
                      "modified" : "2012-05-07T17:36:52.850000",
                      "owner" : "emanuil_tolev",
                      "test_worked" : true
                    } ],
                "regex" : "([0-9]+)",
                "resptest" : "",
                "resptest_cond" : "",
                "resptest_type" : "",
                "score_feedback" : 1,
                "tags" : [ "simple",
                    "correct",
                    "numbers only"
                  ],
                "url_prefix" : "",
                "url_suffix" : "",
                "useful_links" : [ "" ],
                "votes_feedback" : 1
              },
            "_type" : "test",
            "sort" : [ "numbas" ]
          }
        ],
      "max_score" : null,
      "total" : 2
    },
  "timed_out" : false,
  "took" : 3
}
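The response above still contains hits even though 1 of the 5 shards failed, which is why the missing test was easy to overlook. A minimal sketch of a check that calling code could run on a decoded response follows. `failed_shard_reasons` is a hypothetical helper, and the sample response is a shortened version of the one above (hits omitted, reason trimmed).

```python
def failed_shard_reasons(response):
    """Return the failure reasons reported in a search response, if any."""
    shards = response.get("_shards", {})
    if not shards.get("failed"):
        return []
    return [failure.get("reason", "unknown") for failure in shards.get("failures", [])]


# Shortened version of the response shown in this issue.
response = {
    "_shards": {
        "failed": 1,
        "failures": [
            {
                "index": "idfind",
                "reason": "IOException[Can't sort on string types with more "
                          "than one value per doc, or more than one token per field]",
                "shard": 1,
                "status": 500,
            }
        ],
        "successful": 4,
        "total": 5,
    },
    "hits": {"hits": [], "max_score": None, "total": 2},
}

reasons = failed_shard_reasons(response)
if reasons:
    print("partial results, {0} shard(s) failed:".format(len(reasons)))
    for reason in reasons:
        print(" ", reason)
```

Whether to raise an error or just log in this situation is a design choice; the point is only that `_shards.failed` needs checking before trusting the hit count.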