atilika / kuromoji

Kuromoji is a self-contained and very easy to use Japanese morphological analyzer designed for search
Apache License 2.0

Kuromoji_tokenizer: sort clause does not seem to work for some specific character combinations #141

Open ajaypvymo opened 5 months ago

Query:

{
  "query": {
    "bool": {
    }
  },
  "sort": [
    {
      "attribute.sortable": {
        "order": "asc"
      }
    }
  ]
}

Results:

"hits": [
  {
    "_index": "example_1",
    "_type": "example_1",
    "_id": "A2Ff26qFaV",
    "_score": null,
    "_source": {
      "attributes": {
        "attribute": "サヨ"
      }
    },
    "sort": [
      "サヨ"
    ]
  },
  {
    "_index": "example_2",
    "_type": "example_2",
    "_id": "A2Ff26qFaV",
    "_score": null,
    "_source": {
      "attributes": {
        "attribute": "シヨ"
      }
    },
    "sort": [
      "シ"
    ]
  }
]

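For the example_1 document, the sort value is the full string in the attribute field ("サヨ"), so the sort works as expected. For the example_2 document, the attribute is "シヨ" but the sort value is only "シ": the second character is dropped, so the document is sorted on a truncated key.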
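To find other affected documents, one option is to compare each hit's sort value with its source attribute. The snippet below is only a sketch: it assumes the response shape shown in the results above and that the sortable sub-field is meant to keep the whole string. The helper name `find_truncated_sort_keys` is made up for illustration.

```python
def find_truncated_sort_keys(hits, field="attribute"):
    """Return (_index, source value, sort key) for hits whose sort key
    differs from the stored attribute value.

    Assumes each hit has the shape shown in the results above:
    hit["_source"]["attributes"][field] and a one-element hit["sort"].
    """
    mismatches = []
    for hit in hits:
        source_value = hit["_source"]["attributes"][field]
        sort_key = hit["sort"][0]
        if sort_key != source_value:
            mismatches.append((hit["_index"], source_value, sort_key))
    return mismatches


# The two hits from the results above, reduced to the relevant fields.
hits = [
    {"_index": "example_1",
     "_source": {"attributes": {"attribute": "サヨ"}},
     "sort": ["サヨ"]},
    {"_index": "example_2",
     "_source": {"attributes": {"attribute": "シヨ"}},
     "sort": ["シ"]},
]

print(find_truncated_sort_keys(hits))
```

With the two hits above, only the example_2 document is reported.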

Observed this in 3 instances in total for these strings: