codelibs / elasticsearch-analysis-kuromoji-ipadic-neologd

Elasticsearch's Analyzer for Kuromoji with Neologd
Apache License 2.0

Tokenization fails when certain katakana words appear right after (株) #14

Open jnory opened 5 years ago

jnory commented 5 years ago

Hi,

First of all, thank you for this great product.

I recently noticed that this plugin fails to tokenize text in some contexts.

For example, texts such as (株)サイゼリヤ (used in Step 4 below) fail to tokenize.

It seems that the plugin fails when a katakana word appears just after (株), although not every katakana phrase triggers the failure. Could you look into this issue?

Thanks in advance,

Steps to Reproduce

[Step 1] Create Dockerfile

FROM elasticsearch:6.5.1

RUN elasticsearch-plugin install -b org.codelibs:elasticsearch-analysis-kuromoji-neologd:6.5.1

[Step 2] Build docker image

docker build -t es6 .

[Step 3] Run elasticsearch

docker run -p 9200:9200 es6

[Step 4] Query with this text

% curl -s -H 'Content-Type:application/json' -XPOST http://localhost:9200/_analyze -d '{"tokenizer": "kuromoji_neologd_tokenizer", "text": "(株)サイゼリヤ", "explain": true}' | jq .
{
  "detail": {
    "custom_analyzer": true,
    "charfilters": [],
    "tokenizer": {
      "name": "kuromoji_neologd_tokenizer",
      "tokens": []
    },
    "tokenfilters": []
  }
}
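
The check in Step 4 can also be scripted, which makes it easier to try other inputs against the same container. The following is a minimal sketch, not part of the plugin: the helper names (`build_analyze_request`, `extract_tokens`, `analyze`) are made up for illustration, and it assumes Elasticsearch is reachable at `http://localhost:9200` as published in Step 3. The last lines run offline against the response reported above.

```python
import json
import urllib.request

# Assumption: the container from Step 3 publishes Elasticsearch on port 9200.
ES_URL = "http://localhost:9200/_analyze"
TOKENIZER = "kuromoji_neologd_tokenizer"


def build_analyze_request(text, tokenizer=TOKENIZER):
    """Build the same JSON body that the curl command in Step 4 sends."""
    return {"tokenizer": tokenizer, "text": text, "explain": True}


def extract_tokens(response):
    """Return the token strings from an `explain: true` _analyze response."""
    tokens = response["detail"]["tokenizer"]["tokens"]
    return [t["token"] for t in tokens]


def analyze(text, url=ES_URL):
    """POST the text to _analyze and return the list of token strings.

    Hypothetical helper: it needs a running Elasticsearch instance.
    """
    body = json.dumps(build_analyze_request(text)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as reply:
        return extract_tokens(json.load(reply))


# Offline check against the response reported in Step 4: no tokens come back.
REPORTED_RESPONSE = {
    "detail": {
        "custom_analyzer": True,
        "charfilters": [],
        "tokenizer": {"name": "kuromoji_neologd_tokenizer", "tokens": []},
        "tokenfilters": [],
    }
}
print(extract_tokens(REPORTED_RESPONSE))  # prints []
```

With the container running, `analyze("(株)サイゼリヤ")` sends the same request as the curl command; an empty list would correspond to the failure described here.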