I assume you applied the sudachi_part_of_speech setting in the README, but I could not reproduce your results here.
Please let us know your configuration file and the dictionaries you are using.
We are using the full dict and base configuration.
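To be concrete about "full dict and base configuration", here is a sketch of how the full dictionary is usually selected, as I understand the README; the settings_path value and file locations are placeholders from our environment, not something the plugin mandates:

{
  "tokenizer": {
    "sudachi_tokenizer": {
      "type": "sudachi_tokenizer",
      "settings_path": "sudachi.json"
    }
  }
}

The referenced sudachi.json names the full system dictionary via "systemDict": "system_full.dic", and everything else is left at its defaults.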
My apologies, the example provided was not quite accurate. I think the problem is actually related to a change in how the baseform, readingform, and normalizedform filters are applied.
I tried to put together a minimal, reproducible example:
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "sudachi_analyzer": {
            "filter": [
              "sudachi_ja_stop",
              "sudachi_baseform"
            ],
            "type": "custom",
            "tokenizer": "sudachi_tokenizer"
          }
        },
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer"
          }
        }
      }
    }
  }
}
If we analyze this text:
{
  "analyzer": "sudachi_analyzer",
  "text": "および"
}
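For completeness, the request is issued as a plain _analyze call against a test index created with the settings above; the index name test_sudachi is just what we used locally:

POST /test_sudachi/_analyze
{
  "analyzer": "sudachi_analyzer",
  "text": "および"
}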
I expect to get no tokens, because および is defined in stopwords.txt and should be removed by the sudachi_ja_stop filter. However, I still get one token:
{
  "tokens": [
    {
      "token": "および",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    }
  ]
}
This is with v3.1.0 running on OpenSearch 2.6.0. I have confirmed that in our old setup, v2.1.0 running on Elasticsearch 7.10.2, the expected behavior occurs (no tokens).
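That is, on the old setup the same request returns an empty token list:

{
  "tokens": []
}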
I see from the changelog that in v3.0.0 there was a change related to how analysis chains are processed. Is this a side effect of that?
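One diagnostic I can offer as a sketch (not verified to isolate the regression): define the generic stop token filter with an explicit stopword list alongside the sudachi chain and compare the two analyzers. If the token survives sudachi_ja_stop but not the generic stop filter, the problem lies in how sudachi_ja_stop interacts with the new chain. The filter and analyzer names below (explicit_ja_stop, sudachi_analyzer_check) are made up for this example:

{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "explicit_ja_stop": {
            "type": "stop",
            "stopwords": ["および"]
          }
        },
        "analyzer": {
          "sudachi_analyzer_check": {
            "type": "custom",
            "tokenizer": "sudachi_tokenizer",
            "filter": ["explicit_ja_stop", "sudachi_baseform"]
          }
        },
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer"
          }
        }
      }
    }
  }
}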
On further investigation, this is the exact same issue as #111. Please close this if you feel it is necessary, but it would be nice to resolve #111.
Hi,
Thank you for the great plugin. I cannot reproduce the example written in the official documentation.
Input:
Expected (as described in the document):
Actual (v3.1.0 release with OpenSearch 2.6.0):
If you think of any possible causes, please leave a comment. I appreciate your assistance.