The IK Analysis plugin integrates the Lucene IK analyzer into OpenSearch and supports customized dictionaries. Port of https://github.com/medcl/elasticsearch-analysis-ik
Apache License 2.0
fix mistake offset due to end of useless chars #23
As the following simple case shows, the IK analyzer returns wrong offset fields compared with the standard analyzer.
//////////////////////////////////////////////////////////////////////////
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": ["hello, world~", "hello, ik!"]
}
{
  "tokens" : [
    { "token" : "hello", "start_offset" : 0, "end_offset" : 5, "type" : "ENGLISH", "position" : 0 },
    { "token" : "world", "start_offset" : 7, "end_offset" : 12, "type" : "ENGLISH", "position" : 1 },
    { "token" : "hello", "start_offset" : 13, "end_offset" : 18, "type" : "ENGLISH", "position" : 102 },
    { "token" : "ik", "start_offset" : 20, "end_offset" : 22, "type" : "ENGLISH", "position" : 103 }
  ]
}
//////////////////////////////////////////////////////////////////////////
GET _analyze
{
  "analyzer": "standard",
  "text": ["hello, world~", "hello, ik!"]
}
{
  "tokens" : [
    { "token" : "hello", "start_offset" : 0, "end_offset" : 5, "type" : "<ALPHANUM>", "position" : 0 },
    { "token" : "world", "start_offset" : 7, "end_offset" : 12, "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "hello", "start_offset" : 14, "end_offset" : 19, "type" : "<ALPHANUM>", "position" : 102 },
    { "token" : "ik", "start_offset" : 21, "end_offset" : 23, "type" : "<ALPHANUM>", "position" : 103 }
  ]
}
//////////////////////////////////////////////////////////////////////////
We found that the problem is caused by trailing useless characters. The IK analyzer does not count the useless characters at the end of the text and therefore sets the wrong offset in the "end" function. In this example the trailing "~" of the first text value is not counted, so each offset in the second value is one less than the standard analyzer reports (13 instead of 14 for the second "hello").
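The effect of the final offset on a multi-valued text can be illustrated with a small simulation. This is only a sketch and not the plugin's code: it assumes that the offsets of each value are shifted by the previous value's final offset plus a gap of 1 (the gap is inferred from the standard analyzer output above), and the names `tokenize`, `analyze_values`, and `OFFSET_GAP` are hypothetical.

```python
import re

# Assumption: one extra offset unit is inserted between consecutive text values.
OFFSET_GAP = 1


def tokenize(text):
    """Return (token, start, end) for each run of ASCII letters or digits."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"[A-Za-z0-9]+", text)]


def analyze_values(values, count_trailing_chars):
    """Simulate token offsets across several text values.

    count_trailing_chars=True:  the final offset of a value is its full length.
    count_trailing_chars=False: the final offset is the end of the last token,
    so trailing non-token characters are not counted (the behavior described above).
    """
    result = []
    base = 0  # offset at which the current value starts
    for text in values:
        tokens = tokenize(text)
        for token, start, end in tokens:
            result.append((token, base + start, base + end))
        if count_trailing_chars:
            final_offset = len(text)
        else:
            final_offset = tokens[-1][2] if tokens else 0
        base += final_offset + OFFSET_GAP
    return result


values = ["hello, world~", "hello, ik!"]
correct = analyze_values(values, count_trailing_chars=True)
buggy = analyze_values(values, count_trailing_chars=False)

print(correct)
print(buggy)
```

With the two values from the request, `correct` reproduces the offsets of the standard analyzer (the second "hello" starts at 14), while `buggy` reproduces the offsets reported by `ik_max_word` before the fix (the second "hello" starts at 13).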