aparo / opensearch-analysis-ik

The IK Analysis plugin integrates the Lucene IK analyzer into OpenSearch and supports customized dictionaries. Port of https://github.com/medcl/elasticsearch-analysis-ik
Apache License 2.0
40 stars 14 forks

fix mistake offset due to end of useless chars #23

Closed arvinsg closed 7 months ago

arvinsg commented 1 year ago

As the following simple case shows, the IK analyzer returns wrong offset fields compared with the standard analyzer.

//////////////////////////////////////////////////////////////////////////

GET _analyze
{
  "analyzer": "ik_max_word",
  "text": ["hello, world~", "hello, ik!"]
}

{
  "tokens" : [
    { "token" : "hello", "start_offset" : 0, "end_offset" : 5, "type" : "ENGLISH", "position" : 0 },
    { "token" : "world", "start_offset" : 7, "end_offset" : 12, "type" : "ENGLISH", "position" : 1 },
    { "token" : "hello", "start_offset" : 13, "end_offset" : 18, "type" : "ENGLISH", "position" : 102 },
    { "token" : "ik", "start_offset" : 20, "end_offset" : 22, "type" : "ENGLISH", "position" : 103 }
  ]
}

//////////////////////////////////////////////////////////////////////////

GET _analyze
{
  "analyzer": "standard",
  "text": ["hello, world~", "hello, ik!"]
}

{
  "tokens" : [
    { "token" : "hello", "start_offset" : 0, "end_offset" : 5, "type" : "<ALPHANUM>", "position" : 0 },
    { "token" : "world", "start_offset" : 7, "end_offset" : 12, "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "hello", "start_offset" : 14, "end_offset" : 19, "type" : "<ALPHANUM>", "position" : 102 },
    { "token" : "ik", "start_offset" : 21, "end_offset" : 23, "type" : "<ALPHANUM>", "position" : 103 }
  ]
}

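//////////////////////////////////////////////////////////////////////////

We found that the problem is caused by useless characters at the end of the text. The IK analyzer does not count these trailing useless characters, so it sets the wrong offset in its "end" function. In the example above, the trailing "~" of the first value is skipped, so the tokens of the second value start at offset 13 instead of 14.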
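The offset arithmetic can be reproduced with a small Python model. This is only a sketch, not the plugin's code: the regex tokenizer, the `analyze` function, and the `count_trailing_chars` flag are hypothetical, and it assumes that the start of each value in a multi-valued text is the previous value's final offset plus an offset gap of 1, which is consistent with the numbers in both responses.

```python
import re

OFFSET_GAP = 1  # assumed gap inserted between values of a multi-valued text


def tokenize(text):
    """Toy tokenizer (hypothetical): one token per alphanumeric run."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"[A-Za-z0-9]+", text)]


def analyze(values, count_trailing_chars):
    """Return (token, start_offset, end_offset) across all values.

    count_trailing_chars=False models the reported behaviour: the final
    offset of a value stops at the end of its last token.
    count_trailing_chars=True models the expected behaviour: the final
    offset covers the whole value, including trailing useless characters.
    """
    base = 0
    result = []
    for text in values:
        tokens = tokenize(text)
        for token, start, end in tokens:
            result.append((token, base + start, base + end))
        if count_trailing_chars:
            final_offset = len(text)
        else:
            final_offset = tokens[-1][2] if tokens else 0
        base += final_offset + OFFSET_GAP
    return result


if __name__ == "__main__":
    values = ["hello, world~", "hello, ik!"]
    print(analyze(values, count_trailing_chars=False))  # second "hello" at 13
    print(analyze(values, count_trailing_chars=True))   # second "hello" at 14
```

With the final offset taken from the last token, the first value ends at 12 and the second value starts at 13, matching the ik_max_word response. With the final offset taken from the full length of the value (13), the second value starts at 14, matching the standard analyzer.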