Open UsvaZhan opened 4 years ago
elasticsearch.version=7.4.2
elasticsearch-analysis-hanlp version=7.4.2
If you use a custom user dictionary, you can modify the plugin source: in the `segment` method of the `TokenizerBuilder` class, change `enableCustomDictionary(configuration.isEnableCustomDictionary())` to `enableCustomDictionaryForcing()`, so the custom dictionary is applied with higher priority.
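The effect of that change can be sketched as follows: a forced custom dictionary is matched first (longest match wins), and only the remaining spans are handed to the base segmenter. This is an illustrative Python sketch of the idea, not the plugin's actual code; `base_segment` here is a stand-in for the underlying HanLP segmenter.

```python
def force_segment(text, user_dict, base_segment):
    """Greedy longest-match over user_dict; unmatched spans go to base_segment."""
    max_len = max((len(w) for w in user_dict), default=0)
    tokens, i, pending = [], 0, 0
    while i < len(text):
        match = None
        # try the longest possible dictionary entry starting at position i
        for length in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + length] in user_dict:
                match = text[i:i + length]
                break
        if match:
            if pending < i:  # flush the text accumulated before the match
                tokens.extend(base_segment(text[pending:i]))
            tokens.append(match)
            i += len(match)
            pending = i
        else:
            i += 1
    if pending < len(text):
        tokens.extend(base_segment(text[pending:]))
    return tokens

# stand-in base segmenter: whitespace split (the real one is HanLP's Segment)
base = lambda s: s.split()
print(force_segment("27ω 10v", {"27ω"}, base))  # ['27ω', '10v']
```

With `"27ω"` in the forced dictionary, the entry wins over the base segmentation, which matches the "high priority" behavior described above.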
The special symbol ω is affected by adjacent spaces, causing tokenization results to contain whitespace
Hi KennFalcon, I have read through the source code and run it, but I could not find the root cause. Any insight would be greatly appreciated.
Result for the input without a space (the offsets correspond to `27ω10v`):

```json
{
  "tokens": [
    { "token": "27ω", "start_offset": 0, "end_offset": 3, "type": "rtrv", "position": 0 },
    { "token": "10v", "start_offset": 3, "end_offset": 6, "type": "vol", "position": 1 }
  ]
}
```

Request:

```json
{ "text": "27ω 10v", "analyzer": "hanlp" }
```

Result (note that the token `"ω "` contains the trailing space):

```json
{
  "tokens": [
    { "token": "27", "start_offset": 0, "end_offset": 2, "type": "m", "position": 0 },
    { "token": "ω ", "start_offset": 2, "end_offset": 4, "type": "w", "position": 1 },
    { "token": "10v", "start_offset": 4, "end_offset": 7, "type": "vol", "position": 2 }
  ]
}
```
Request:

```json
{ "text": "@ # ω ", "analyzer": "hanlp" }
```

Result (the token `" ω "` contains both the leading and the trailing space):

```json
{
  "tokens": [
    { "token": "@", "start_offset": 0, "end_offset": 1, "type": "nx", "position": 0 },
    { "token": "#", "start_offset": 2, "end_offset": 3, "type": "nx", "position": 1 },
    { "token": " ω ", "start_offset": 3, "end_offset": 6, "type": "w", "position": 2 }
  ]
}
```
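Until the analyzer itself is fixed, the whitespace can be stripped on the consumer side. A minimal Python sketch (the helper `trim_token` and this approach are my own illustration, not part of the plugin) that trims a token and shifts its offsets to stay consistent with the original text:

```python
def trim_token(tok):
    """Strip surrounding whitespace from one token dict and adjust its offsets."""
    text = tok["token"]
    lead = len(text) - len(text.lstrip())    # leading whitespace count
    trail = len(text) - len(text.rstrip())   # trailing whitespace count
    stripped = text.strip()
    if not stripped:                         # token was pure whitespace: drop it
        return None
    out = dict(tok)
    out["token"] = stripped
    out["start_offset"] = tok["start_offset"] + lead
    out["end_offset"] = tok["end_offset"] - trail
    return out

# token taken from the analyzer output above
print(trim_token({"token": " ω ", "start_offset": 3, "end_offset": 6,
                  "type": "w", "position": 2}))
# → {'token': 'ω', 'start_offset': 4, 'end_offset': 5, 'type': 'w', 'position': 2}
```

Adjusting the offsets rather than only stripping the text keeps highlighting aligned with the source string.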