KennFalcon / elasticsearch-analysis-hanlp

HanLP Analyzer for Elasticsearch
Apache License 2.0

Special symbol ω is affected by adjacent spaces, causing tokenization results to contain whitespace #74

Open UsvaZhan opened 4 years ago

UsvaZhan commented 4 years ago


Hi KennFalcon, I have read the source code and run it, but I could not find the root cause of this problem. Any insight would be much appreciated.

  1. When the physical parameter "27ω" is processed with a custom dictionary, a space after the symbol changes the tokenization result
  2. The expected token is "27ω"
  3. The symbol is not affected by the stop-word configuration; the configuration is as follows
          "tokenizer" : {
            "customer_hanlp_index_tokenizer" : {
              "enable_stop_dictionary" : "true",
              "enable_custom_config" : "true",
              "enable_part_of_speech_tagging" : "false",
              "enable_custom_dictionary_forcing" : "true",
              "enable_number_quantifier_recognize" : "false",
              "enable_remote_dict" : "true",
              "type" : "hanlp_index"
            }
          }
  4. Reference data
    • Normal tokenization
      
      {
      "text":"27ω10v",
      "analyzer":"hanlp"
      }

{
  "tokens": [
    { "token": "27ω", "start_offset": 0, "end_offset": 3, "type": "rtrv", "position": 0 },
    { "token": "10v", "start_offset": 3, "end_offset": 6, "type": "vol", "position": 1 }
  ]
}

    • Failing tokenization

      {
      "text":"27ω 10v",
      "analyzer":"hanlp"
      }

{
  "tokens": [
    { "token": "27", "start_offset": 0, "end_offset": 2, "type": "m", "position": 0 },
    { "token": "ω ", "start_offset": 2, "end_offset": 4, "type": "w", "position": 1 },
    { "token": "10v", "start_offset": 4, "end_offset": 7, "type": "vol", "position": 2 }
  ]
}


  5. Behavior with ordinary symbol input

Request:

      {
      "text":"@ # ω ",
      "analyzer":"hanlp"
      }

Result:

{
  "tokens": [
    { "token": "@", "start_offset": 0, "end_offset": 1, "type": "nx", "position": 0 },
    { "token": "#", "start_offset": 2, "end_offset": 3, "type": "nx", "position": 1 },
    { "token": " ω ", "start_offset": 3, "end_offset": 6, "type": "w", "position": 2 }
  ]
}
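(Not part of the plugin, just a possible client-side workaround sketch: as long as the analyzer emits tokens with attached whitespace, the `_analyze` response can be post-processed to trim the tokens and correct their offsets. The token dicts below mirror the failing response above; the function name is hypothetical.)

```python
def trim_tokens(tokens):
    """Strip leading/trailing whitespace from analyzer tokens and
    adjust start/end offsets accordingly; drop pure-whitespace tokens."""
    cleaned = []
    for tok in tokens:
        text = tok["token"]
        stripped = text.strip()
        if not stripped:
            continue  # the token was only whitespace
        lead = len(text) - len(text.lstrip())    # chars trimmed at the front
        trail = len(text) - len(text.rstrip())   # chars trimmed at the back
        cleaned.append({
            **tok,
            "token": stripped,
            "start_offset": tok["start_offset"] + lead,
            "end_offset": tok["end_offset"] - trail,
        })
    return cleaned

# The failing "27ω 10v" response: the middle token is "ω " (with a space).
tokens = [
    {"token": "27", "start_offset": 0, "end_offset": 2, "type": "m", "position": 0},
    {"token": "ω ", "start_offset": 2, "end_offset": 4, "type": "w", "position": 1},
    {"token": "10v", "start_offset": 4, "end_offset": 7, "type": "vol", "position": 2},
]
print(trim_tokens(tokens)[1])
# → {'token': 'ω', 'start_offset': 2, 'end_offset': 3, 'type': 'w', 'position': 1}
```

This only papers over the symptom on the client; the tokens "27" and "ω" are still split, so the real fix has to happen in the segmenter itself.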

UsvaZhan commented 4 years ago

elasticsearch.version=7.4.2

elasticsearch-analysis-hanlp version=7.4.2

lichenhuinn commented 3 years ago

If you are using a custom user dictionary, you can modify the segment method of the TokenizerBuilder class in the source code, changing enableCustomDictionary(configuration.isEnableCustomDictionary()) to the enableCustomDictionaryForcing() method, so that the custom dictionary is applied with high priority.
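For illustration only: the "forcing" mode suggested above makes custom-dictionary entries win over the statistical segmenter. A minimal sketch of that idea, a greedy longest-match over a custom dictionary with uncovered spans handed to a fallback segmenter. All names here are hypothetical and none of this is the plugin's actual code:

```python
def force_segment(text, custom_dict, fallback):
    """Greedy longest-match against custom_dict; spans not covered
    by any custom entry are passed to the fallback segmenter."""
    out, buf, i = [], [], 0
    max_len = max(map(len, custom_dict))
    while i < len(text):
        match = None
        # try the longest possible custom-dictionary entry first
        for l in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + l] in custom_dict:
                match = text[i:i + l]
                break
        if match:
            if buf:  # flush the uncovered span to the fallback segmenter
                out.extend(fallback("".join(buf)))
                buf = []
            out.append(match)
            i += len(match)
        else:
            buf.append(text[i])
            i += 1
    if buf:
        out.extend(fallback("".join(buf)))
    return out

# Hypothetical custom-dictionary entries for the units discussed above.
custom = {"27ω", "10v"}
print(force_segment("27ω 10v", custom, lambda s: s.split()))
# → ['27ω', '10v']
```

With forcing, "27ω" survives as one token even when a space follows, because the dictionary match is taken before the base segmenter ever sees the text.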