KennFalcon / elasticsearch-analysis-hanlp

HanLP Analyzer for Elasticsearch
Apache License 2.0

Special symbol ω is affected by adjacent spaces, causing tokenization results to contain whitespace #74

Open UsvaZhan opened 4 years ago

UsvaZhan commented 4 years ago


Hi KennFalcon, I have read the source code and run it, but I could not find the root cause of this problem. Any insight would be much appreciated.

  1. When the physical parameter "27ω" is processed with a custom dictionary, a space after the symbol changes the tokenization result
  2. The expected token is "27ω"
  3. The symbol is not affected by the stop-word configuration; the configuration is as follows
          "tokenizer" : {
            "customer_hanlp_index_tokenizer" : {
              "enable_stop_dictionary" : "true",
              "enable_custom_config" : "true",
              "enable_part_of_speech_tagging" : "false",
              "enable_custom_dictionary_forcing" : "true",
              "enable_number_quantifier_recognize" : "false",
              "enable_remote_dict" : "true",
              "type" : "hanlp_index"
            }
          }
  4. Reference data
    • Normal tokenization
      
      {
      "text":"27ω10v",
      "analyzer":"hanlp"
      }

{
  "tokens": [
    { "token": "27ω", "start_offset": 0, "end_offset": 3, "type": "rtrv", "position": 0 },
    { "token": "10v", "start_offset": 3, "end_offset": 6, "type": "vol", "position": 1 }
  ]
}

    • Failing tokenization

      {
      "text":"27ω 10v",
      "analyzer":"hanlp"
      }

{
  "tokens": [
    { "token": "27", "start_offset": 0, "end_offset": 2, "type": "m", "position": 0 },
    { "token": "ω ", "start_offset": 2, "end_offset": 4, "type": "w", "position": 1 },
    { "token": "10v", "start_offset": 4, "end_offset": 7, "type": "vol", "position": 2 }
  ]
}


  5. Behavior with ordinary symbol input

Request:

      {
      "text":"@ # ω ",
      "analyzer":"hanlp"
      }

Result:

{
  "tokens": [
    { "token": "@", "start_offset": 0, "end_offset": 1, "type": "nx", "position": 0 },
    { "token": "#", "start_offset": 2, "end_offset": 3, "type": "nx", "position": 1 },
    { "token": " ω ", "start_offset": 3, "end_offset": 6, "type": "w", "position": 2 }
  ]
}
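(Not part of the plugin, just a possible client-side workaround sketch: as long as the analyzer emits tokens with attached whitespace, the `_analyze` response can be post-processed to trim the tokens and correct their offsets. The token dicts below mirror the failing response above; the function name is hypothetical.)

```python
def trim_tokens(tokens):
    """Strip leading/trailing whitespace from analyzer tokens and
    adjust start/end offsets accordingly; drop pure-whitespace tokens."""
    cleaned = []
    for tok in tokens:
        text = tok["token"]
        stripped = text.strip()
        if not stripped:
            continue  # the token was only whitespace
        lead = len(text) - len(text.lstrip())    # chars trimmed at the front
        trail = len(text) - len(text.rstrip())   # chars trimmed at the back
        cleaned.append({
            **tok,
            "token": stripped,
            "start_offset": tok["start_offset"] + lead,
            "end_offset": tok["end_offset"] - trail,
        })
    return cleaned

# The failing "27ω 10v" response: the middle token is "ω " (with a space).
tokens = [
    {"token": "27", "start_offset": 0, "end_offset": 2, "type": "m", "position": 0},
    {"token": "ω ", "start_offset": 2, "end_offset": 4, "type": "w", "position": 1},
    {"token": "10v", "start_offset": 4, "end_offset": 7, "type": "vol", "position": 2},
]
print(trim_tokens(tokens)[1])
# → {'token': 'ω', 'start_offset': 2, 'end_offset': 3, 'type': 'w', 'position': 1}
```

This only papers over the symptom on the client; the tokens "27" and "ω" are still split, so the real fix has to happen in the segmenter itself.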

UsvaZhan commented 4 years ago

elasticsearch.version=7.4.2

elasticsearch-analysis-hanlp version=7.4.2

lichenhuinn commented 3 years ago

If you are using a custom user dictionary, you can modify the segment method of the TokenizerBuilder class in the source code, changing enableCustomDictionary(configuration.isEnableCustomDictionary()) to the enableCustomDictionaryForcing() method, so that the custom dictionary is applied with high priority.
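For illustration only: the "forcing" mode suggested above makes custom-dictionary entries win over the statistical segmenter. A minimal sketch of that idea, a greedy longest-match over a custom dictionary with uncovered spans handed to a fallback segmenter. All names here are hypothetical and none of this is the plugin's actual code:

```python
def force_segment(text, custom_dict, fallback):
    """Greedy longest-match against custom_dict; spans not covered
    by any custom entry are passed to the fallback segmenter."""
    out, buf, i = [], [], 0
    max_len = max(map(len, custom_dict))
    while i < len(text):
        match = None
        # try the longest possible custom-dictionary entry first
        for l in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + l] in custom_dict:
                match = text[i:i + l]
                break
        if match:
            if buf:  # flush the uncovered span to the fallback segmenter
                out.extend(fallback("".join(buf)))
                buf = []
            out.append(match)
            i += len(match)
        else:
            buf.append(text[i])
            i += 1
    if buf:
        out.extend(fallback("".join(buf)))
    return out

# Hypothetical custom-dictionary entries for the units discussed above.
custom = {"27ω", "10v"}
print(force_segment("27ω 10v", custom, lambda s: s.split()))
# → ['27ω', '10v']
```

With forcing, "27ω" survives as one token even when a space follows, because the dictionary match is taken before the base segmenter ever sees the text.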