How does add jieba custom dictionary? - Githubissues

crownpku / Rasa_NLU_Chi

Turn Chinese natural language into structured data (Chinese natural language understanding)

Apache License 2.0

How does add jieba custom dictionary? #4

Open yahuvi opened 7 years ago

yahuvi commented 7 years ago

I want to add a jieba custom dictionary. Which config file can do it?

crownpku commented 7 years ago

Hi, you may refer to the following instructions from jieba, and add the corresponding code with your own dictionary in https://github.com/crownpku/rasa_nlu_chi/blob/master/rasa_nlu/tokenizers/jieba_tokenizer.py

    def tokenize(self, text):
        # type: (Text) -> List[Token]
        import jieba
        # MODIFICATION
        jieba.load_userdict(file_name)  # file_name is a file-like object or the path to the custom dictionary
        # MODIFICATION ENDS
        words = jieba.lcut(text.encode('utf-8'))

From jieba:

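Loading a dictionary

Developers can specify their own custom dictionary to include words that are not in the jieba vocabulary. Although jieba can recognize new words, adding them yourself guarantees higher accuracy.

Usage: jieba.load_userdict(file_name)  # file_name is a file-like object or the path to the custom dictionary

The dictionary format is the same as dict.txt: one word per line, and each line has three parts separated by spaces, in this fixed order: the word, the frequency (optional), and the part-of-speech tag (optional). If file_name is a path or a file opened in binary mode, the file must be UTF-8 encoded. When the frequency is omitted, jieba uses an automatically computed frequency that guarantees the word can be segmented out.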
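To make the format above concrete, here is a minimal sketch that checks dictionary lines against the "word, optional frequency, optional tag" layout before handing the file to jieba. The helper names parse_userdict_line and load_custom_dictionary and the sample words are made up for illustration; only jieba.load_userdict comes from the jieba instructions quoted above, and the check is stricter than whatever jieba itself enforces.

```python
def parse_userdict_line(line):
    """Parse 'word [freq] [tag]' into (word, freq, tag); missing parts are None."""
    parts = line.split()
    if not 1 <= len(parts) <= 3:
        raise ValueError("expected 'word [freq] [tag]', got %r" % line)
    word, rest = parts[0], parts[1:]
    freq = None
    if rest and rest[0].isdecimal():
        freq = int(rest.pop(0))  # the frequency, when present, comes second
    tag = rest.pop(0) if rest else None
    if rest or (tag is not None and tag.isdecimal()):
        raise ValueError("parts out of order in %r" % line)
    return word, freq, tag


def load_custom_dictionary(path):
    """Validate every non-blank line of a UTF-8 file, then pass the path to jieba."""
    import jieba  # imported lazily so the parser above works without jieba

    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                parse_userdict_line(line)
    jieba.load_userdict(path)


# Hypothetical sample entries covering all four allowed shapes of a line.
SAMPLE_LINES = ["云计算 5", "创新办 3 i", "凱特琳 nz", "台中"]
PARSED = [parse_userdict_line(line) for line in SAMPLE_LINES]
```

Validating first is a design choice for this sketch: a line with the parts in the wrong order fails loudly instead of silently producing an unexpected entry.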

BrikerMan commented 6 years ago

Could you provide a way to load it from a config file? Thanks.

crownpku commented 6 years ago

@BrikerMan The latest commit adds a way to load it from the config file. Please add the file path of your jieba user dictionary to the config file sample_configs/config_jieba_mitie_sklearn.json.
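As a rough illustration of reading that setting, the sketch below pulls the dictionary path out of a parsed config. It assumes the key is jieba_userdic and that the string "None" means "not set", as in the traceback and sample config later in this thread; the helper name userdict_path_from_config and the path ./data/userdict.txt are hypothetical.

```python
import json


def userdict_path_from_config(config):
    """Return the configured user dictionary path, or None when it is unset."""
    value = config.get("jieba_userdic")
    # The sample config uses the string "None", not JSON null, for "unset".
    if value is None or value == "None":
        return None
    return value


# Hypothetical config fragment; a real config file has more keys.
config = json.loads('{"language": "zh", "jieba_userdic": "./data/userdict.txt"}')
userdict_path = userdict_path_from_config(config)
```

Using .get() instead of config['jieba_userdic'] means a config file without the key is treated the same as one that sets it to "None".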

BrikerMan commented 6 years ago

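This project hasn't been merged into the official one yet, right? So I have to build my application on top of this repository, and I can't just pip install rasa nlu, correct?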

crownpku commented 6 years ago

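@BrikerMan rasa_nlu_chi itself keeps pulling in the latest rasa_nlu code. It can't be merged into the official repository right now because the language control part of the main rasa_nlu framework still has problems. In the long run it should be merged in as language support.

For Chinese-language applications, you will probably still need to use rasa_nlu_chi for now.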

BrikerMan commented 6 years ago

OK, I'll use this one for now. Thanks a lot. I'll keep looking into it.

BrikerMan commented 6 years ago

I ran into an error. The config file loads fine and the training data is found.

Traceback (most recent call last):
  File "train.py", line 21, in <module>
    trainer.train(training_data)
  File "/Users/brikerman/Desktop/ailab/rasa-related/Rasa_NLU_Chi/rasa_nlu/model.py", line 157, in train
    updates = component.train(working_data, self.config, **context)
  File "/Users/brikerman/Desktop/ailab/rasa-related/Rasa_NLU_Chi/rasa_nlu/tokenizers/jieba_tokenizer.py", line 37, in train
    example.set("tokens", self.tokenize(example.text))
  File "/Users/brikerman/Desktop/ailab/rasa-related/Rasa_NLU_Chi/rasa_nlu/tokenizers/jieba_tokenizer.py", line 49, in tokenize
    if config['jieba_userdic'] != 'None':
NameError: name 'config' is not defined

BrikerMan commented 6 years ago

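The cause is that the tokenize method has no access to config, and the dictionary shouldn't be loaded on every tokenize call anyway. I moved the loading into the train method. It runs fine that way, but it isn't right either: the dictionary should be loaded when the tokenizer is initialized.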

    def train(self, training_data, config, **kwargs):
        # type: (TrainingData, RasaNLUConfig, **Any) -> None
        if config['language'] != 'zh':
            raise Exception("tokenizer_jieba is only used for Chinese. Check your configure json file.")
        # Add jieba userdict file
        if config['jieba_userdic'] != 'None':
            jieba.load_userdict(config['jieba_userdic'])
        for example in training_data.training_examples:
            example.set("tokens", self.tokenize(example.text))

crownpku commented 6 years ago

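It doesn't seem easy to add config to init, since that touches the init definition of the whole tokenizer. The simplest approach seems to be putting it in train and loading the jieba userdict when processing training_data. By "not right", do you mean that the dictionary has to be loaded on every train? Training on the user's full data is one complete run anyway, so loading the userdict once per run doesn't seem to be a problem. I'll fix the code this way for now.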

BrikerMan commented 6 years ago

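What's not right is that I load the dictionary during train, but prediction doesn't go through this path, so my segmentation differs between training and prediction. Loading the full dictionary once per train is fine.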
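One way to avoid the mismatch described above is to route both training and prediction through a single "load once" helper. The sketch below is not the Rasa_NLU_Chi implementation and does not follow the Rasa component API: the class UserDictTokenizer, the method name process, and the injected load_userdict and cut callables are all assumptions made for illustration. With jieba installed, the two callables would be jieba.load_userdict and jieba.lcut.

```python
class UserDictTokenizer(object):
    """Tokenizer wrapper that loads the user dictionary on both code paths."""

    def __init__(self, load_userdict, cut):
        self._load_userdict = load_userdict  # e.g. jieba.load_userdict
        self._cut = cut                      # e.g. jieba.lcut
        self._loaded = set()                 # paths that were already loaded

    def _ensure_userdict(self, config):
        path = config.get("jieba_userdic")
        # "None" is the string sentinel used in the sample config.
        if path and path != "None" and path not in self._loaded:
            self._load_userdict(path)
            self._loaded.add(path)

    def train(self, texts, config):
        self._ensure_userdict(config)
        return [self.tokenize(text) for text in texts]

    def process(self, text, config):
        # Prediction goes through the same helper, so segmentation matches training.
        self._ensure_userdict(config)
        return self.tokenize(text)

    def tokenize(self, text):
        return self._cut(text)
```

Because the helper remembers which paths it has loaded, calling it from both train and process costs one dictionary load per path rather than one per tokenized sentence.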

crownpku commented 6 years ago

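@BrikerMan I moved import jieba out of the tokenize method so that the import doesn't run on every tokenizer call, and put the jieba userdict loading into the train function. If you have a better implementation, feel free to modify the code and open a pull request.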

crownpku commented 6 years ago

@BrikerMan I see what you mean now. Inference does have a problem. Let me think about how to handle it.

BrikerMan commented 6 years ago

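If I define a custom pipeline in my own project, how do I register it? I installed rasa-nlu-chi with pip. From what I've read, a custom pipeline requires modifying rasa_nlu.registry.py. How can I load a custom pipeline without changing the rasa source files?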

crownpku commented 6 years ago

For a custom pipeline you only need to modify the config file:

{
  "name": "rasa_nlu_test",
  "pipeline": ["nlp_mitie",
        "tokenizer_jieba",
        "ner_mitie",
        "ner_synonyms",
        "intent_entity_featurizer_regex",
        "intent_featurizer_mitie",
        "intent_classifier_sklearn"],
  "language": "zh",
  "mitie_file": "./data/total_word_feature_extractor_zh.dat",
  "path" : "./models",
  "data" : "./data/examples/rasa/demo-rasa_zh.json",
  "jieba_userdic": "None"
}

Or are you trying to add a new module?

BrikerMan commented 6 years ago

After adding it there, I get this message:

If you are creating your own component, make sure it is either listed as part of the component_classes in rasa_nlu.registry.py or is a proper name of a class in a module.

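It seems the class needs to be registered; otherwise rasa doesn't know where to import it from. I want to register a component that converts capital-form Chinese numerals to Arabic numerals.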
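The message quoted above mentions a second option besides the registry: "a proper name of a class in a module". The sketch below only illustrates how a lookup by dotted module path can work in general. It is not Rasa's code, the function resolve_component_class is hypothetical, and whether an installed rasa_nlu version accepts a module path in the pipeline list needs to be checked against that version.

```python
import collections
import importlib


def resolve_component_class(name, registry):
    """Look a name up in a registry first, then try it as 'package.module.Class'."""
    if name in registry:
        return registry[name]
    module_name, _, class_name = name.rpartition(".")
    if not module_name:
        raise ValueError("%r is neither registered nor a dotted class path" % name)
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# A registered short name and a dotted path resolve to the same class here.
# A project component would use its own path, e.g. the hypothetical
# "my_project.numerals.ChineseNumeralConverter".
registry = {"ordered_dict": collections.OrderedDict}
by_name = resolve_component_class("ordered_dict", registry)
by_path = resolve_component_class("collections.OrderedDict", registry)
```

With a lookup of this shape, a project-local class only has to be importable from the Python path; the library's own registry file stays untouched.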

crownpku commented 6 years ago

New components do need to be registered. Take jieba_tokenizer as an example and search the project for the related code.

BrikerMan commented 6 years ago

Yes, I saw that. I was just wondering whether there is a way to register it without modifying the rasa code.

crownpku commented 6 years ago

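Regarding adding a jieba custom dictionary, I haven't found a really elegant approach yet. In the current version (20171116), users need to put their jieba custom dictionaries under rasa_nlu_chi/jieba_userdict/. The system automatically finds them and loads them into jieba during both training and prediction.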

DoubleAix commented 6 years ago

I have some questions about how the jieba dictionary is added. I installed with python setup.py install, which put the package in site-packages:

unzip -l rasa_nlu-0.12.0a1-py3.6.egg | grep jieba
    1665  03-12-2018 15:44  rasa_nlu/tokenizers/jieba_tokenizer.py
    2200  03-12-2018 16:22  rasa_nlu/tokenizers/__pycache__/jieba_tokenizer.cpython-36.pyc

My rasa_nlu is imported from this installed package, not from the package in the directory created by git clone https://github.com/crownpku/Rasa_NLU_Chi.git.

So in my project directory I run python -m rasa_nlu.train -c sample_configs/config_jieba_mitie_sklearn.json. According to your source code:

import glob
import jieba
jieba_userdicts = glob.glob("./jieba_userdict/*")
for jieba_userdict in jieba_userdicts:
    jieba.load_userdict(jieba_userdict)

Does that mean a jieba_userdict directory must exist in this project directory before I can put dictionaries in it?

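I think that when a dictionary is loaded into jieba, there should be a console message confirming that it was loaded. I suspect many people may not actually be loading one at all.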

Is my understanding correct?

Thanks!!
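The two points raised above can be sketched together: a relative pattern such as "./jieba_userdict/*" is resolved against the current working directory rather than the installed package, and a log line per file confirms what was loaded. This is not the code in the repository; the function load_userdicts and the injected load_fn are hypothetical. With jieba installed, load_fn would be jieba.load_userdict.

```python
import glob
import logging
import os

logger = logging.getLogger(__name__)


def load_userdicts(directory, load_fn):
    """Load every file in `directory` with `load_fn` and report what happened."""
    pattern = os.path.join(directory, "*")
    paths = sorted(p for p in glob.glob(pattern) if os.path.isfile(p))
    if not paths:
        # Log the absolute path so it is clear which directory was searched.
        logger.warning("No jieba user dictionaries found in %s",
                       os.path.abspath(directory))
    for path in paths:
        load_fn(path)
        logger.info("Loaded jieba user dictionary: %s", path)
    return paths
```

A call such as load_userdicts("./jieba_userdict", jieba.load_userdict) would look in the directory the command is run from, so the warning shows up when the process starts somewhere else. The messages only appear on the console if logging is configured, for example with logging.basicConfig(level=logging.INFO).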

crownpku commented 6 years ago

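@DoubleAix Thanks for pointing this out. The code and README have been updated. If you have a better way to add user dictionaries, suggestions are welcome.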