Tencent / NeuralNLP-NeuralClassifier

An Open-source Neural Hierarchical Multi-label Text Classification Toolkit

Does Chinese text need word segmentation? #72

Closed MrRace closed 4 years ago

MrRace commented 4 years ago

For a Chinese corpus, is word segmentation required? Are there any examples or evaluation results on Chinese corpora? Thanks a lot!

coderbyr commented 4 years ago

It depends on which feature you use. If you choose char as the feature granularity, you do not need word segmentation; otherwise you need to do segmentation first. We have run experiments on open Chinese corpora and on our own datasets, but the results have not been published.
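For anyone landing here, a minimal sketch of the segmentation step, assuming the jieba library (not bundled with NeuralClassifier; any Chinese segmenter would do) and the one-JSON-object-per-line doc_token format shown later in this thread:

```python
import json
import jieba  # third-party word segmenter, assumed for illustration

sentence = "明星情绪失控有多可怕"   # illustrative raw text
labels = ["label1", "label2"]       # placeholder labels

# Word-level tokens for the "token" feature.
words = list(jieba.cut(sentence))

sample = {
    "doc_label": labels,
    "doc_token": words,
    "doc_keyword": [],
    "doc_topic": [],
}

# One JSON object per line, matching the training-data example below.
with open("train.json", "a", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```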

MrRace commented 4 years ago

@coderbyr If I do word segmentation, should feature_names in train.json be set to "token", and to "char" otherwise?

coderbyr commented 4 years ago

Yes. Note that if you choose FastText as your model, you can use multiple feature_names, such as token, char, and keyword; otherwise you should choose either token or char.
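For reference, a minimal sketch of how that setting might be adjusted. Only feature_names itself is taken from this thread; the conf/train.json path and the nesting under a "feature" section are assumptions, not a verified schema:

```python
import json

config_path = "conf/train.json"  # hypothetical path to the training config

with open(config_path, "r", encoding="utf-8") as f:
    config = json.load(f)

# FastText: several feature names can be combined.
config["feature"]["feature_names"] = ["token", "char", "keyword"]
# Other models: pick exactly one, e.g.
# config["feature"]["feature_names"] = ["token"]

with open(config_path, "w", encoding="utf-8") as f:
    json.dump(config, f, ensure_ascii=False, indent=4)
```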

MrRace commented 4 years ago

@coderbyr Thanks a lot for your reply. For FastText, if my training data is segmented like this:

{"doc_keyword": [],
 "doc_topic": [], 
"doc_token": ["明星", "情绪", "失控", "可怕", "炅", "发飙", "青筋", "暴", "成龙", "飞", "脚", "踹", "主持人", "脸"], "doc_label": ["label1","label2"]}

Could I just set ['char', 'token'] in feature_names to use both the 'char' and 'token' features? It does not seem to work; the result seems the same as using only the 'token' feature. If I want to use both the 'char' and 'token' features, what should I do? Should I put characters in the "doc_token" field of the training data, i.e. "doc_token": ["明", "星", "情", "绪", "失", "控", "可", "怕", "炅", "发", "飙", "青", "筋", "暴", "成", "龙", "飞", "脚", "踹", "主", "持", "人", "脸"]?
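For comparison, a minimal sketch of the char-granularity variant described in the question above. Whether feeding individual characters through doc_token is the intended way to use the 'char' feature is an assumption drawn from this thread, not something the toolkit's documentation confirms here:

```python
import json

sentence = "明星情绪失控有多可怕"  # illustrative raw text

sample = {
    "doc_label": ["label1", "label2"],
    "doc_token": list(sentence),   # one character per token
    "doc_keyword": [],
    "doc_topic": [],
}
print(json.dumps(sample, ensure_ascii=False))
```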