Closed MrRace closed 4 years ago
It depends on which feature you use. If you choose char as the feature granularity, you do not need word segmentation. Otherwise, you need to do segmentation first. We have run experiments on open Chinese corpora and on our own datasets, but the results have not been published.
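As a rough illustration of the two granularities, here is a minimal Python sketch. The helper name make_record is hypothetical and not part of the project. The field names (doc_token, doc_label, doc_keyword, doc_topic) are copied from the sample record quoted in this thread. Taking the words from an external segmenter is an assumption.

```python
import json


def make_record(words, labels, granularity="token"):
    """Build one training record (hypothetical helper, not the project's API).

    words: list of already segmented words (from any segmenter you use).
    granularity: "token" keeps the words, "char" splits into characters.
    """
    if granularity == "char":
        # Character granularity: no segmentation needed, so we simply
        # join the text back together and split it per character.
        doc_token = list("".join(words))
    elif granularity == "token":
        # Word granularity: the text must be segmented beforehand.
        doc_token = list(words)
    else:
        raise ValueError("granularity must be 'token' or 'char'")
    return {
        "doc_label": list(labels),
        "doc_token": doc_token,
        "doc_keyword": [],
        "doc_topic": [],
    }


if __name__ == "__main__":
    words = ["明星", "情绪", "失控"]
    # One JSON object per line; ensure_ascii=False keeps Chinese readable.
    print(json.dumps(make_record(words, ["label1"], "token"), ensure_ascii=False))
    print(json.dumps(make_record(words, ["label1"], "char"), ensure_ascii=False))
```

Whether the character-level list should go into doc_token depends on how the project reads that field, which this sketch does not settle.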
@coderbyr If I do word segmentation, should feature_names in train.json be set to "token", and otherwise to "char"?
Yes. Note that if you choose FastText as your model, you can use multiple feature_names, such as token, char, and keyword; otherwise you should choose between token and char.
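The rule in the reply above can be sketched as a small check. This is only an illustration: validate_feature_names is a hypothetical helper, not a function of the project, and "TextCNN" is used merely as an example of a non-FastText model name.

```python
# Models that, per the reply above, accept several feature names at once.
MULTI_FEATURE_MODELS = {"FastText"}
# For every other model, exactly one of these should be chosen.
SINGLE_CHOICES = {"token", "char"}


def validate_feature_names(model_name, feature_names):
    """Return the feature names if they follow the rule, else raise ValueError."""
    names = list(feature_names)
    if not names:
        raise ValueError("feature_names must not be empty")
    if model_name in MULTI_FEATURE_MODELS:
        # FastText: combinations such as token, char, keyword are allowed.
        return names
    if len(names) != 1 or names[0] not in SINGLE_CHOICES:
        raise ValueError(
            f"{model_name}: choose exactly one of {sorted(SINGLE_CHOICES)}"
        )
    return names


if __name__ == "__main__":
    print(validate_feature_names("FastText", ["token", "char", "keyword"]))
    print(validate_feature_names("TextCNN", ["char"]))
```

The check only mirrors the rule as stated in this thread; it does not inspect how the project itself parses train.json.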
@coderbyr Thanks a lot for your reply. For FastText, suppose my training data is segmented like this:
{"doc_keyword": [],
"doc_topic": [],
"doc_token": ["明星", "情绪", "失控", "可怕", "炅", "发飙", "青筋", "暴", "成龙", "飞", "脚", "踹", "主持人", "脸"], "doc_label": ["label1","label2"]}
Could I just set ['char', 'token'] in feature_names to use both the 'char' and 'token' features? It does not seem to work: the result appears to be the same as using only the 'token' feature. If I want to use both features, what should I do? Should I put characters in the "doc_token" field of the training data, i.e. "doc_token": ["明", "星", "情", "绪", "失", "控", "可", "怕", "炅", "发", "飙", "青", "筋", "暴", "成", "龙", "飞", "脚", "踹", "主", "持", "人", "脸"]?
For a Chinese corpus, is word segmentation needed? Are there any examples or evaluation results on Chinese corpora? Thanks a lot!