NaturalNode / natural

general natural language facilities for node
MIT License

about chinese #177

Open ahl5esoft opened 10 years ago

ahl5esoft commented 10 years ago

How do I use the classifier with Chinese text?

ahl5esoft commented 10 years ago

classifier.addDocument('五裕紫菜片', '干货');
classifier.addDocument('优香岛桂皮', '干货');
classifier.addDocument('苗家辣妹辣椒', '干货');
classifier.addDocument('海博卷尺', '小五金');
classifier.addDocument('三达SD-156A双重过滤烟嘴', '小五金');
classifier.addDocument('波斯BS-I3091测电笔', '小五金');
classifier.train();

classifier.classify('紫菜') => 干货

classifier.classify('双重过滤') => 干货

classifier.classify('波斯') => 干货

Why do these all classify as 干货?

kkoch986 commented 10 years ago

The classifier relies on a tokenizer and stemmer, so that could be part of the problem. I don't think we have a Chinese stemmer at the moment, and if you use the English one it will use the English tokenizer, which probably won't help much.

This is part of the reason we need #159: it could help ensure that when a tokenizer is used, it's the correct one for the language.
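
To see why, it can help to look at what the default English pipeline actually produces for these strings. This is only an illustrative sketch: it assumes the classifier falls back to natural's PorterStemmer.tokenizeAndStem (the default in current versions), whose word-boundary rules are built for ASCII text, so CJK strings likely come back with no usable tokens and the trained documents become nearly indistinguishable.

var natural = require('natural');

// Illustrative check of the default English pipeline on Chinese input.
// Exact output depends on the installed version of natural.
console.log(natural.PorterStemmer.tokenizeAndStem('五裕紫菜片'));
// likely [] -- no word boundaries are found, so no features survive

console.log(natural.PorterStemmer.tokenizeAndStem('三达SD-156A双重过滤烟嘴'));
// likely only the ASCII fragments survive, e.g. [ 'sd', '156a' ]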

mike820324 commented 9 years ago

I think the Chinese language doesn't need stemming at all, but tokenizing a Chinese document will be a very painful job.

smilechun commented 7 years ago

Not sure if it is possible, but I tried applying nodejieba to classification and it seems to work.

var nodejieba = require("nodejieba");
var natural = require('natural'),
    classifier = new natural.BayesClassifier();

classifier.addDocument(nodejieba.cut("红掌拨清波"), 'poem');
classifier.addDocument(nodejieba.cut("想睇戲"), 'action');
classifier.addDocument(nodejieba.cut("南京市长江大桥"), 'place');
classifier.train();

console.log(classifier.classify(nodejieba.cut('红掌拨清波')));
console.log(classifier.classify(nodejieba.cut("想睇戲")));
console.log(classifier.classify(nodejieba.cut('南京市长江大桥睇戲')));
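
(As far as I can tell, this works because nodejieba.cut returns an array of segmented tokens, and natural's addDocument and classify accept a pre-tokenized array and skip the built-in English tokenizeAndStem step, so the jieba segmentation is what the classifier actually trains on.)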

loretoparisi commented 6 years ago

So basically, would it be possible to add a TokenizerZh by using nodejieba.cut as the tokenization function override?
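
Something like that already seems doable today, assuming BayesClassifier still accepts a custom stemmer-like object (anything exposing tokenizeAndStem) as its first constructor argument; the wrapper below is just a sketch of a hypothetical TokenizerZh, not an existing API.

var nodejieba = require('nodejieba');
var natural = require('natural');

// Hypothetical "TokenizerZh": an object with the tokenizeAndStem/stem
// interface that natural's classifiers expect from a stemmer. No stemming
// is done, since Chinese doesn't need it; we only segment with nodejieba.
var chineseStemmer = {
  tokenizeAndStem: function (text) {
    return nodejieba.cut(text);
  },
  stem: function (token) {
    return token; // identity stem
  }
};

// Assumption: BayesClassifier's first constructor argument is the stemmer.
var classifier = new natural.BayesClassifier(chineseStemmer);

classifier.addDocument('五裕紫菜片', '干货');
classifier.addDocument('海博卷尺', '小五金');
classifier.train();

console.log(classifier.classify('紫菜'));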

titanism commented 2 years ago

You can use https://github.com/yishn/chinese-tokenizer for tokenization. Perhaps @Hugo-ter-Doest would like to add this directly to the package, similar to the port done for the Japanese tokenizer.
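
A rough sketch of how that package could feed the classifier, based on its README; the loadFile call, the token's .text property, and the CC-CEDICT dictionary path are assumptions that may differ by version:

// chinese-tokenizer needs a CC-CEDICT dictionary file to segment text.
var tokenize = require('chinese-tokenizer').loadFile('./cedict_ts.u8');

// Map the token objects down to plain strings for the classifier.
var tokens = tokenize('南京市长江大桥').map(function (t) { return t.text; });
console.log(tokens); // segmented words, ready to pass to addDocument/classify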

Hugo-ter-Doest commented 2 years ago

Will look into this.