NaturalNode / natural

general natural language facilities for node
MIT License

Can TfIdf be used with Chinese? #212

Open c941010623 opened 9 years ago

c941010623 commented 9 years ago

My code is:

var natural = require('natural'),
  TfIdf = natural.TfIdf,
  tfidf = new TfIdf();

tfidf.addDocument('中文測試', 's1');
var s = JSON.stringify(tfidf);
console.log(s);

kkoch986 commented 9 years ago

I haven't personally tried TfIdf with Chinese; at first glance it doesn't seem to work.

You probably need to change the tokenizer, but I don't think we have a Chinese tokenizer yet. I'll leave this open for a while and see if anyone else has experience using TfIdf this way.
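
For anyone trying this today, here is a minimal sketch of the "change the tokenizer" idea. It assumes natural's TfIdf exposes setTokenizer(), which expects an object with a tokenize(text) method returning an array of strings; the segmentChinese function below is only a naive placeholder for a real Chinese word segmenter.

// Sketch: swap natural's default word tokenizer for a custom one.
// setTokenizer() is assumed to accept any object exposing tokenize(text) -> string[].
var natural = require('natural');
var tfidf = new natural.TfIdf();

// Placeholder segmenter: treats every CJK character as its own token and
// keeps other non-space runs intact. Replace with a real word segmenter.
function segmentChinese(text) {
  return text.match(/[\u4e00-\u9fff]|[^\u4e00-\u9fff\s]+/g) || [];
}

tfidf.setTokenizer({ tokenize: segmentChinese });

tfidf.addDocument('中文測試', 's1');
console.log(JSON.stringify(tfidf));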

mike820324 commented 9 years ago

Is it possible to first tokenize the Chinese sentence or document with something like the following libraries? https://github.com/dotSlashLu/nodescws https://github.com/yanyiwu/nodejieba

They should split a Chinese sentence into separate Chinese tokens.

For example, the string "中文測試", which means "Chinese test", would become the list ["中文", "測試"], i.e. ["Chinese", "test"].
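
A rough sketch of that pre-tokenization approach with nodejieba, assuming addDocument() also accepts a pre-tokenized array of strings (check this against the natural version you are running):

// Sketch: segment the text with nodejieba first, then hand the tokens to TfIdf.
var natural = require('natural');
var nodejieba = require('nodejieba'); // npm install nodejieba

var tfidf = new natural.TfIdf();

var tokens = nodejieba.cut('中文測試'); // expected to yield something like ['中文', '測試']
tfidf.addDocument(tokens, 's1');

console.log(JSON.stringify(tfidf));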

dcsan commented 7 years ago

@mike820324 did you get any further with this? I'm also using nodejieba on some Chinese NLP projects, but I'm not sure whether I should move the project to Python for NLTK etc.

anton-bot commented 6 years ago

I have no problem with the Chinese tokenizer, but the code still doesn't work. When I checked listTerms(), it assigns a tfidf of zero to every term:

我: 0
搵: 0
緊: 0
游泳池: 0
你們: 0
喺邊度: 0

Is this a problem? How can I fix it?

dcsan commented 6 years ago

What do the two values in the list of terms mean? Is this a basic frequency, or the inverse frequency relative to the text?

FWIW, word frequency lists are a mixed bag for Chinese. I think Jieba has its own built in, which, while not trained on the most representative material, would at least match the same tokens...
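
For reference, natural's listTerms(n) returns one object per term with a term field and a tfidf field, so the second value in the output above should be the tf-idf weight of that term in that document, not a raw count. A small sketch, reusing the terms from the earlier comment as pre-tokenized documents (again assuming addDocument() accepts token arrays); with several documents in the corpus the weights stop being uniform:

// Sketch: inspect per-document tf-idf weights via listTerms().
var natural = require('natural');
var tfidf = new natural.TfIdf();

// Pre-tokenized example documents built from the terms listed above.
var docs = [
  ['我', '搵', '緊', '游泳池'],
  ['你們', '喺邊度'],
  ['游泳池', '喺邊度']
];

docs.forEach(function (tokens, i) {
  tfidf.addDocument(tokens, 'doc' + i);
});

tfidf.listTerms(0).forEach(function (item) {
  console.log(item.term + ': ' + item.tfidf);
});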

titanism commented 2 years ago

You can use https://github.com/yishn/chinese-tokenizer for tokenization. Perhaps @Hugo-ter-Doest would like to add this directly to the package, similar to the port done for the Japanese tokenizer?
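
A rough sketch of wiring chinese-tokenizer into TfIdf via setTokenizer(). Going by that project's README, the tokenizer is loaded from a CC-CEDICT dictionary file and returns token objects with a text field; both of those details, and the dictionary path below, should be double-checked against the current README:

// Sketch: adapt chinese-tokenizer so natural's TfIdf can use it directly.
var natural = require('natural');
var chineseTokenizer = require('chinese-tokenizer');

// chinese-tokenizer needs a CC-CEDICT dictionary; the path here is an example.
var tokenize = chineseTokenizer.loadFile('./cedict_ts.u8');

var tfidf = new natural.TfIdf();
tfidf.setTokenizer({
  tokenize: function (text) {
    // Keep only the surface text of each token.
    return tokenize(text).map(function (token) { return token.text; });
  }
});

tfidf.addDocument('中文測試', 's1');
tfidf.addDocument('這是另一個測試', 's2');

tfidf.listTerms(0).forEach(function (item) {
  console.log(item.term + ': ' + item.tfidf);
});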