I haven't personally tried tf-idf with Chinese; at first glance it doesn't seem to work.
You probably need to change the tokenizer, but I don't think we have a Chinese tokenizer yet. I'll leave this open for a while and see if anyone else has experience with tf-idf this way.
Is it possible to use something like the following libraries to tokenize the Chinese sentences or document first? https://github.com/dotSlashLu/nodescws https://github.com/yanyiwu/nodejieba
These should split a Chinese sentence into separate Chinese tokens.
For example, the string "中文測試", which means "Chinese test", would become the list ["中文", "測試"], meaning ["Chinese", "test"].
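A minimal sketch of that first step with nodejieba; the segmentation shown in the comment is the expected shape of the output, but the exact split can depend on the dictionary used:

var nodejieba = require('nodejieba');

// Segment a raw Chinese string into word tokens before any tf-idf work.
var tokens = nodejieba.cut('中文測試');
console.log(tokens); // expected to be something like [ '中文', '測試' ]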
@mike820324 did you get any further with this? I'm also using nodejieba on some Chinese NLP projects, but I'm not sure whether I should move the project to Python for NLTK etc.
I have no problem with the Chinese tokenizer, but the code still doesn't work. When I checked listTerms(), it assigns a tf-idf of zero to all terms:

我: 0
搵: 0
緊: 0
游泳池: 0
你們: 0
喺邊度: 0

Is this a problem? How can I fix it?
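One hedged explanation: if this version of natural computes idf as something like log(N / df), then a corpus containing a single document gives log(1/1) = 0, so every term's tf-idf is zero regardless of how the text was tokenized. A minimal sketch of the workaround, assuming addDocument also accepts a pre-tokenized array (the token lists below are purely illustrative):

var natural = require('natural');
var tfidf = new natural.TfIdf();

// One pre-tokenized document: with only a single document in the corpus,
// the idf term can collapse to zero, which would zero out every score.
tfidf.addDocument(['我', '搵', '緊', '游泳池', '你們', '喺邊度']);
console.log(tfidf.listTerms(0)); // may show 0 for every term

// Adding more documents gives the idf a real corpus to compare against.
tfidf.addDocument(['中文', '測試']);
tfidf.addDocument(['游泳池', '關門']);

tfidf.listTerms(0).forEach(function (item) {
  console.log(item.term + ': ' + item.tfidf); // now expected to be non-zero
});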
What do the two values in the list of terms mean? Are they a basic frequency, or the inverse frequency relative to the text?
FWIW, frequency word lists are a mixed bag for Chinese. I think Jieba has its own built in which, while trained on material that is not the most representative, would at least match the same tokens...
You can use https://github.com/yishn/chinese-tokenizer for tokenization. Perhaps @Hugo-ter-Doest would like to add this directly to the package, similar to the port done for the Japanese tokenizer?
My code is:

var natural = require('natural');
var TfIdf = natural.TfIdf;
var tfidf = new TfIdf();

tfidf.addDocument('中文測試', 's1');
var s = JSON.stringify(tfidf);
console.log(s);
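For completeness, a hedged rewrite of that snippet along the lines suggested earlier in the thread: segment the string with nodejieba first, add at least one more document so the idf is meaningful, and inspect listTerms() instead of the raw serialized object. The second document is invented for illustration:

var natural = require('natural');
var nodejieba = require('nodejieba');
var tfidf = new natural.TfIdf();

// Pass pre-segmented tokens so the default English-oriented tokenizer
// never has to handle the Chinese text.
tfidf.addDocument(nodejieba.cut('中文測試'), 's1');
tfidf.addDocument(nodejieba.cut('另一個文件'), 's2'); // made-up second document

// Print each term of the first document with its tf-idf score.
tfidf.listTerms(0).forEach(function (item) {
  console.log(item.term + ': ' + item.tfidf);
});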