Open sdg002 opened 5 years ago
Which tokenizer are you using? RegexTokenizer or TreebankTokenizer?
https://github.com/SciSharp/CherubNLP/tree/master/CherubNLP/Tokenize
Tried with TreebankTokenizer. RegexTokenizer is throwing an ArgumentNullException. I guess I am not using it the right way.
Can you run this UnitTest? https://github.com/SciSharp/CherubNLP/tree/master/CherubNLP.UnitTest/Tokenize
The RegexTokenizer was able to parse "hello-world". Unfortunately, it also split 50,000 in the sentence "this will cost 50,000" into 50 and 000.
Nevertheless, your efforts are commendable. I think I am asking too much at this moment.
Handling digits with commas can be added easily. You can do it and submit a PR.
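A minimal sketch of the fix being discussed, using Python's `re` module for illustration (the pattern and function name are mine, not CherubNLP's actual RegexTokenizer implementation):

```python
import re

# Illustrative pattern: try comma-grouped numbers first, then ordinary
# words, then single punctuation marks. Alternation order matters -- the
# number branch must come before \w+ so "50,000" is not split at the comma.
TOKEN_PATTERN = re.compile(r"""
    \d{1,3}(?:,\d{3})+   # numbers with thousands separators, e.g. 50,000
  | \w+                  # ordinary words and plain numbers
  | [^\w\s]              # any single punctuation character
""", re.VERBOSE)

def tokenize(text):
    return TOKEN_PATTERN.findall(text)

print(tokenize("this will cost 50,000"))
# ['this', 'will', 'cost', '50,000']
```

The same idea carries over to the C# tokenizer: add a digits-with-commas alternative ahead of the plain word alternative in the regex.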
This project is really practical, but there is very little documentation, and my English is not good. How can I learn about it in detail? I have already got it running; I just don't know how to achieve the effect I want.
Please refer to the unit tests.
Thanks, boss. Could you share your contact information? I ran all the methods in the unit tests and most of them work, but I don't know what functionality each one actually implements. My English isn't good, so I can't really figure it out. Also, I wasn't able to download the wordvec_enu.bin file. I have looked at a lot of NLP code, and yours is the most powerful, the most complete, and the best fit for me. I'm eager to understand all of it, but without documentation I can't figure it out in a short time.
Hi all, I am comparing the tokenization of the sentence "Hello-world" with other NLP libraries. I am just trying to get to know more about CherubNLP and the approach it follows. Is there any parameter that would make CherubNLP emit 3 tokens, like Google and OpenNLP's EnglishRuleBasedTokenizer?

CherubNLP: I get back a single token, "Hello-world".

OpenNLP: I am using the class OpenNLP.Tools.Tokenize.EnglishRuleBasedTokenizer, and this gave me 3 tokens.

Google NLP: https://cloud.google.com/natural-language/ gives me 3 tokens.

nltk: WordPunctTokenizer.
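For reference, nltk's WordPunctTokenizer is documented as a RegexpTokenizer built on the pattern `\w+|[^\w\s]+`, so its behavior on "Hello-world" can be reproduced with Python's `re` module alone:

```python
import re

# nltk's WordPunctTokenizer uses the pattern r"\w+|[^\w\s]+":
# runs of word characters, or runs of non-space punctuation.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

def word_punct_tokenize(text):
    return WORD_PUNCT.findall(text)

print(word_punct_tokenize("Hello-world"))
# ['Hello', '-', 'world']

print(word_punct_tokenize("this will cost 50,000"))
# ['this', 'will', 'cost', '50', ',', '000'] -- note it also splits the number
```

So WordPunctTokenizer does emit 3 tokens for "Hello-world", but, like CherubNLP's RegexTokenizer above, it splits comma-grouped numbers apart.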