SciSharp / CherubNLP

Natural Language Processing in .NET Core
Apache License 2.0

Tokenizing "Hello-world" #2

Open sdg002 opened 5 years ago

sdg002 commented 5 years ago

Hi all, I am comparing how CherubNLP tokenizes the string "Hello-world" with the output of other NLP libraries:

  1. OpenNLP
  2. Google Natural Language (Cloud)
  3. nltk(default)
  4. nltk(WordPunctTokenizer)

I am just trying to get to know more about CherubNLP and the approach it follows. Is there any parameter that would make CherubNLP emit 3 tokens, like Google and OpenNLP's EnglishRuleBasedTokenizer?

CherubNLP

I get back a single token: Hello-world

OpenNLP

I am using the class OpenNLP.Tools.Tokenize.EnglishRuleBasedTokenizer, and it gave me 3 tokens.

Google NLP

https://cloud.google.com/natural-language/ Google gives me 3 tokens.

nltk

nltk.word_tokenize("Hello-world")
['Hello-world']

nltk WordPunctTokenizer

nltk.tokenize.WordPunctTokenizer().tokenize("Hello-world")
['Hello', '-', 'world']
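
For reference, the difference between the two nltk results comes down to the splitting rule. WordPunctTokenizer is a regular-expression tokenizer built on the pattern `\w+|[^\w\s]+`, so the same behaviour can be sketched with only Python's standard `re` module. The function name `word_punct_tokenize` below is a hypothetical helper for illustration, not part of nltk or CherubNLP.

```python
import re

# Same pattern that nltk's WordPunctTokenizer is built on:
# a run of word characters, or a run of non-word, non-space characters.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

def word_punct_tokenize(text):
    """Hypothetical helper: split text into alphanumeric and punctuation runs."""
    return WORD_PUNCT.findall(text)

print(word_punct_tokenize("Hello-world"))  # ['Hello', '-', 'world']
```

Because every punctuation run becomes its own token, this rule also splits a grouped number such as 50,000 into 50, a comma, and 000.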

Oceania2018 commented 5 years ago

Which tokenizer are you using, RegexTokenizer or TreebankTokenizer? https://github.com/SciSharp/CherubNLP/tree/master/CherubNLP/Tokenize

sdg002 commented 5 years ago

I tried TreebankTokenizer. RegexTokenizer throws an ArgumentNull exception, so I guess I am not using it the right way.

Oceania2018 commented 5 years ago

Can you run this unit test? https://github.com/SciSharp/CherubNLP/tree/master/CherubNLP.UnitTest/Tokenize

sdg002 commented 5 years ago

The RegexTokenizer was able to parse "hello-world".

Unfortunately, it also split "50,000" in the sentence "this will cost 50,000" into "50" and "000".

Nevertheless, your efforts are commendable. I think I am asking too much at this moment.

Oceania2018 commented 5 years ago

Handling numbers that contain commas can be added easily. You are welcome to implement it and open a PR.
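
As a rough illustration of what such a rule could look like, here is a minimal sketch in Python using only the standard `re` module. It is not CherubNLP's RegexTokenizer and does not use its C# API; the pattern and the function name `tokenize_keep_grouped_numbers` are assumptions made for this example. The idea is to try a "digits grouped by commas" alternative before the generic word and punctuation alternatives, so that 50,000 stays whole while Hello-world is still split.

```python
import re

# Alternatives are tried left to right at each position:
# 1. 1-3 digits followed by one or more ",ddd" groups (e.g. 50,000 or 1,234,567),
#    not immediately followed by another digit
# 2. a run of word characters
# 3. a run of punctuation (non-word, non-space characters)
TOKEN = re.compile(r"\d{1,3}(?:,\d{3})+(?!\d)|\w+|[^\w\s]+")

def tokenize_keep_grouped_numbers(text):
    """Hypothetical helper: split on punctuation but keep comma-grouped numbers intact."""
    return TOKEN.findall(text)

print(tokenize_keep_grouped_numbers("this will cost 50,000"))
print(tokenize_keep_grouped_numbers("Hello-world"))
```

The order of the alternatives matters: if the generic `\w+` branch came first, it would consume 50 before the number rule had a chance to match. Decimal fractions and other locale-specific separators are left out of this sketch.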

13653415686 commented 4 years ago

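This project is really useful, but there is very little documentation and my English is not good. How can I learn about it in more detail? I already have it running successfully; I just don't know how to get the result I want.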

Oceania2018 commented 4 years ago

Please refer to the unit tests.

13653415686 commented 4 years ago

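Please refer to the unit tests.

Thank you, boss. Could you share your contact details? I ran all the methods in the unit tests and most of them work, but I don't know what functionality each one actually implements. My English is not good, so I can't really guess either. Also, I was not able to download the file wordvec_enu.bin. I have looked at a lot of NLP code, and yours is the most powerful, the most complete, and the best fit for me. I am eager to understand all of it, but without documentation I can't work it out in a short time.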