Tokenizing "Hello-world"

SciSharp / CherubNLP

Natural Language Processing in .NET Core

Apache License 2.0


Tokenizing "Hello-world" #2

Open sdg002 opened 5 years ago

sdg002 commented 5 years ago

Hi all, I am comparing how CherubNLP tokenizes the sentence "Hello-world" with the following NLP libraries:

OpenNLP
Google Natural Language (Cloud)
nltk(default)
nltk(WordPunctTokenizer)

I am just trying to learn more about CherubNLP and the approach it follows. Is there a parameter that would make CherubNLP emit 3 tokens, like Google and OpenNLP's EnglishRuleBasedTokenizer do?

CherubNLP

I get back a single token: Hello-world

OpenNLP

I am using the class OpenNLP.Tools.Tokenize.EnglishRuleBasedTokenizer, which gave me 3 tokens:

Hello
-
world

Google NLP

Google (https://cloud.google.com/natural-language/) gives me 3 tokens.


nltk

nltk.word_tokenize("Hello-world")
['Hello-world']

nltk WordPunctTokenizer

nltk.tokenize.WordPunctTokenizer().tokenize("Hello-world")
['Hello', '-', 'world']
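The difference between the two nltk results comes down to the splitting rule. WordPunctTokenizer is a thin wrapper around the regular expression `\w+|[^\w\s]+`, so any punctuation run, including a hyphen, becomes its own token. The sketch below reproduces that rule with only the standard library; the function name `word_punct_tokenize` is made up for illustration and is not part of nltk or CherubNLP.

```python
import re

# The pattern nltk's WordPunctTokenizer is built on: a run of word
# characters, or a run of characters that are neither word nor whitespace.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

def word_punct_tokenize(text):
    """Split text into alphanumeric runs and punctuation runs."""
    return WORD_PUNCT.findall(text)

print(word_punct_tokenize("Hello-world"))  # ['Hello', '-', 'world']
```

Because the hyphen is neither a word character nor whitespace, it falls into the second alternative and is emitted separately.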

Oceania2018 commented 5 years ago

Which tokenizer are you using, RegexTokenizer or TreebankTokenizer? See https://github.com/SciSharp/CherubNLP/tree/master/CherubNLP/Tokenize

sdg002 commented 5 years ago

I tried TreebankTokenizer. RegexTokenizer throws an ArgumentNullException, so I guess I am not using it the right way.

Oceania2018 commented 5 years ago

Can you run this unit test? https://github.com/SciSharp/CherubNLP/tree/master/CherubNLP.UnitTest/Tokenize

sdg002 commented 5 years ago

The RegexTokenizer was able to split "hello-world".

Unfortunately, it also split "50,000" in the sentence "this will cost 50,000" into "50" and "000".

Nevertheless, your efforts are commendable. I think I am asking too much at this moment.

Oceania2018 commented 5 years ago

Handling numbers that contain commas can be added easily. You are welcome to implement it and open a PR.
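One way to keep comma-grouped numbers together while still splitting on hyphens is to put a number alternative in front of the generic word and punctuation alternatives. The following is only a sketch of that idea in Python with the standard `re` module; the pattern and the name `tokenize` are assumptions for illustration and are not CherubNLP's RegexTokenizer or its actual pattern.

```python
import re

# Hypothetical pattern (not taken from CherubNLP). Alternatives are tried
# left to right, so the number rule wins before the generic rules apply.
TOKEN = re.compile(
    r"\d+(?:,\d{3})*(?:\.\d+)?"  # numbers such as 50,000 or 1,234.5
    r"|\w+"                      # runs of word characters
    r"|[^\w\s]+"                 # runs of punctuation
)

def tokenize(text):
    """Return tokens, keeping comma-grouped numbers in one piece."""
    return TOKEN.findall(text)

print(tokenize("this will cost 50,000"))  # ['this', 'will', 'cost', '50,000']
print(tokenize("Hello-world"))            # ['Hello', '-', 'world']
```

The number rule is deliberately simple: a token that starts with digits and continues with letters, such as "123abc", would be split after the digits, so a real implementation would need to decide how to treat such cases.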

13653415686 commented 4 years ago

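This project is really useful, but there is very little documentation, and my English is not good. How can I learn about it in more detail? I have it running successfully; I just don't know how to get the results I want.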

Oceania2018 commented 4 years ago

Please refer to the unit tests.

13653415686 commented 4 years ago

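"Please refer to the unit tests."

Thank you. Could you share your contact details? I have run all the methods in the unit tests and most of them work, but I don't know what functionality each one actually implements. My English is not good, so I can't really guess either. Also, I was not able to download the file wordvec_enu.bin. I have looked at a lot of NLP code, and yours is the most powerful, the most complete, and the best fit for me. I am eager to understand all of it, but without documentation I can't figure it out in a short time.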