NaturalNode / natural

general natural language facilities for node

International Support #92

Open wbashir opened 11 years ago

wbashir commented 11 years ago

Any plans for international support? I'm trying to use the tokenizer to parse Arabic words.

chrisumbel commented 10 years ago

I'm always looking for people to contribute algorithms pertaining to non-English languages. In the fall I hope to really ramp up this effort, but it will involve getting new people involved with the project.

mef commented 10 years ago

I also would need international support and might contribute to this.

My problem is that the tokenizer skips all accented characters, so a quick fix for me is to update the regex it uses.
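For illustration, a minimal sketch of the problem (illustrative regexes only, not natural's actual ones): an ASCII-only character class breaks accented words apart, while a Unicode-aware one keeps them intact.

```js
// ASCII-only pattern: accented letters act as word boundaries.
const asciiOnly = /[a-z0-9]+/gi;
// Unicode-aware pattern: any letter or digit (needs ES2018+ for \p{...}).
const unicodeAware = /[\p{L}\p{N}]+/gu;

const text = 'Où étaient les fenêtres ?';

console.log(text.match(asciiOnly));
// [ 'O', 'taient', 'les', 'fen', 'tres' ]

console.log(text.match(unicodeAware));
// [ 'Où', 'étaient', 'les', 'fenêtres' ]
```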

@chrisumbel, how would you prefer to fix this issue? If you have another implementation in mind, please let me know.

chrisumbel commented 10 years ago

My plan thus far has been to ultimately break the modules up into language folders where applicable, something like lib/stemmers/en, lib/stemmers/jp, lib/stemmers/fr.

Certain classes of algorithms, like string comparison/distance, aren't language-specific, so those would remain as they are.

Everything will still reside in the natural project. Does that make sense, or is that silly?
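Sketched out, that layout might look like this (a hypothetical file tree, just to illustrate the split):

```
lib/
  stemmers/
    en/      <- English stemmers
    fr/      <- French stemmers
    jp/      <- Japanese stemmers
  distance/  <- language-agnostic, stays as-is
```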

JpEncausse commented 10 years ago

Indeed, FR support would be great. It could be a great kick-start for a chatterbot. I'd like to test it in a module for SARAH (http://sarah.encausse.net).

Is there a list of projects using natural?

lfilho commented 10 years ago

I'm also interested in international support. Brazilian Portuguese here... I need the tokenizer not to skip characters like ã, ó, ê, ç and so on.

My knowledge of NLP is very limited (I'm not familiar with all the terminology... I only get as far as "tokenizer", lol), so I would be happy to contribute to this project, but I would need some guidance on how to start. Like, what do I have to touch/modify to get this done?

A very basic tutorial for rookies would be nice. Something like: "A stemmer is a thing that does this, a tokenizer does that, a classifier...."

Count me in to help grow this project.

kkoch986 commented 10 years ago

@lfilho The tokenizer would be a pretty good place to start; a lot of other pieces rely on it.

Take a look here for a basic idea about tokenizers: in a nutshell, the goal is to take some text and produce a list of 'tokens', which in most NLP cases are words.

You can see here that we have a few tokenizers in different languages (although if you look here you'll see they may be under-covered by unit tests), so they might be a helpful reference when creating your tokenizer.

EDIT: I would start with the aggressive tokenizer; it doesn't require much modification since it's not super language-dependent. Also, there are some already built for other languages to give you an idea of the naming conventions we're using.
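For example, a minimal usage sketch (assuming a version of natural that ships the French variant; the interface is the same across languages):

```js
const natural = require('natural');

// Language-specific aggressive tokenizer; same tokenize() interface
// as the English one.
const tokenizer = new natural.AggressiveTokenizerFr();
console.log(tokenizer.tokenize('Où étaient les fenêtres ?'));
// accents preserved, e.g. [ 'Où', 'étaient', 'les', 'fenêtres' ]
```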

Feel free to ask if you have any questions, -Ken

lfilho commented 10 years ago

So there we go: I did the tokenizer. While I'm at it, I don't think the Spanish one is working; it suffers from the same problem with diacritic characters that I mentioned here...

I'll also open a pull request shortly to add jasmine-node as a dev dependency.

deemeetree commented 10 years ago

Hello, do you know how to avoid the tokenizer splitting words in foreign languages, so that fußball stays fußball and does not become fu s ball?
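(For context only, not the fix from #152: one way this can happen is that JavaScript's ASCII-only \w/\W classes treat ß as a non-word character, so a \W+ split cuts the word in two. A sketch:)

```js
// ß falls outside the ASCII \w class, so it acts as a separator.
console.log('fußball'.split(/\W+/));      // [ 'fu', 'ball' ]

// A Unicode-aware pattern keeps the word intact (ES2018+).
console.log('fußball'.match(/\p{L}+/gu)); // [ 'fußball' ]
```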

kkoch986 commented 10 years ago

@deemeetree answered in #152

Hugo-ter-Doest commented 6 years ago

I think for multilingual support you need to separate logic from content. For the natural library this means that there are algorithms and there are configurations. For instance, most tokenizers depend on regular expressions to split a sentence: develop one algorithm for tokenization (maybe more are needed) and provide the expressions per language in a separate content folder (or repo). When you create a tokenizer, you configure it with language-specific content/rules/etc.

Likewise, the Brill POS tagger is already separated into algorithm and transformation rules: in the brill_pos_tagger folder you find a lib folder with the algorithm and a data folder with rules for English and Dutch. Parsers can be done similarly.

This approach avoids creating a myriad of language-specific code files.
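A minimal sketch of that idea (hypothetical names, not natural's actual API): one tokenizer algorithm, with the per-language part reduced to pure data.

```js
// One generic algorithm...
class RegexTokenizer {
  constructor(config) {
    this.pattern = config.pattern; // language-specific matching rule
  }
  tokenize(text) {
    return text.match(this.pattern) || [];
  }
}

// ...and per-language "content": data only, no logic.
const configs = {
  en: { pattern: /[a-z0-9]+/gi },
  fr: { pattern: /[a-z0-9äâàéèëêïîöôùüûœç]+/gi },
  de: { pattern: /[a-z0-9äöüß]+/gi },
};

const frTokenizer = new RegexTokenizer(configs.fr);
console.log(frTokenizer.tokenize('Où étaient les fenêtres ?'));
// -> [ 'Où', 'étaient', 'les', 'fenêtres' ]
```

Adding a language then means adding a config entry, not a new code file.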

Hugo