wbashir opened this issue 11 years ago
I'm always looking for people to contribute algorithms pertaining to non-English languages. In the fall I hope to really ramp up this effort, but it will involve getting new people involved with the project.
I would also need international support and might contribute to this.
My problem is that the tokenizer skips all accented characters, so a quick fix for me is to update the regex it uses.
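To make the regex idea concrete, here is a minimal sketch (not natural's actual code; `tokenize` is an illustrative name) of a token pattern that keeps accented characters instead of dropping them:

```javascript
// Minimal sketch, not natural's actual implementation: match any Unicode
// letter or digit rather than only [a-z0-9], so accented characters survive.
function tokenize(text) {
  // \p{L} = any letter, \p{N} = any digit; the 'u' flag enables \p{...}
  return text.toLowerCase().match(/[\p{L}\p{N}]+/gu) || [];
}

console.log(tokenize('São Paulo é ótima'));
// → [ 'são', 'paulo', 'é', 'ótima' ]
```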
@chrisumbel, how would you rather fix this issue? Perhaps you have another implementation in mind; please let me know.
My plans thus far were to ultimately break the modules up into language folders where applicable. Something like lib/stemmers/en, lib/stemmers/jp, lib/stemmers/fr
Since the language split isn't applicable to certain classes of algorithms, like string comparison/distance, those would remain as is.
Everything will still reside in the natural project. Make sense or is that silly?
Indeed FR support would be great. It could be a great kick start for a chatterbot. I'd like to test it in a module for SARAH (http://sarah.encausse.net)
Is there a list of projects using natural?
I'm also interested in international support. Brazilian Portuguese here... I need the tokenizer not to skip things like ã, ó, ê, ç, and so on...
My knowledge in NLP is very limited (not familiar with all these terminologies... I only go as far as "tokenizer" lol), so I would be happy to contribute to this project, but I would need some guidance on how to start contributing... Like, what do I have to touch / modify to get this done?
A very basic tutorial for rookies would be nice. Like: "A stemmer is a thing that does this, a tokenizer does that, a classifier...."
Count me in to help grow this project
@lfilho The tokenizer would be a pretty good place to start, a lot of other pieces rely on that.
Take a look here for a basic idea about tokenizers; in a nutshell, the goal is to take some text and produce a list of 'tokens', or words in most NLP cases.
You can see here that we have a few tokenizers in different languages (although if you look here you'll see they may be under-covered by unit tests), so they might be a helpful reference when creating your tokenizer.
EDIT: I would start with the aggressive tokenizer; it doesn't require much modification since it's not super language-dependent. Also, there are some already built for other languages to give you an idea of the naming conventions we're using.
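For anyone following along, a hypothetical sketch of what such a language-specific aggressive tokenizer might look like (the class name and character range are illustrative, not natural's actual source):

```javascript
// Hypothetical sketch of a Portuguese aggressive tokenizer, in the spirit of
// the existing per-language tokenizers; the name and range are illustrative.
class AggressiveTokenizerPt {
  tokenize(text) {
    // split on any run of characters outside ASCII and Latin-1 letters
    // (À-ÿ is approximate: it also admits a couple of Latin-1 symbols)
    return text
      .split(/[^a-zA-ZÀ-ÿ]+/)
      .filter((token) => token.length > 0);
  }
}

const tokens = new AggressiveTokenizerPt().tokenize('coração, maçã e pão!');
console.log(tokens); // → [ 'coração', 'maçã', 'e', 'pão' ]
```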
Feel free to ask if you have any questions, -Ken
So there we go. I did the tokenizer. Since I'm here: I don't think the Spanish one is working. It suffers from the same problem I mentioned here with diacritic chars...
I'm also doing a new pull request shortly to add jasmine-node as a dev dependency.
Hello, do you know how to keep the tokenizer from splitting words in foreign languages, so that fußball stays fußball and does not become fu s ball?
@deemeetree answered in #152
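For reference, one way to avoid that split (a sketch, not necessarily what #152 proposes) is a Unicode-aware token pattern, since ß counts as a letter under \p{L}:

```javascript
// Sketch: \p{L} treats ß as a letter, so the word is not split apart
const tokenize = (text) => text.match(/\p{L}+/gu) || [];
console.log(tokenize('fußball macht Spaß'));
// → [ 'fußball', 'macht', 'Spaß' ]
```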
I think for multilingual support you need to separate logic from content. For the natural library this means that there are algorithms and there are configurations. For instance, most tokenizers depend on the use of regular expressions to split a sentence. Develop one (maybe more are needed) algorithm for tokenization and provide expressions per language in a separate content folder (or repo). If you create a tokenizer you configure the tokenizer with language specific content/rules/etc. Likewise, the Brill POS tagger is already separated in algorithm and transformation rules. In the brill_pos_tagger folder you find a lib folder with the algorithm and a data folder with rules for English and Dutch. Parsers can be done similarly. This approach avoids creating a myriad of language specific code files.
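The logic/content separation described above could be sketched like this (a hypothetical shape, not a proposal for natural's actual file layout; the pattern table stands in for the per-language content folder):

```javascript
// Sketch of one generic tokenizer algorithm configured with per-language
// content; the patterns here are illustrative stand-ins for real rule files.
const languagePatterns = {
  en: /[a-zA-Z0-9]+/g,
  de: /[a-zA-ZäöüÄÖÜß]+/g,
  pt: /[a-zA-ZÀ-ÿ]+/g,
};

// the algorithm is written once; the language only selects the content
function makeTokenizer(lang) {
  const pattern = languagePatterns[lang];
  return (text) => text.match(pattern) || [];
}

const tokenizeDe = makeTokenizer('de');
console.log(tokenizeDe('fußball ist toll')); // → [ 'fußball', 'ist', 'toll' ]
```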
Hugo
Any plans for international support? I guess I am trying to use the tokenizer to parse Arabic words.
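For what it's worth, the Unicode-aware regex approach discussed earlier in this thread also handles Arabic script, since \p{L} matches Arabic letters (a sketch, not natural's API):

```javascript
// Sketch: \p{L} matches Arabic letters too, so Arabic text tokenizes cleanly
const tokenize = (text) => text.match(/\p{L}+/gu) || [];
console.log(tokenize('مرحبا بالعالم')); // two tokens: مرحبا and بالعالم
```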