Tokenizer Internationalization - Spanish

clusterfudge commented 8 years ago

We should test whether the EnglishTokenizer implementation is sufficient for Spanish and, if not, add an additional tokenizer. EnglishTokenizer is based on the Porter stemmer.

ghost commented 8 years ago

What is needed in order to test it? I am not familiar with adapt's design, and I am reading the README.md at the moment. Should I translate the strings, or is something else needed?

seanfitzgeraldsc commented 8 years ago

First, you'll need to validate whether or not the EnglishTokenizer is sufficient. I would do this by creating Spanish versions of the examples and playing with them. Specifically, the tokenizer is punctuation-aware and splits an utterance (sentence or phrase) into individual tokens (usually words).
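
As a rough illustration of what to look for, here is a minimal standard-library sketch. The `simple_tokenize` helper is a hypothetical stand-in, not adapt's EnglishTokenizer; it only shows what punctuation-aware splitting means and which Spanish-specific characters (inverted punctuation, accented letters) are worth including in test utterances.

```python
import re
import unicodedata

# Hypothetical stand-in for a punctuation-aware tokenizer; this is NOT
# adapt's EnglishTokenizer, only an illustration of the behaviour to check.
# \w+ matches runs of letters/digits (Unicode-aware in Python 3, so
# precomposed accented letters stay inside the word); [^\w\s] matches a
# single punctuation character such as "?" or the inverted "¿".
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(utterance):
    """Split an utterance into word tokens and single punctuation tokens."""
    # Normalize to NFC so "e" + combining accent becomes one precomposed letter.
    normalized = unicodedata.normalize("NFC", utterance)
    return TOKEN_PATTERN.findall(normalized)


if __name__ == "__main__":
    for utterance in ["¿qué tiempo hace en seattle?",
                      "pon algo de música, por favor"]:
        print(simple_tokenize(utterance))
```

Running the same Spanish utterances through the real tokenizer and comparing the tokens by eye against a split like this one would be one way to spot words that get broken at accented letters or punctuation that stays glued to a word.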

If the English tokenizer does not work well, you'll need to look for an equivalent of the Porter Stemmer algorithm for Spanish and implement it. The latter can be picked up by someone else if that's beyond your scope. Validating whether or not the existing tokenizer is sufficient is a great first step.

Thanks!

ghost commented 8 years ago

I see. I am willing to do this, but I can't at this very moment; I will do some experiments later. Expect many questions, because it's very likely I'll get lost!

Cheers!

mcicolella commented 8 years ago

Hi, if you need help reimplementing the Porter Stemmer algorithm for Spanish or other languages, take a look at https://github.com/OleanderSoftware/OleanderStemmingLibrary. It's a very good library.

adocampo commented 8 years ago

I do not know if I did what I was supposed to do, but I've just modified the source code of multi_intent_parser.py to "understand" Spanish words: http://pastebin.com/bEJqCKuj

You can try these sentences: "pon algo de música de los clash", "quiero escuchar algo de música de los clash", "qué tiempo hace en seattle". It seems to return JSON.

Is that what's needed?

clusterfudge commented 8 years ago

So, this is definitely some helpful work! I think we'd want to have samples per language, maybe separated into folders. To really verify that this works for Spanish, we'd need the unit tests translated to Spanish and, even better, localization work done on the unit tests so that the test code itself stays the same but loads different data files for different languages. That would give me high confidence that the language itself works with the tokenizer, but that may be an unrealistic goal. Can you try translating some of the engine tests?
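
One way to picture that localization, as a sketch only: the test logic is written once and the per-language data is looked up by language code. Everything here is hypothetical (the `TEST_DATA` table, the `find_keywords` helper, and the sample vocabulary); in a real setup the table would be loaded from per-language data files and the helper replaced by the adapt engine under test.

```python
import unittest

# Hypothetical per-language fixtures; in practice these would live in
# separate data files (for example, one folder per language).
TEST_DATA = {
    "en": {
        "utterance": "what is the weather in seattle",
        "vocabulary": {"weather": "WeatherKeyword", "seattle": "Location"},
    },
    "es": {
        "utterance": "qué tiempo hace en seattle",
        "vocabulary": {"tiempo": "WeatherKeyword", "seattle": "Location"},
    },
}


def find_keywords(utterance, vocabulary):
    """Toy stand-in for the engine: collect tags of words found in the vocabulary."""
    return {vocabulary[word] for word in utterance.split() if word in vocabulary}


class LocalizedKeywordTest(unittest.TestCase):
    def test_weather_utterance_in_every_language(self):
        # Same assertions for each language; only the data changes.
        for lang, data in TEST_DATA.items():
            with self.subTest(lang=lang):
                tags = find_keywords(data["utterance"], data["vocabulary"])
                self.assertEqual(tags, {"WeatherKeyword", "Location"})


if __name__ == "__main__":
    unittest.main()
```

With this shape, adding another language means adding one more entry to the data, not another copy of the test.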

clusterfudge commented 8 years ago

Thanks for contributing!

adocampo commented 8 years ago

> Can you try translating some of the engine tests?

Of course I can. Could you please point me to the engine tests? I only saw this one, https://github.com/MycroftAI/adapt/blob/master/test/IntentEngineTest.py, and I doubt I can do anything with it...

clusterfudge commented 8 years ago

That would be the test I was referencing. Swapping out the vocabulary/utterances for Spanish equivalents would be acceptable to me, but completely unverifiable (as I only took about 2 years of Spanish, 20 years ago).

adocampo commented 8 years ago

OK, I only translated the utterance sentence (line 36) and the two expressions "tree" (line 34) and "house" (line 43): http://pastebin.com/PkZJ4Gmq

I don't know if this is what you need. The utterance could also be translated into Spanish differently depending on whether it is imperative (as I've translated it), infinitive, or another form...

Hope it helps!

drawveloper commented 8 years ago

Should I open a new issue for Portuguese?

MycroftAI / adapt

Tokenizer Internationalization - Spanish #5