BraveyJS / Bravey

A simple JavaScript NLP-like library to help you creating your own bot.
https://braveyjs.github.io/
MIT License

Improve the documentation on how to add new languages #8

Open thiagodp opened 7 years ago

thiagodp commented 7 years ago

It would be great to have more information about the functions inside en.js and it.js, including their purposes.

BTW, I'm interested in creating a version for Brazilian Portuguese (pt-br.js).

thiagodp commented 7 years ago

Hi @BraveyJS, @vidiemme-brainy, and @serafinomb,

Maybe an interesting possibility is to use a port of the Snowball stemmers, jssnowball, specifically the one implemented in snowball.babel.js.

It currently supports the following languages (according to an example here):

I observed that the library can be used like this:

function stem( lang, word ) {
    var stemmer = snowballFactory.newStemmer( lang );
    return stemmer.stem( word );
}
console.log( stem( 'portuguese', 'bocado' ) ); // prints 'boc'

What do you think?

BraveyJS commented 7 years ago

We already used Snowball stemmers for the first two languages, so we can do the same for the others as well. Stemmers only improve intent recognition - entity recognizers are very important in many chatbot scenarios and should be written at some point - but this should work as an initial stub anyway.

You can start putting together the Brazilian Portuguese support using src/languages/en.js as an example, with something like this:

/**
 * Brazilian Portuguese language functions.
 * @namespace
 */
Bravey.Language.PT_BR = {};

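/**
 * Creates a Brazilian Portuguese word stemmer (i.e. the stemmed version of "bocado" and its variants is always "boc").
 * @constructor
 */
Bravey.Language.PT_BR.Stemmer = function(word) {
  // Stub: returns the word unchanged until the real stemming logic is plugged in.
  var stemmedWord = word;
  return stemmedWord;
};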

You can either extract the stemming code from snowball.babel.js and nest it in your stemmer, or include snowball.babel.js as a dependency and call it. In your own project you can choose either way, but I suggest the first one if the language is to be included in the Bravey core. That's the approach we used for Italian and English, since it lets you make your own Bravey build with only the languages you want to support in your project, which reduces JS size and memory usage.
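For illustration, here is a minimal sketch of the first option, with the stemming logic nested inside the language module so that no external file is needed. The `Bravey` object declared here is only a stand-in so the snippet runs on its own, and the suffix list is a toy assumption, not the Snowball Portuguese algorithm.

```javascript
// Stand-in for the real Bravey global (assumption, so the sketch is self-contained).
var Bravey = { Language: {} };
Bravey.Language.PT_BR = {};

// All stemming logic lives inside this closure: nothing else has to be loaded.
Bravey.Language.PT_BR.Stemmer = (function() {
  // Toy suffix list for illustration only; NOT the Snowball rules.
  var SUFFIXES = ["mente", "ados", "adas", "ado", "ada", "os", "as", "o", "a"];
  var MIN_STEM_LENGTH = 3;

  return function(word) {
    for (var i = 0; i < SUFFIXES.length; i++) {
      var suffix = SUFFIXES[i];
      var stemLength = word.length - suffix.length;
      // Strip the first listed suffix that matches and leaves a long enough stem.
      if (stemLength >= MIN_STEM_LENGTH && word.slice(stemLength) === suffix) {
        return word.slice(0, stemLength);
      }
    }
    return word; // no suffix matched: return the word unchanged
  };
})();

console.log(Bravey.Language.PT_BR.Stemmer("bocado"));
```

With the second option the closure body would instead delegate to the external Snowball file, which keeps the language module short but forces every build to ship that dependency.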

You can start by creating your file src/languages/pt-br.js, including it in the test file src/unit.html like the other languages, and writing unit tests that use your stemmer both standalone and together with the NLP objects and the basic entity recognizers (the ones you can find in src/entityrecognizers). I'll update this issue and the wiki over time with the information you need, in order to help others.

Since this is the first language we are adding after the initial release, I suggest you work on a single language first and then, if you want or need to, add the others gradually. Obviously, feel free to do the same with the other languages whenever you want :)

thiagodp commented 7 years ago

Hi @BraveyJS,

Unfortunately, the code from snowball.babel.js is a bit cryptic. I think it would be much easier if Bravey provided a stemmers/Stemmer.js with a Stemmer object containing a method like stem( lang, word ). That way, each language namespace could just define its own stemmer:

Bravey.Language.PT.Stemmer = (function() {
  return function(word) {
    return Bravey.Stemmer.stem( 'portuguese', word );
  }
})();

I tried to make a pt.js (attached, untested) modeled on en.js and it.js. However, as I said before, it would be nice if new developers only had to worry about defining the EntityRecognizers. Don't you think?
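To make the idea concrete, here is a rough sketch of what such a shared Stemmer object could look like. Everything in it is hypothetical: `Bravey.Stemmer`, its `register` method, and the placeholder Portuguese rule do not exist in Bravey, and the `Bravey` object is a stand-in so the snippet runs on its own.

```javascript
// Stand-in for the real Bravey global (assumption).
var Bravey = { Language: {} };

// Hypothetical shared stemmer registry; not part of Bravey.
Bravey.Stemmer = (function() {
  var stemmers = Object.create(null); // language name -> stemming function

  return {
    // Registers a stemming function under a language name.
    register: function(lang, stemFn) {
      stemmers[lang] = stemFn;
    },
    // Stems a word, falling back to the unchanged word for unknown languages.
    stem: function(lang, word) {
      var stemFn = stemmers[lang];
      return stemFn ? stemFn(word) : word;
    }
  };
})();

// Placeholder rule (drops a trailing "s"); a real build would plug in Snowball here.
Bravey.Stemmer.register("portuguese", function(word) {
  return word.replace(/s$/, "");
});

// Each language namespace then only delegates to the shared object.
Bravey.Language.PT = {};
Bravey.Language.PT.Stemmer = function(word) {
  return Bravey.Stemmer.stem("portuguese", word);
};
```

The trade-off is that every build would have to carry the shared object, plus whichever stemmers are registered in it.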

BraveyJS commented 7 years ago

I agree with you on leaving only the EntityRecognizers to developers but, as you can see, a little effort is needed to split the various parts into their own language packages.

I know that Snowball is cryptic and that it would be easier to include in Bravey a Stemmer object that works like Snowball, but we'd like to keep the same design and decisions as for the other languages, and keep the optimization we've planned and will use in our projects.

Anyway, you can still work on your language support module, leaving the stemmer as a stub that returns the word argument as-is, and complete the entity recognizers, which are very valuable and can be tested well by someone who knows the language, writing the unit tests as I suggested.

We will try to add the stemmer from Snowball to your language and the others ASAP.

Just a closing note about the file you attached: consider that sentences are cleaned before being processed by entity recognizers, so you can skip language-specific accents in regular expressions. That can help improve entity recognition for people who don't use accents while writing on mobile, or for non-native speakers.
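As a rough illustration of why accents can be left out of the regular expressions, here is one common way to strip them in JavaScript. This is only a sketch of the general technique, not Bravey's actual cleaning code, and `removeAccents` is a hypothetical helper name.

```javascript
// Hypothetical helper, not part of Bravey: strips diacritics from a string.
function removeAccents(text) {
  // NFD splits each accented letter into a base letter plus combining marks;
  // the regex then drops the combining marks (U+0300 to U+036F).
  return text.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

// After cleaning, an accent-free pattern matches both spellings.
var pattern = /\bamanha\b/;
console.log(pattern.test(removeAccents("até amanhã")));
console.log(pattern.test(removeAccents("ate amanha")));
```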

thiagodp commented 7 years ago

I understand your concerns about the js size and the memory use, although they can make Bravey less friendly for developers to include new languages.

Do you know of a port of Snowball that is less cryptic than snowball.babel.js? I would like to include a Portuguese stemmer inside pt.js, as you suggested. I'll also try to create some tests for the EntityRecognizers.

I think the sentences could have their accents removed but, as I suggested in Issue #6, it's better not to transform them to lowercase, because case-sensitive rules are needed, such as those for extracting people's names, place names, and the like.
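A small sketch of what I mean: if accents are removed but case is kept, a case-sensitive rule can still find names. Both function names below are hypothetical, and the regular expression is a deliberately naive assumption (two or more consecutive capitalized words), not a real name recognizer.

```javascript
// Hypothetical helper: removes accents while preserving upper/lower case.
function removeAccentsKeepCase(text) {
  return text.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

// Naive rule: two or more consecutive capitalized words count as a name.
function findCapitalizedNames(text) {
  var namePattern = /\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b/g;
  return text.match(namePattern) || [];
}

var cleaned = removeAccentsKeepCase("Ontem falei com João Silva em Lisboa");

// Case preserved: the rule can still see the name.
console.log(findCapitalizedNames(cleaned));

// After lowercasing, the same rule has nothing left to match.
console.log(findCapitalizedNames(cleaned.toLowerCase()));
```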

thiagodp commented 7 years ago

Good source for stemmers here: https://github.com/snowballstem/snowball-website/tree/master/js

thiagodp commented 7 years ago

Portuguese version done. Later on I can contribute to improve the docs on how to add a new language, if you want.

BraveyJS commented 7 years ago

Thank you for your contribution! Along with your work, I've just added the Portuguese sample you provided to the documentation and, with other samples, to the unit tests. Feel free to improve the docs (I've seen that you worked in a Microsoft environment, judging by the new batch builder files ;) ) and, if you want, you can port the two multilingual sample chatbots to Portuguese too. You can find the localized files in samples/browser/chatterbox/data/medbot.en.js and samples/browser/chatterbox/data/prices.en.js.

thiagodp commented 7 years ago

Here they are ;)

BraveyJS commented 6 years ago

Added! Thanks! Feel free to test and tune up the Portuguese examples.

thiagodp commented 6 years ago

Okay!