Add language detector

rmarronnier commented 5 years ago

I'm working on a Crystal port of franc, and once it's working I'd like to merge it into Cadmium if you're OK with this.

watzon commented 5 years ago

That would be an awesome addition. Just keep in mind that we want it to be possible to configure whatever data set is responsible for determining the most likely language.

rmarronnier commented 5 years ago

What we could do is have a Cadmium::Config module where the DATA_PATH constants are set and can be overridden by the developer.

watzon commented 5 years ago

That is true, go ahead and run with this. I'm excited to see what you can come up with.

rmarronnier commented 5 years ago

After trying to implement this Cadmium::Config module, I realized it's a bad idea, IMO:

DATA_PATH constants can't be declared twice, so their values can't be overridden as I mistakenly thought. And we can't use the read_file macro without constants.
We could leave those constants without values by default, but that would force all users to put boilerplate in their code, i.e. STOPWORDS_DATA_PATH = "data/path". The sane solution would be to make Cadmium methods accept a custom data source (as I did with the Cadmium::Tfidf one) while keeping the path to the Cadmium-provided data as the default value.

This way, developers could use whatever data (stopwords, sentiments, n-grams, etc.) they want.
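
A minimal sketch of that "custom data source with a default" idea, in Crystal. Everything here is hypothetical and only illustrates the pattern: the module DataSourceSketch, the constant DEFAULT_STOPWORDS and the method remove_stopwords are not part of Cadmium, and the bundled data is an inline array rather than a file loaded with the read_file macro.

```crystal
module DataSourceSketch
  # Hypothetical stand-in for data shipped with the library. A real shard
  # would more likely load this from a bundled file at compile time.
  DEFAULT_STOPWORDS = ["the", "a", "of", "and"]

  # The caller may pass their own stopword list; when they don't,
  # the bundled default is used. No constant needs to be redefined.
  def self.remove_stopwords(tokens : Array(String),
                            stopwords : Array(String) = DEFAULT_STOPWORDS) : Array(String)
    tokens.reject { |token| stopwords.includes?(token.downcase) }
  end
end

tokens = ["The", "port", "of", "franc"]

# Uses the bundled default list.
puts DataSourceSketch.remove_stopwords(tokens).inspect

# Uses a caller-supplied list instead.
puts DataSourceSketch.remove_stopwords(tokens, ["port"]).inspect
```

With a default argument the common case needs no setup, and swapping the data set is a per-call decision instead of a global constant.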

To get back to the language detector topic, the Franc algorithm is based on comparing the n-grams extracted from the input text with those extracted from each language version of the UDHR (Universal Declaration of Human Rights). I honestly can't see why a developer would want to use a different dataset than the one gathered by the parent project.
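
For readers unfamiliar with this family of detectors, here is a rough Crystal sketch of trigram rank comparison. It is a toy under stated assumptions, not franc's actual code: the names trigram_ranks, distance, detect and MAX_PENALTY are made up for illustration, the penalty value of 300 is an assumption, only ASCII letters are kept, and the two profiles are built from one short sentence each rather than from the full UDHR texts.

```crystal
# Assumed penalty for a trigram that is missing from a language profile.
MAX_PENALTY = 300

# Map each trigram of the text to its frequency rank (0 = most frequent).
def trigram_ranks(text : String) : Hash(String, Int32)
  # Toy normalisation: lowercase, keep ASCII letters only, pad with spaces.
  cleaned = " " + text.downcase.gsub(/[^a-z]+/, " ").strip + " "
  counts = Hash(String, Int32).new(0)
  (0..cleaned.size - 3).each do |i|
    counts[cleaned[i, 3]] += 1
  end
  # Sort by descending count, then alphabetically so ranks are deterministic.
  sorted = counts.to_a.sort_by { |pair| {-pair[1], pair[0]} }
  ranks = Hash(String, Int32).new
  sorted.each_with_index { |pair, index| ranks[pair[0]] = index }
  ranks
end

# Sum of rank differences; unknown trigrams cost MAX_PENALTY.
def distance(sample : Hash(String, Int32), profile : Hash(String, Int32)) : Int32
  total = 0
  sample.each do |gram, rank|
    other = profile[gram]?
    total += other ? (rank - other).abs : MAX_PENALTY
  end
  total
end

# Return the key of the profile closest to the text.
def detect(text : String, profiles : Hash(String, Hash(String, Int32))) : String
  sample = trigram_ranks(text)
  profiles.min_by { |_, profile| distance(sample, profile) }[0]
end

# Toy profiles built from a single sentence per language.
profiles = {
  "eng" => trigram_ranks("all human beings are born free and equal in dignity and rights"),
  "fra" => trigram_ranks("tous les êtres humains naissent libres et égaux en dignité et en droits"),
}

puts detect("the rights of human beings", profiles) # prints "eng" for these toy profiles
```

Because the profiles are passed in as an argument, the same sketch would also work with a data set other than the UDHR-derived one.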

watzon commented 5 years ago

Sounds good. We can forget about the config stuff for now and kick ideas around. I'm actually going to open up a gitter channel so anyone that wishes to can chat about Cadmium. I've just finished with a large project for work and now get to go back to working on my GLOVE implementation for Cadmium.

rmarronnier commented 5 years ago

> I'm actually going to open up a gitter channel so anyone that wishes to can chat about Cadmium.

Great idea!

> now get to go back to working on my GLOVE implementation for Cadmium.

I can't wait to see this! It will open up so many possibilities for Cadmium :smiley: Now, I'm back to my Unicode soup for franca :laughing:

rmarronnier commented 5 years ago

Added language_detector as a separate shard.

cadmiumcr / cadmium

Add language detector #23