cadmiumcr / cadmium

Natural Language Processing (NLP) library for Crystal
https://cadmiumcr.com
MIT License

Add language detector #23

Closed: rmarronnier closed this issue 5 years ago

rmarronnier commented 5 years ago

I'm working on a Crystal port of franc, and once it's working I'd like to merge it into Cadmium if you're OK with that.

watzon commented 5 years ago

That would be an awesome addition. Just keep in mind that we want it to be possible to configure whatever data set is responsible for determining the most likely language.

rmarronnier commented 5 years ago

What we could do is have a Cadmium::Config module where the DATA_PATH constants are set and can be overridden by the developer.
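A rough illustration of the overridable-constant idea, written in Python rather than Crystal for brevity. The `Config` class, the default `DATA_PATH` value, and the `stopwords_file` helper are all hypothetical; none of them is part of Cadmium.

```python
from pathlib import Path


class Config:
    """Hypothetical stand-in for the proposed Cadmium::Config module."""

    # Default location of the bundled data sets; a developer may reassign it.
    DATA_PATH = Path("data")


def stopwords_file(language: str) -> Path:
    # Resolve the path at call time so that a later override is picked up.
    return Config.DATA_PATH / "stopwords" / f"{language}.txt"


# Default lookup, then the same lookup after the developer overrides the path.
default_path = stopwords_file("en")
Config.DATA_PATH = Path("/opt/custom_data")
custom_path = stopwords_file("en")
```

The point of the sketch is only that every lookup goes through one shared setting, so a single assignment redirects all data loading.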

watzon commented 5 years ago

That is true, go ahead and run with this. I'm excited to see what you can come up with.

rmarronnier commented 5 years ago

After trying to implement this Cadmium::Config module, I realized it's a bad idea IMO:

In this way, developers could use whatever data (stopwords, sentiments, n-grams, etc.) they want.

To get back to the language detector topic, the Franc algorithm is based on comparing the n-grams extracted from each language version of the UDHR (Universal Declaration of Human Rights). I honestly can't see why a developer would want to use a different dataset than the one gathered by the parent project.
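A minimal sketch of this kind of n-gram comparison, in Python rather than Crystal. It ranks character trigrams by frequency and scores each language with an "out-of-place" rank distance in the style of Cavnar and Trenkle; franc's exact scoring may differ. The function names, the `top` and `penalty` values, and the tiny reference texts (only the first sentence of UDHR Article 1, far less than a real profile would use) are assumptions made for illustration.

```python
from collections import Counter


def trigram_ranks(text: str, top: int = 300) -> dict[str, int]:
    """Map the most frequent character trigrams to their rank (0 = most frequent)."""
    padded = f" {' '.join(text.lower().split())} "
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    # Sort by descending count, then alphabetically so ties are deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {gram: rank for rank, (gram, _) in enumerate(ordered[:top])}


def distance(sample: dict[str, int], profile: dict[str, int], penalty: int = 300) -> int:
    """Sum rank differences; a trigram missing from the profile costs a fixed penalty."""
    return sum(
        abs(rank - profile[gram]) if gram in profile else penalty
        for gram, rank in sample.items()
    )


def detect(text: str, profiles: dict[str, dict[str, int]]) -> str:
    """Return the language whose profile is closest to the input text."""
    sample = trigram_ranks(text)
    return min(profiles, key=lambda lang: distance(sample, profiles[lang]))


# Illustrative reference texts: first sentence of UDHR Article 1 only.
REFERENCE = {
    "eng": "All human beings are born free and equal in dignity and rights.",
    "fra": "Tous les êtres humains naissent libres et égaux en dignité et en droits.",
}
profiles = {lang: trigram_ranks(text) for lang, text in REFERENCE.items()}

guess_en = detect("human beings are free", profiles)
guess_fr = detect("les êtres humains sont libres", profiles)
```

With profiles this small the result is only reliable for inputs that share wording with the reference sentences; a usable detector needs profiles built from the full texts.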

watzon commented 5 years ago

Sounds good. We can forget about the config stuff for now and kick ideas around. I'm actually going to open up a gitter channel so anyone that wishes to can chat about Cadmium. I've just finished with a large project for work and now get to go back to working on my GLOVE implementation for Cadmium.

rmarronnier commented 5 years ago

> I'm actually going to open up a gitter channel so anyone that wishes to can chat about Cadmium.

Great idea!

> now get to go back to working on my GLOVE implementation for Cadmium.

I can't wait to see this! It will open up so many possibilities for Cadmium :smiley: Now, I'm back to my Unicode soup for franca :laughing:

rmarronnier commented 5 years ago

Added language_detector as separate shard