Closed rmarronnier closed 5 years ago
That would be an awesome addition. Just keep in mind that we want it to be possible to configure whatever data set is responsible for determining the most likely language.
What we could do is have a Cadmium::Config module where the DATA_PATH constants are set and can be overridden by the developer.
That is true, go ahead and run with this. I'm excited to see what you can come up with.
After trying to implement this Cadmium::Config module, I realized it's a bad idea IMO: DATA_PATH constants can't be declared twice, so their values can't be overridden as I mistakenly thought. And we can't use the read_file macro without constants.
[...] Cadmium::Tfidf one while keeping as default value the path to the Cadmium provided data. In this way, developers could use whatever data (stopwords, sentiments, n-grams, etc.) they want.
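To make the "path as a parameter, with the bundled data as default" idea concrete, here is a rough, language-neutral sketch. It is written in Python rather than Crystal, and StopwordFilter, DEFAULT_STOPWORDS_PATH and the file layout are hypothetical names for illustration only, not Cadmium's API. It also loads the file at run time, so it does not address the compile-time read_file constraint mentioned above.

```python
from pathlib import Path

# Hypothetical location of the data bundled with the library.
DEFAULT_STOPWORDS_PATH = Path("data") / "stopwords_en.txt"


class StopwordFilter:
    """Removes stopwords loaded from a one-word-per-line text file."""

    def __init__(self, data_path=DEFAULT_STOPWORDS_PATH):
        # The bundled file is only the default; callers may pass their own path.
        self.data_path = Path(data_path)
        lines = self.data_path.read_text(encoding="utf-8").splitlines()
        self.stopwords = {line.strip().lower() for line in lines if line.strip()}

    def filter(self, tokens):
        # Keep only the tokens that are not in the loaded stopword set.
        return [t for t in tokens if t.lower() not in self.stopwords]
```

A developer who wants a different list would then call something like StopwordFilter("my_stopwords.txt"), while everyone else gets the bundled data without any extra configuration.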
To get back to the language detector topic, the Franc algorithm is based on comparing the extracted n-grams for each language version of the UDHR. I honestly can't see why a developer would want to use a different dataset than the one gathered by the parent project.
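For anyone unfamiliar with this kind of n-gram comparison, here is a minimal sketch of the general idea, in Python for neutrality. It is not franc's actual code: the function names, the rank-distance scoring and the penalty of 300 are assumptions made for illustration, and the two reference texts are just the first sentence of UDHR Article 1 rather than the full translations.

```python
from collections import Counter


def trigram_profile(text, top_n=300):
    """Return the text's most frequent character trigrams, most common first."""
    cleaned = " " + " ".join(text.lower().split()) + " "
    counts = Counter(cleaned[i:i + 3] for i in range(len(cleaned) - 2))
    # Sort by descending count, then alphabetically, so the ranking is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_n]]


def out_of_place_distance(sample, reference, max_penalty=300):
    """Sum of rank differences; trigrams missing from the reference cost max_penalty."""
    ref_rank = {gram: rank for rank, gram in enumerate(reference)}
    distance = 0
    for rank, gram in enumerate(sample):
        if gram in ref_rank:
            distance += abs(rank - ref_rank[gram])
        else:
            distance += max_penalty
    return distance


def detect_language(text, profiles):
    """Return the language whose reference profile is closest to the text's profile."""
    sample = trigram_profile(text)
    return min(profiles, key=lambda lang: out_of_place_distance(sample, profiles[lang]))


# Tiny reference texts for illustration only (first sentence of UDHR Article 1).
REFERENCE_TEXTS = {
    "eng": "all human beings are born free and equal in dignity and rights",
    "fra": "tous les êtres humains naissent libres et égaux en dignité et en droits",
}
PROFILES = {lang: trigram_profile(text) for lang, text in REFERENCE_TEXTS.items()}
```

Calling detect_language("human beings are free", PROFILES) picks the profile with the smallest distance. With reference texts this small the result is only meaningful for inputs that overlap them heavily, which is part of why a large, consistent dataset like the UDHR translations matters.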
Sounds good. We can forget about the config stuff for now and kick ideas around. I'm actually going to open up a gitter channel so anyone that wishes to can chat about Cadmium. I've just finished with a large project for work and now get to go back to working on my GLOVE implementation for Cadmium.
> I'm actually going to open up a gitter channel so anyone that wishes to can chat about Cadmium.
Great idea!
> now get to go back to working on my GLOVE implementation for Cadmium.
I can't wait to see this! It will open up so many possibilities for Cadmium :smiley: Now, I'm back to my Unicode soup for franca :laughing:
Added language_detector as separate shard
I'm working on a crystal port of franc and once it's working I'd like to merge it in Cadmium if you're ok with this.