cadmiumcr / cadmium

Natural Language Processing (NLP) library for Crystal
https://cadmiumcr.com
MIT License
205 stars 15 forks source link

I18n stopwords #22

Closed rmarronnier closed 5 years ago

rmarronnier commented 5 years ago
watzon commented 5 years ago

My main problem with this is that when you rely on an external json file and try to build a binary it still relies on that external JSON file. I love the idea of being able to configure the stop words though, so we could do one of two things:

A) We could require that stop words be supplied if they're wanted by providing a path to a stop words JSON file. B) We could keep the small stop words list as a hash, but allow them to override it. C) We could load the JSON at compile time and have the English stop words set as the default.

All of them have drawbacks, but I think I like C the most. It's main drawback is that no matter what that large list of stop words is going to end up in the binary, but I am thinking about forcing people to only import the parts of Cadmium that they need anyway, so that would fix that problem.

What do you think?

rmarronnier commented 5 years ago

The problem is that I agree with everything you said : Isn't C) what is implemented in the PR ? It's the same like what we did with the sentiment.txt file : it's loaded at compile time.

I have no immediate solution in mind to avoid loading this data in the binary. It's a drawback sure, but it improves developer experience. 200 kb is not that much.

watzon commented 5 years ago

I'll leave some notes in the code for you :)

rmarronnier commented 5 years ago

Oh, I might a solution :

D) By default, set a path variable to the stopwords.json (I've hard coded it in the PR but I can make it user configurable) et set the @@lang to :en by default (as it is in the PR). And allow the developer to either set a new path for her own stopwords.json or set it to nil, in that case set @@stop_words to an array as it is now in master. Hope it make sense :-)

rmarronnier commented 5 years ago

Well, as you can see in the code :

Do we need these now ?

rmarronnier commented 5 years ago

Done !

I'm not a copy writer so you won't offend me if you need to correct what's written ;-)

rmarronnier commented 5 years ago

Wait before merging this, I want to add specs, fix some things and implement the Cadmium::Config module we talked in another issue.

watzon commented 5 years ago

Wait before merging this, I want to add specs, fix some things and implement the Cadmium::Config module we talked in another issue.

Just let me know when it's ready and I'll test

rmarronnier commented 5 years ago

Great. Everything is ready except the Cadmium::Config module, which I won't implement after thinking hard about it. I'll explain why it's a bad idea in the Language detector issue, so we can move the discussion there :-)