chnm / serendipomatic

http://serendipomatic.org/

Fix fr stopwords #132

Closed mbwolff closed 11 years ago

mbwolff commented 11 years ago

When using Serendipomatic with French texts, the common words "les" and "a" are not filtered as stop words. I made a quick fix for this.

mialondon commented 11 years ago

Hello, and thanks for your contribution! Just in case you didn't see, there's a bunch of work to do for multilingual texts discussed at https://github.com/chnm/serendipomatic/issues/114

rlskoeser commented 11 years ago

Wow, I'm surprised nltk stopwords don't include those. I wonder if we're supposed to be stemming terms first? Although I guess that wouldn't help at all for a...

The two changes are redundant, right? We should only need to add the words in one place or the other? Although technically I probably shouldn't have the nltk stopwords checked into the git repo; I couldn't figure out how to get them deployed on heroku without doing that. @mialondon any thoughts?

@mbwolff when I have the time, I will look into merging this in and see what I can do to set it up to be more extensible (e.g., maybe we need to start our own lists of extra stopwords not handled by nltk, so we can add more terms as we discover them).
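For illustration, one way a supplementary list could sit on top of the nltk stopwords is sketched below. This is not the project's actual code: `EXTRA_STOPWORDS`, `load_nltk_stopwords`, `stopwords_for`, and `filter_terms` are hypothetical names, and the only extras shown are the two French words reported in this issue. The sketch falls back to an empty base list when nltk or its stopwords corpus is not installed.

```python
# Hypothetical per-language extras, keyed by the nltk language name.
# "les" and "a" are the two French words reported in this issue.
EXTRA_STOPWORDS = {
    "french": {"les", "a"},
}


def load_nltk_stopwords(language):
    """Return nltk's stopwords for a language as a set.

    Falls back to an empty set if nltk is not installed, the stopwords
    corpus has not been downloaded, or the language has no list.
    """
    try:
        from nltk.corpus import stopwords
        return set(stopwords.words(language))
    except (ImportError, LookupError, OSError):
        return set()


def stopwords_for(language):
    """Combine the nltk list with any local extras for the language."""
    return load_nltk_stopwords(language) | EXTRA_STOPWORDS.get(language, set())


def filter_terms(terms, language):
    """Drop terms that are stopwords, comparing case-insensitively."""
    stop = stopwords_for(language)
    return [term for term in terms if term.lower() not in stop]
```

With this shape, adding a newly discovered stopword means editing only the `EXTRA_STOPWORDS` mapping, for example `filter_terms(["Les", "chats", "a", "Paris"], "french")` would no longer return "Les" or "a".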

mbwolff commented 11 years ago

Yes, I was thinking the same thing. We can use local lists of stopwords and edit as necessary.

mialondon commented 11 years ago

I'm wondering if there's a reason why the nltk library doesn't include them - presumably they've thought through these issues? It'd be good to check in with them in case they're not doing it for a hard-won reason, and perhaps either contribute to theirs, or as you've suggested, supplement it with a local version. A version editable as a plain text file would be the easiest way to have a range of contributors.
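As a rough sketch of the plain text idea, a contributor-editable list might be parsed as below. The file format here (one word per line, `#` for comments, blank lines ignored) and the function names `parse_stopwords` and `load_stopwords_file` are assumptions for illustration, not something the project has settled on.

```python
def parse_stopwords(lines):
    """Parse an iterable of text lines into a set of lowercase stopwords.

    Assumed format: one word per line; anything after '#' is a comment;
    blank lines are ignored.
    """
    words = set()
    for line in lines:
        word = line.split("#", 1)[0].strip().lower()
        if word:
            words.add(word)
    return words


def load_stopwords_file(path):
    """Read a UTF-8 stopword file from disk and parse it."""
    with open(path, encoding="utf-8") as handle:
        return parse_stopwords(handle)
```

Keeping the parser separate from the file reading makes it easy to check the format without touching the filesystem.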

I've been wondering about stemming... I suppose it depends how ruthlessly it's applied. e.g. if 'policing' is stemmed to 'police' it brings in extra, unrelated concepts (to use an example from another historian I know).

I've only had a bit of a play with deploying to heroku (I set up an instance to play with the code) so I'm not sure about libraries.

frankieroberto commented 11 years ago

@rlskoeser Hello, @mialondon pointed me here. Not too sure I can help, but the way I usually pull in external libraries onto a Heroku box is via a Gemfile. However, that applies to Ruby and I think you're using Python?

frankieroberto commented 11 years ago

PS https://devcenter.heroku.com/articles/python suggests using 'Pip' for dependency management, if that's any help.

rlskoeser commented 11 years ago

@frankieroberto yep, we're using pip for all of the normal python dependencies. However, as far as I can discover the nltk corpora have to be downloaded via the nltk downloader tool (we should have the stopwords download command documented in the github project readme). I think I tried it on my dev heroku instance last week, but can't remember now if it didn't work or just went into an unexpected place. I'll try to revisit that again (and take notes) when I have a bit of time.

mialondon commented 11 years ago

Ah, sorry Frankie, I knew you'd used Heroku but I thought you used Python rather than Ruby...

mbwolff commented 11 years ago

I just sent a message to Peter Ljunglöf http://www.cse.chalmers.se/~peb/ who handles parsing for NLTK. Will let you know if I hear anything.

anarchivist commented 11 years ago

Hi, you may want to take a look at this StackOverflow post as it may be useful. http://stackoverflow.com/questions/13965823/resource-corpora-wordnet-not-found-on-heroku

rlskoeser commented 11 years ago

Thanks, I suspected it might be something like that. I'll definitely make use of that when I revisit & document our heroku deploy.

rlskoeser commented 11 years ago

@mbwolff - I generalized your contribution and set up a simple way to handle extra stop words by language that aren't in the nltk corpus. For now, the only extras are the two you added, but I think it should make it very easy to add extra stop words for any of the languages that are currently supported. The updated stop words should go live in the next production update.

@anarchivist - thanks for the link; when I actually went to look at it in detail I discovered that it was pretty much exactly what I had done myself. I guess with heroku's read-only filesystem it must be the only solution.
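A minimal sketch of the checked-in corpus approach, assuming the nltk data lives in a directory named `nltk_data` at the project root (the directory name and both function names are hypothetical, and the project's real layout may differ). It relies on `nltk.data.path`, the list of directories nltk searches for corpora, and does nothing if nltk is not installed.

```python
import os


def local_nltk_data_dir(project_root):
    """Return the assumed location of the checked-in nltk data."""
    return os.path.join(project_root, "nltk_data")


def register_local_nltk_data(project_root):
    """Add the checked-in data directory to nltk's search path.

    Returns the registered path, or None if nltk is not installed.
    """
    try:
        import nltk.data
    except ImportError:
        return None
    path = local_nltk_data_dir(project_root)
    if path not in nltk.data.path:
        nltk.data.path.append(path)
    return path
```

Calling `register_local_nltk_data` once at startup, before any corpus is loaded, would let `nltk.corpus.stopwords` find the files without running the downloader on the deployed instance.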