hmm, wonder if there's a good way we can pull those down and cache them if they're requested, rather than adding them all to the repository. Or just generally adding the ability to pull a stopwords list from a url...
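A minimal sketch of what that fetch-and-cache idea might look like; the helper name and cache directory are illustrative and not part of python-rake:

import os
import urllib.request

CACHE_DIR = os.path.expanduser('~/.python-rake/stopwords')  # illustrative location

def fetch_stopwords(url, cache_name):
    """Return the raw stopword text for url, downloading it only on first use."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    cache_path = os.path.join(CACHE_DIR, cache_name)
    if not os.path.exists(cache_path):
        # Download once and keep a local copy so later calls work offline.
        with urllib.request.urlopen(url) as response:
            data = response.read()
        with open(cache_path, 'wb') as f:
            f.write(data)
    with open(cache_path, 'r', encoding='utf-8') as f:
        return f.read()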
I thought about that, but in total they'll just be a few kilobytes, so I don't think it's worth adding that. Regarding adding URLs, we could, but it doesn't seem worth it since the list would have to be in a plain-text format (which in my experience is rare to find online), and users really should save it locally anyway.
Well, if you went the URL route, I'd have thought you'd provide a URL and a separator regex, like
RAKE.load_stopwords('http://example.com/beststopwords', re.compile('super-cool-regex'))
so it wouldn't matter how they formatted it as long as it was a list of some kind. It just feels like it would be convenient, especially if you were hacking/prototyping and wanted to experiment with different stoplists without having to download and format them manually.
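For reference, a rough sketch of what that could look like; load_stopwords and the example URL are hypothetical, not an existing python-rake API:

import re
import urllib.request

def load_stopwords(url, separator=re.compile(r'\s+')):
    """Fetch a stopword list from url and split it on the given separator regex."""
    with urllib.request.urlopen(url) as response:
        text = response.read().decode('utf-8')
    # Drop empty strings left by leading/trailing separators.
    return [word for word in separator.split(text) if word]

# e.g. for a comma-separated list (URL and regex are placeholders):
# stopwords = load_stopwords('http://example.com/beststopwords', re.compile(r',\s*'))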
Interesting. You may be right that it's a useful feature and I just don't see it, but I've never seen a data scientist who wanted to do that. Also, for the vast majority of sites it'd take more than just a regex; it'd require playing around in Beautiful Soup or something too. The way I've seen everyone do it, because it's always been the fastest, is to copy and paste into IPython and run a quick for loop.
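For what it's worth, that quick copy-and-paste approach usually amounts to something like this (the pasted text here is just an example):

# Paste the copied list into a string in IPython, then clean it up:
raw = """
a
about
above
after
"""
stopwords = [line.strip() for line in raw.splitlines() if line.strip()]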
It looks like this project has amassed a large collection of stopword lists from a variety of sources; do you think we could leverage that work? https://github.com/igorbrigadir/stopwords
For posterity's sake:
Hi Justin,
Thanks for asking. Yes you can use our stopword lists if you credit 'ranks.nl'
Does your script work with HTML documents, or only with text without markup?
If HTML, I'm curious whether you've had a chance to test the results from the Page Analyzer tool on ranks.nl? It is basically a tool for automatic keyword extraction from individual HTML documents.
Kind regards, Damian Doyle, Ranks NL
On Tue, Aug 1, 2017 at 10:02 PM, Justin Terry justinkterry@gmail.com wrote:
Hello, I'm working on an MIT-licensed open source natural language processing tool in Python: https://github.com/fabianvf/python-rake
Can I include your stop word lists into the package by default if I credit you?
-- Thank you for your time, Justin Terry
@fabianvf please close this; I fixed it in my last PR that you merged, and I forgot to mention it.
Never mind, apparently I can now.
After the current round of PRs is worked out, we should build in more stopword lists. I vote for adding all the ones here, along with any others people ask for: http://www.ranks.nl/stopwords. Also @fabianvf, one of these is what I used as a test file; do you think that'll cause a problem?