dgarrick / headliner

headliner is a service that clusters news articles around common stories

Merge clusters like "trumps" if "trump" exists. Otherwise don't. #38

Open campbellcompton opened 7 years ago

evancofer commented 6 years ago

You can probably use or modify an existing stemming or lemmatization algorithm or library for this (see https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html for definitions). I believe Python's NLTK already supports both. I would implement this myself, but I don't have the bandwidth in my work schedule.
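
A rough sketch of the merge rule from the issue title, in case it helps whoever picks this up. It assumes clusters are held as a dict mapping a keyword to a list of article IDs, which is a guess and may not match headliner's actual data structure. `naive_singular` and `merge_variant_clusters` are hypothetical names, and `naive_singular` is only a stand-in for a real stemmer so the example runs with the standard library alone.

```python
from typing import Callable, Dict, List


def naive_singular(word: str) -> str:
    """Hypothetical stand-in for a real stemmer: strip one trailing 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def merge_variant_clusters(
    clusters: Dict[str, List[str]],
    normalize: Callable[[str], str] = naive_singular,
) -> Dict[str, List[str]]:
    """Fold a cluster into its base form only if the base form is already a cluster key.

    Single pass: a key is normalized once, so chains of variants are not followed.
    """
    merged: Dict[str, List[str]] = {}
    for key, articles in clusters.items():
        base = normalize(key)
        # Merge "trumps" into "trump" only when "trump" exists; otherwise keep the key.
        target = base if base in clusters else key
        merged.setdefault(target, []).extend(articles)
    return merged


clusters = {"trump": ["a1"], "trumps": ["a2"], "tacos": ["a3"]}
result = merge_variant_clusters(clusters)
# "trumps" is folded into "trump"; "tacos" stays because there is no "taco" cluster.
```

If NLTK turns out to be deployable, `nltk.stem.PorterStemmer().stem` could be passed as `normalize` instead of the naive function. Because a merge only happens when the normalized form already exists as a cluster key, stems that are not real words would simply leave the cluster untouched.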

dgarrick commented 6 years ago

We've had problems deploying nltk to Heroku before, but it sounds like using it here would be worth investigating.

evancofer commented 6 years ago

There are almost certainly alternatives, but NLTK is probably the most commonly used.