edgi-govdata-archiving / web-monitoring-task-sheets

Experimental new tool for generating weekly analyst task sheets for web monitoring
GNU General Public License v3.0

Use word stems for key term matching #1

Open Mr0grog opened 4 years ago

Mr0grog commented 4 years ago

When we check whether any key terms have changed, we should try out using word stemming instead of exact matches: https://github.com/edgi-govdata-archiving/web-monitoring-task-sheets/blob/074cc01b8ec81a70298b897eaf86bc6ad15fba6c/analyst_sheets/analyze.py#L190-L195

This should be something we can turn on/off, since I’m not sure how well it will work and whether we’ll get a lot of false positives.

To keep the results comprehensible, we need to maintain a map of "stemmed terms" → "actual terms" so that we can present matches as the actual terms, even though we are matching by stem.
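A rough sketch of how the toggle and the stem → term map might fit together. This is not the code in `analyze.py`. `find_changed_terms`, its `stemmer` parameter, and `naive_stem` are all hypothetical names, and `naive_stem` is only a toy stand-in so the example runs without extra dependencies. It also assumes key terms are single words.

```python
import re
from collections import defaultdict


def naive_stem(word):
    """Toy stand-in stemmer (hypothetical): strips a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def find_changed_terms(text, key_terms, stemmer=None):
    """Return {key term: sorted words in `text` that matched it}.

    With stemmer=None this is plain exact matching (stemming "off").
    With a stemmer callable, words match when their stems are equal.
    Assumes single-word key terms.
    """
    normalize = stemmer if stemmer is not None else (lambda w: w)

    # Map each normalized form (the stem) back to the terms as written,
    # so results can be reported using the actual terms.
    stem_to_terms = defaultdict(set)
    for term in key_terms:
        stem_to_terms[normalize(term.lower())].add(term)

    matches = defaultdict(set)
    for word in re.findall(r"[a-z']+", text.lower()):
        for term in stem_to_terms.get(normalize(word), ()):
            matches[term].add(word)
    return {term: sorted(words) for term, words in matches.items()}
```

Passing `stemmer=None` keeps the current exact-match behavior, so the feature can be switched off. Any callable that maps a word to its stem should work in place of `naive_stem`, for example the `stem` method of an NLTK stemmer instance.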

NLTK supports several different stemming implementations, so I need to do some reading and testing to see which makes the most sense. API docs: https://www.nltk.org/api/nltk.stem.html
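One possible way to do that testing is to run candidate stemmers side by side over a sample of key terms and look for words that collapse to the same stem, since those are likely sources of false positives. This is only a sketch: `compare_stemmers`, `collisions`, and `build_nltk_stemmers` are hypothetical helper names. The three NLTK classes are ones I expect to be in `nltk.stem`, but which stemmers to shortlist is an assumption.

```python
from collections import defaultdict


def compare_stemmers(words, stemmers):
    """Return {word: {stemmer name: stem}} for side-by-side comparison."""
    return {
        word: {name: stem(word) for name, stem in stemmers.items()}
        for word in words
    }


def collisions(words, stem):
    """Return {stem: [words]} for stems shared by more than one word.

    Distinct words that share a stem are candidates for false positives.
    """
    groups = defaultdict(list)
    for word in words:
        groups[stem(word)].append(word)
    return {s: ws for s, ws in groups.items() if len(ws) > 1}


def build_nltk_stemmers():
    """Build name -> stem function for a few NLTK stemmers.

    Requires nltk to be installed; imported lazily so the helpers above
    stay usable without it.
    """
    from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

    return {
        "porter": PorterStemmer().stem,
        "snowball": SnowballStemmer("english").stem,
        "lancaster": LancasterStemmer().stem,
    }
```

With NLTK installed, `compare_stemmers(sample_terms, build_nltk_stemmers())` gives a table to eyeball, and `collisions(sample_terms, PorterStemmer().stem)` flags terms that a given stemmer would treat as identical.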