When we check whether any key terms have changed, we should try using word stemming instead of exact string matching: https://github.com/edgi-govdata-archiving/web-monitoring-task-sheets/blob/074cc01b8ec81a70298b897eaf86bc6ad15fba6c/analyst_sheets/analyze.py#L190-L195
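A rough sketch of what that comparison could look like, assuming NLTK is installed and the key terms are single lowercase words (the choice of stemmer is still an open question; Porter is just a placeholder here):

```python
# Sketch only: compares key-term sets by stem instead of exact string match.
# Assumes nltk is installed and terms are single lowercase words; the choice
# of stemmer (Porter here) is still an open question.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def changed_terms(old_terms, new_terms):
    """Return (added, removed) terms, comparing by stem rather than exact match."""
    old_stems = {stemmer.stem(t) for t in old_terms}
    new_stems = {stemmer.stem(t) for t in new_terms}
    added = {t for t in new_terms if stemmer.stem(t) not in old_stems}
    removed = {t for t in old_terms if stemmer.stem(t) not in new_stems}
    return added, removed
```

With this, "regulation" vs. "regulations" would no longer count as a change, which is the point — and also the likely source of false positives.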
This should be something we can turn on/off, since I’m not sure how well it will work and whether we’ll get a lot of false positives.
To keep things comprehensible, we need to keep a map of "stemmed terms" → "actual terms" so that we can present them as the actual terms, even though we are matching by stem.
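That map could be as simple as a `defaultdict` keyed by stem (again assuming NLTK; the stemmer and the terms below are illustrative placeholders):

```python
# Sketch: map each stem back to the surface forms we actually saw, so matches
# can be reported as the real terms. PorterStemmer is a placeholder choice.
from collections import defaultdict
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stem_to_terms = defaultdict(set)
for term in ["cat", "cats", "dog"]:  # illustrative terms
    stem_to_terms[stemmer.stem(term)].add(term)
# stem_to_terms now groups "cat" and "cats" under one stem, "dog" under another.
```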
NLTK supports several different stemming implementations, so I need to do some reading and testing to determine which makes the most sense. API docs: https://www.nltk.org/api/nltk.stem.html
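For that testing, a quick side-by-side of the main implementations on sample words might be a good starting point (assumes only that nltk itself is installed; the stemmers don't need any corpus downloads):

```python
# Sketch: compare the stems produced by NLTK's main stemmer implementations.
# Assumes only that nltk itself is installed; stemmers need no data downloads.
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}
for word in ["running", "regulations", "generously"]:
    print(word, {name: s.stem(word) for name, s in stemmers.items()})
```

Lancaster is generally the most aggressive of the three, so it's probably the one most likely to produce false positives.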