dwillis / nicarl-archives

An exploration of the NICAR-L listserv for NICAR 2019

Potentially useful tools #1

Open simonw opened 5 years ago

simonw commented 5 years ago

Starting a thread here in an issue (the Wiki isn't enabled for this project, plus I don't think GitHub Wikis allow contributions from people who aren't project maintainers).

simonw commented 5 years ago

https://github.com/mailgun/talon by Mailgun is a sophisticated Python library for extracting signatures and quotations from emails - good for cleaning things up before attempting to ingest them into a search engine.
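To give a rough idea of the kind of cleanup meant here, below is a crude stdlib-only sketch. It is not talon's API and is far less capable than talon: it only drops lines starting with `>`, cuts at an "On ... wrote:" header, and cuts at the conventional `--` signature delimiter. The helper name `strip_quotes_and_signature` and the sample message are made up for illustration.

```python
import re

# Hypothetical helper, NOT part of talon: a naive approximation only.
# Matches reply headers such as "On Mon, Jan 7, 2019, Jane wrote:".
QUOTE_HEADER = re.compile(r"^On .+ wrote:\s*$")


def strip_quotes_and_signature(body: str) -> str:
    """Return the message body without quoted replies or a trailing signature."""
    kept = []
    for line in body.splitlines():
        if line.strip() == "--":
            break  # conventional signature delimiter: drop everything after it
        if QUOTE_HEADER.match(line):
            break  # start of a quoted reply: drop everything after it
        if line.lstrip().startswith(">"):
            continue  # individual quoted line
        kept.append(line)
    return "\n".join(kept).strip()


# Made-up sample message for illustration.
sample = (
    "Thanks, that works.\n"
    "\n"
    "On Mon, Jan 7, 2019, Jane wrote:\n"
    "> Did you try X?\n"
    "-- \n"
    "John"
)
cleaned = strip_quotes_and_signature(sample)
```

Real list traffic has many more quoting and signature styles than this handles, which is the argument for using a dedicated library instead.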

simonw commented 5 years ago

Also from Mailgun: https://github.com/mailgun/flanker - useful for normalizing email addresses
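As a minimal illustration of what normalizing addresses can mean, here is a sketch using only Python's standard library `email.utils.parseaddr`, not flanker. The helper name `normalize_address` is hypothetical, and lowercasing the whole address is an assumption: it is convenient for grouping posts by sender, but the local part of an address is technically case-sensitive.

```python
from email.utils import parseaddr


def normalize_address(raw: str) -> str:
    """Return a lowercased bare address, or "" if no address is found.

    Hypothetical helper for illustration; this is not flanker's API.
    """
    _display_name, addr = parseaddr(raw.strip())
    if "@" not in addr:
        return ""
    # Assumption: treat addresses as case-insensitive for grouping purposes.
    return addr.lower()


# Two spellings of the same (made-up) sender collapse to one key.
a = normalize_address("Jane Doe <Jane.Doe@Example.COM>")
b = normalize_address("  JANE.DOE@example.com ")
```

This only strips display names and case differences; validation and provider-specific rules are where a dedicated library would help.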

simonw commented 5 years ago

Since I throw SQLite at basically everything now, I'm going to suggest loading this data into SQLite. It has surprisingly good built-in full-text search. I wrote about that here: https://simonwillison.net/2019/Jan/7/exploring-search-relevance-algorithms-sqlite/

My sqlite-utils library makes importing data into a new SQLite database really easy: https://simonwillison.net/2019/Feb/25/sqlite-utils/ and https://github.com/simonw/sqlite-utils
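For a sense of what the built-in full-text search looks like, here is a small sketch using only Python's standard `sqlite3` module rather than sqlite-utils. It assumes the local SQLite build includes the FTS5 extension (most recent builds do); the choice of FTS5, the `messages` table, its columns, and the sample rows are all made up for illustration and are not taken from the archive.

```python
import sqlite3

# Made-up rows standing in for parsed listserv messages.
messages = [
    ("Geocoding addresses in bulk",
     "Has anyone used the Census geocoder for large batches?"),
    ("PDF table extraction",
     "Tabula works well for pulling tables out of PDFs."),
    ("Re: PDF table extraction",
     "Seconding Tabula; pdfplumber is another option."),
]

conn = sqlite3.connect(":memory:")
# Assumption: this SQLite build was compiled with FTS5 support.
conn.execute("CREATE VIRTUAL TABLE messages USING fts5(subject, body)")
conn.executemany(
    "INSERT INTO messages (subject, body) VALUES (?, ?)", messages
)


def search(term: str) -> list[str]:
    """Return subjects of messages matching the term, best match first."""
    rows = conn.execute(
        "SELECT subject FROM messages WHERE messages MATCH ? ORDER BY rank",
        (term,),
    ).fetchall()
    return [subject for (subject,) in rows]


tabula_hits = search("tabula")
geocoder_hits = search("geocoder")
```

The default tokenizer is case-insensitive, so `tabula` matches `Tabula`. Swapping `":memory:"` for a file path would produce a database file that other SQLite tools can open.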

dwillis commented 5 years ago

Some others:

- SpamScope mail parser
- GitHub's email parsing topic
- JetBrains email parser