anidata / ht-etl

Anidata 1.0: ETL and algorithm code.

Create Luigi Task to parse emails #5

Closed bmenn closed 7 years ago

bmenn commented 7 years ago

Per the requirements in #1, we need to be able to extract email addresses in all their forms from raw HTML. I would also suggest creating an EmailAddress table with an auto-increment ID, so that an email address can be tracked across multiple sites.
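A minimal sketch of what such an EmailAddress table could look like, assuming SQLite purely for illustration (the project's actual database is not stated in this issue). The column names and the `create_email_table` / `get_or_create_email_id` helpers are hypothetical, not part of the ht-etl code.

```python
import sqlite3


def create_email_table(conn):
    """Create a hypothetical EmailAddress table with an auto-increment ID."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS EmailAddress (
            id    INTEGER PRIMARY KEY AUTOINCREMENT,
            email TEXT NOT NULL UNIQUE
        )
        """
    )


def get_or_create_email_id(conn, email):
    """Return the ID for an email address, inserting it if it is new."""
    # Normalize so the same address seen on different sites maps to one row.
    normalized = email.strip().lower()
    # The UNIQUE constraint makes this a no-op when the address already exists.
    conn.execute(
        "INSERT OR IGNORE INTO EmailAddress (email) VALUES (?)", (normalized,)
    )
    row = conn.execute(
        "SELECT id FROM EmailAddress WHERE email = ?", (normalized,)
    ).fetchone()
    return row[0]
```

Because every site-specific record would store only the ID, the same address found in postings on different sites would resolve to a single row.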

Examples of email address formats to handle (not exhaustive):

bmenn commented 7 years ago

@danlrobertson

Do you have any updates or need help on this issue?

lahoffm commented 7 years ago

I mostly solved this (see pull request), but it still needs improved regular expressions and support for parsing multiple emails from a posting instead of just the first one. This could be a good mini-project for people newer to programming.
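For anyone picking this up, here is a rough sketch of collecting every email address in a posting rather than only the first match. It is not the code from the pull request; the `EMAIL_RE` pattern and the `extract_emails` helper are illustrative assumptions, and the pattern is deliberately simple (it will miss obfuscated forms such as "name at example dot com" and addresses split across HTML tags).

```python
import html
import re

# Simplified pattern: local part, "@", then a dotted domain name.
# This is an assumption for illustration, not a full RFC 5322 matcher.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract_emails(raw_html):
    """Return all distinct email addresses found in raw HTML, in order."""
    # Decode entities such as "&#64;" so encoded "@" signs are matched too.
    text = html.unescape(raw_html)
    seen = set()
    emails = []
    # finditer yields every match, not just the first one.
    for match in EMAIL_RE.finditer(text):
        email = match.group(0).lower()
        if email not in seen:
            seen.add(email)
            emails.append(email)
    return emails
```

Usage:

```python
sample = (
    '<p>Contact: <a href="mailto:Jane.Doe@example.com">jane.doe@example.com</a>'
    " or bob&#64;mail.example.org.</p>"
)
print(extract_emails(sample))
# ['jane.doe@example.com', 'bob@mail.example.org']
```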

dlrobertson commented 7 years ago

Solved between #6 and #11