Data4Democracy / indivisible

Aggregating call to action sites into a single application.

Build email listener #10

Open pghosh opened 7 years ago

pghosh commented 7 years ago

Build a listener (listner.py) to pull email from a pre-defined email address and get it ready for scraping.

nix-bohon commented 7 years ago

I'll take care of this this weekend if you want to go ahead and assign it to me.

nix-bohon commented 7 years ago

I'm getting back to this now. I've registered an email address on Gmail that we should be able to use to sign up for action emails.

bonfiam commented 7 years ago

Paramita, you have one set up already. How can Zac access it to use the data it has collected already?

nix-bohon commented 7 years ago

@pghosh, I've gotten to a pretty good place on building out the parser. It can pull a Gmail account's emails over POP3 and then run a scraper function on them, but what do you want the email messages to look like at that point? I have them as email.message.Message objects. Does this work, or should I just dump the message body as a string and forget about the subject, to, from, etc. fields?

pghosh commented 7 years ago

For now let's just have the body dumped. Name the email files using the address, subject, and date_time so that once we add tags we can easily verify the result.
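A minimal sketch of what that dump could look like, using only the standard library. The helper names (`safe`, `dump_name`, `body_text`), the double-underscore separator, and the `.txt` extension are assumptions made for illustration; nothing in this thread fixes the exact filename layout.

```python
import re
from email.message import EmailMessage
from email.utils import parseaddr, parsedate_to_datetime


def safe(text, max_len=60):
    """Replace runs of filename-unfriendly characters with underscores."""
    return re.sub(r"[^A-Za-z0-9@.+-]+", "_", text).strip("_")[:max_len]


def dump_name(msg):
    """Build an address/subject/date_time filename (hypothetical layout)."""
    address = parseaddr(msg.get("From", ""))[1] or "unknown"
    subject = msg.get("Subject", "") or "no_subject"
    try:
        stamp = parsedate_to_datetime(msg["Date"]).strftime("%Y%m%d_%H%M%S")
    except (TypeError, ValueError):
        # Missing or unparseable Date header.
        stamp = "undated"
    return "{}__{}__{}.txt".format(safe(address), safe(subject), stamp)


def body_text(msg):
    """Return the first text/plain part as a string, or '' if there is none."""
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True) or b""
            charset = part.get_content_charset() or "utf-8"
            return payload.decode(charset, errors="replace")
    return ""


# A made-up message, just to exercise the helpers.
sample = EmailMessage()
sample["From"] = "Action Alerts <alerts@example.org>"
sample["Subject"] = "Call your senator!"
sample["Date"] = "Mon, 06 Feb 2017 09:30:00 -0500"
sample.set_content("Please call today.\n")
```

Writing `body_text(sample)` to a file named `dump_name(sample)` would keep the sender, subject, and timestamp visible in a directory listing, which is the property asked for here.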

eenblam commented 7 years ago

I'm not in any way an email expert, but what do y'all think about POP? My understanding is that it's been deprecated in favor of IMAP, and the IMAP plumbing can just about write itself.

pghosh commented 7 years ago

We should be marking the messages as read as soon as we have consumed them, so I would go with IMAP.
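One way this could look with imaplib, as a sketch rather than the project's implementation. The function names `connect` and `fetch_unseen` are hypothetical. The sketch fetches with `BODY.PEEK[]` so the server does not set `\Seen` implicitly, then sets the flag explicitly once the caller has moved past the message.

```python
import imaplib


def connect(host, user, password):
    """Open a logged-in IMAP-over-SSL connection (host and credentials are the caller's)."""
    conn = imaplib.IMAP4_SSL(host)
    conn.login(user, password)
    return conn


def fetch_unseen(conn, mailbox="INBOX"):
    """Yield (message number, raw message bytes) for each unread message.

    `conn` is assumed to be a logged-in imaplib connection.
    """
    conn.select(mailbox)
    status, data = conn.search(None, "UNSEEN")
    if status != "OK":
        return
    for num in data[0].split():
        # BODY.PEEK[] returns the full message without marking it as read.
        status, parts = conn.fetch(num, "(BODY.PEEK[])")
        if status != "OK":
            continue
        # imaplib returns literals as (envelope, payload) tuples.
        raw = next(p[1] for p in parts if isinstance(p, tuple))
        yield num, raw
        # Runs only when the caller asks for the next message, i.e. after
        # it has finished consuming this one.
        conn.store(num, "+FLAGS", "\\Seen")
```

Because the flag is set after the `yield`, a message that causes the consumer to stop iterating stays unread and would be picked up again on the next run.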

eenblam commented 7 years ago

Updated version of the above imaplib example, this time compatible with Python 3.

eenblam commented 7 years ago

Regarding the question of what to parse out of the email, would that functionality not be best left to scraper.email_scraper? (i.e. just call self.scrape(msg) as is currently implemented)

EDIT: I suggested this because it separated concerns while maintaining the spec, but it also matches up with a frustrating reality of IMAP. With imaplib, we may very well wish to change the string message_parts passed to fetch(message_set, message_parts). However, doing this can change up the structure of the results, so we need to do one of two things:

The former seems more appealing to me, because we can be flexible with what we do with the incoming emails. We can push them off to a different processing service, we can split up the field data for different purposes, itertools.tee to a log, etc. For instance, we could

from listener import Listener
from scraper import parse
from itertools import tee

# Internally, Listener would start an IMAP connection
session = Listener(config)
msgs = session.yield_msgs(query)
# The structure of the message tuple could change here when we change query
t1, t2 = tee(msgs, 2)
processed = (parse(body, header) for body, header, _ in t1)
persist_or_whatever(processed)
raw_log(t2)

nix-bohon commented 7 years ago

I was thinking something like this. I'll revise so that it doesn't use POP and submit a pull request.

eenblam commented 7 years ago

Note that above I'm getting (body, header, b')') due to a silly choice of message_parts like '(BODY[TEXT] BODY[HEADER.FIELDS SUBJECT TO FROM DATE])'. If we just use '(RFC822)', the parser can get a pretty nice result using email.parser.BytesParser. I just meant to make the point that the listener can be pretty agnostic about what messages are being pulled and what's being done with them.
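For reference, a small sketch of the '(RFC822)' route: the raw bytes from the fetch go straight into email.parser.BytesParser. The helper names `parse_rfc822` and `summarize`, and the sample message in `RAW`, are made up for illustration.

```python
from email import policy
from email.parser import BytesParser


def parse_rfc822(raw_bytes):
    """Parse raw message bytes, e.g. from fetch(num, '(RFC822)'), into an EmailMessage."""
    return BytesParser(policy=policy.default).parsebytes(raw_bytes)


def summarize(msg):
    """Pull out the header fields discussed in this thread plus the plain-text body."""
    body = msg.get_body(preferencelist=("plain",))
    return {
        "from": str(msg.get("From", "")),
        "to": str(msg.get("To", "")),
        "subject": str(msg.get("Subject", "")),
        "date": str(msg.get("Date", "")),
        "body": body.get_content() if body is not None else "",
    }


# A made-up message standing in for what the IMAP server would return.
RAW = (
    b"From: alerts@example.org\r\n"
    b"To: listener@example.com\r\n"
    b"Subject: Town hall on Thursday\r\n"
    b"Date: Mon, 06 Feb 2017 09:30:00 -0500\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    b"Join us at 7pm.\r\n"
)

fields = summarize(parse_rfc822(RAW))
```

With this approach the listener only hands over bytes, and the choice of which fields to keep lives entirely on the parsing side.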

nix-bohon commented 7 years ago

@eenblam/@pghosh, I've put in a PR for the listener class.

pghosh commented 7 years ago

I like delegating the actual parsing logic because, as mentioned, the listener can be agnostic, and we might as well start adding more ways to funnel in messages, like a Twitter stream. @zacherybohon, I will go through the pull request details tonight.