pghosh opened this issue 7 years ago
I'll take care of this this weekend if you want to go ahead and assign it to me.
I'm getting back to this now. I've registered an email address on Gmail that we should be able to use to sign up for action emails.
Paramita, you have one set up already. How can Zac access it to use the data it has collected already?
@pghosh, I have gotten to a pretty good place on building out the parser: it can pull a Gmail account's emails over POP3 and then run a scraper function on them. But what do you want the email messages to look like at that point? I have them as email.message.Message objects. Does this work, or should I just dump the message body as a string and forget about the subject, to, from, etc. fields?
For now let's just have the body dumped. Name the email files with address/subject/date_time so that once we add tags we can easily verify the results.
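A minimal sketch of that filename convention might look like this (the helper name, separator, and extension here are my assumptions, not anything specified in the thread):

```python
import re
from datetime import datetime

def dump_filename(address, subject, when):
    """Build a filesystem-safe name like 'address__subject__timestamp.txt'.
    (Hypothetical helper; the exact separators are assumptions.)"""
    def safe(s):
        # Collapse any run of unsafe characters into a single underscore
        return re.sub(r"[^A-Za-z0-9._-]+", "_", s)
    stamp = when.strftime("%Y%m%d_%H%M%S")
    return f"{safe(address)}__{safe(subject)}__{stamp}.txt"

# e.g. dump_filename("a@b.com", "Re: hello", datetime(2018, 1, 1, 12, 0))
```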
I'm not in any way an email expert, but what do y'all think about POP? My understanding is that it's been deprecated in favor of IMAP, and the IMAP plumbing can just about write itself.
We should be marking the messages as read as soon as we have consumed them, so I would go with IMAP.
Updated version of the above imaplib example, this time compatible with Python 3.
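The referenced example isn't reproduced in this thread, but a minimal Python 3 imaplib sketch along these lines could look like the following (the function names, host, and credentials are placeholders I'm assuming, not code from the thread):

```python
import email
import imaplib
from email import policy

def message_from_raw(raw_bytes):
    """Parse raw RFC822 bytes (as returned by FETCH) into an EmailMessage."""
    return email.message_from_bytes(raw_bytes, policy=policy.default)

def fetch_unseen(host, user, password, mailbox="INBOX"):
    """Yield unseen messages; fetching RFC822 also marks them \\Seen."""
    with imaplib.IMAP4_SSL(host) as conn:  # context manager needs Python 3.5+
        conn.login(user, password)
        conn.select(mailbox)
        # SEARCH returns a space-separated byte string of message numbers
        _, data = conn.search(None, "UNSEEN")
        for num in data[0].split():
            _, msg_data = conn.fetch(num, "(RFC822)")
            yield message_from_raw(msg_data[0][1])
```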
Regarding the question of what to parse out of the email, would that functionality not be best left to scraper.email_scraper? (i.e. just call self.scrape(msg) as is currently implemented)
EDIT: I suggested this because it separates concerns while maintaining the spec, but it also matches up with a frustrating reality of IMAP. With imaplib, we may very well wish to change the string message_parts passed to fetch(message_set, message_parts). However, doing this can change up the structure of the results, so we need to do one of two things: yield back results without looking at them, or keep the listener's parsing in sync with whatever message_parts is set to. The former seems more appealing to me, because we can be flexible with what we do with the incoming emails: we can push them off to a different processing service, we can split up the field data for different purposes, itertools.tee to a log, etc. For instance, we could:
from listener import Listener
from scraper import parse
from itertools import tee
# Internally, Listener would start an IMAP connection
session = Listener(config)
msgs = session.yield_msgs(query)
# The structure of the message tuple could change here when we change query
t1, t2 = tee(msgs, 2)
processed = (parse(body, header) for body, header, _ in t1)
persist_or_whatever(processed)
raw_log(t2)
I was thinking something like this. I'll revise so that it doesn't use POP and submit a pull request.
Note that above I'm getting (body, header, b')') due to a silly choice of message_parts like '(BODY[TEXT] BODY[HEADER.FIELDS SUBJECT TO FROM DATE])'. If we just use '(RFC822)', the parser can get a pretty nice result using email.parser.BytesParser. I just meant to make the point that the listener can be pretty agnostic about what messages are being pulled and what's being done with them.
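To illustrate the '(RFC822)' route: the raw bytes that fetch returns can go straight into email.parser.BytesParser (the sample message below is hand-made for demonstration, not real fetched data):

```python
from email.parser import BytesParser
from email import policy

# Raw bytes as a FETCH of '(RFC822)' would return them -- a hand-made sample
raw = (
    b"From: alice@example.com\r\n"
    b"To: bot@example.com\r\n"
    b"Subject: test message\r\n"
    b"Date: Mon, 01 Jan 2018 00:00:00 +0000\r\n"
    b"\r\n"
    b"Hello from the listener test.\r\n"
)

# policy.default gives the modern EmailMessage API (get_content, etc.)
msg = BytesParser(policy=policy.default).parsebytes(raw)
print(msg["Subject"])             # test message
print(msg.get_content().strip())  # Hello from the listener test.
```

This way the listener hands off raw bytes and the parsing choices (which headers to keep, how to decode the body) all live downstream.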
@eenblam/@pghosh, I've put in a PR for the listener class.
I like delegating the actual parsing logic because, as mentioned, the listener can be agnostic, and we might as well start adding more ways to funnel messages, like a Twitter stream. @zacherybohon I will go through the pull details tonight.
Build listener (listner.py) to pull email from a pre-defined email address and get it ready for scraping