Data4Democracy / indivisible

Aggregating call to action sites into a single application.
Scrape email content #8

Open pghosh opened 7 years ago

pghosh commented 7 years ago

Background

Action sites send emails to their subscribers with action and event information. These emails come in different formats. This task is to identify the best way to scrape the email body and save it as text so that further analysis can be done.

Acceptance Criteria

Make a call to Scraper.scrape for a given email and save content as raw text.

pghosh commented 7 years ago
Design thoughts

Scraper.scrape is the entry point for all scraping. Depending on the type of email (i.e., simple HTML vs. pictures vs. plain text), we can have multiple method definitions if required, and a delegator to dispatch between them.
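A minimal sketch of how such an entry point could delegate by content type, using only Python's standard library. Apart from `Scraper.scrape`, every name here (`_scrape_plain`, `_scrape_html`, `_TextExtractor`) is hypothetical and not taken from the repository. Image-only emails are not handled.

```python
from email import message_from_string, policy
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Hypothetical helper: collects visible text, skipping <script>/<style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


class Scraper:
    @staticmethod
    def scrape(raw_email):
        """Entry point: parse the raw email and delegate by body type."""
        msg = message_from_string(raw_email, policy=policy.default)
        # Prefer a plain-text body; fall back to HTML if that is all there is.
        body = msg.get_body(preferencelist=("plain", "html"))
        if body is None:
            return ""
        if body.get_content_type() == "text/html":
            return Scraper._scrape_html(body.get_content())
        return Scraper._scrape_plain(body.get_content())

    @staticmethod
    def _scrape_plain(text):
        return text.strip()

    @staticmethod
    def _scrape_html(html):
        parser = _TextExtractor()
        parser.feed(html)
        parser.close()
        return "\n".join(parser.chunks)
```

The returned string can then be written to disk as raw text, which would cover the acceptance criteria for the simple cases.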

eenblam commented 7 years ago

I'm currently working on this. I'll probably have a PR soon-ish for small stuff, but the more involved bits might have to wait until next weekend. If someone else wants to knock it out in the meantime, more power to you. :)

eenblam commented 7 years ago

What are our example use cases for handling these differently? Anything less straightforward than something like this?

  1. Identify attachments
  2. Write attachment to disk
  3. Persist email_body, email_header, (attachment_path, attachment_headers); attachments flagged as unverified until validation succeeds
  4. Push attachments onto queue for security validation; external service?

I'm glossing over the actual parsing; I just want to make sure I'm not overlooking something.
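Steps 1 to 3 above could look roughly like the following, assuming the standard library `email` package is used for parsing. The function name `extract_attachments`, the record layout, and the `verified` flag are illustrative assumptions; step 4 (the validation queue) is left out entirely.

```python
import os
from email import message_from_bytes, policy


def extract_attachments(raw_bytes, out_dir):
    """Hypothetical helper: write attachments to out_dir and build a record."""
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    body = msg.get_body(preferencelist=("plain", "html"))
    record = {
        # dict() keeps only one value per header name; fine for a sketch.
        "email_header": dict(msg.items()),
        "email_body": body.get_content() if body is not None else "",
        "attachments": [],
    }
    for index, part in enumerate(msg.iter_attachments()):
        # Do not trust the sender-supplied filename for the on-disk path.
        path = os.path.join(out_dir, f"attachment-{index}.bin")
        payload = part.get_payload(decode=True) or b""
        with open(path, "wb") as fh:
            fh.write(payload)
        record["attachments"].append({
            "attachment_path": path,
            "attachment_headers": dict(part.items()),
            "original_filename": part.get_filename(),
            # Stays False until some later validation step succeeds.
            "verified": False,
        })
    return record
```

Generating the file name locally rather than reusing the one in the email avoids path tricks such as `../` in attachment names.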

Regarding number 4: what are our plans to ensure safe handling and storage of attachments? To avoid forwarding malicious attachments? I'm always happy to learn more about security, but my appsec-fu is weak here, and I'd rather not accidentally forward FinSpy to a bunch of activists.

Similarly, we need to escape JS embedded in the email body. This should be implemented sooner rather than later.
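One simple option, assuming the saved body is ever rendered in a web page, is to HTML-escape it before display so any embedded markup is shown as text rather than executed. `escape_body` is a hypothetical name; `html.escape` is from the standard library.

```python
import html


def escape_body(body_text):
    """Hypothetical helper: neutralize markup so scripts are not executed."""
    # quote=True also escapes " and ', which matters inside HTML attributes.
    return html.escape(body_text, quote=True)
```

This only covers display of stored text; it does not make attachments or links safe.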

https://zeltser.com/analyzing-malicious-documents/

eenblam commented 7 years ago

References relevant to this issue: