Open pghosh opened 7 years ago
Scraper.scrap is the entry point for all scraping. Depending on the type of email (i.e., simple HTML vs. pictures vs. plain text), we can have multiple method definitions if required, plus a delegator that picks between them.
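To make the delegator idea concrete, here is a minimal sketch, assuming Python and the standard library `email` package. Only the name `Scraper.scrap` comes from this issue; the helper methods, the `_TextExtractor` class, and the choice to dispatch on MIME content type are hypothetical and just illustrate one way to do it.

```python
from email import message_from_string
from email.message import Message
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes and drops the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


class Scraper:
    def scrap(self, raw_email: str) -> str:
        """Entry point: pick a handler based on the content type of the text part."""
        msg = message_from_string(raw_email)
        part = self._first_text_part(msg)
        handlers = {"text/html": self._scrap_html}
        # Anything without a dedicated handler falls back to plain text.
        handler = handlers.get(part.get_content_type(), self._scrap_plain)
        return handler(part).strip()

    @staticmethod
    def _first_text_part(msg: Message) -> Message:
        # walk() yields the message itself first, then any nested parts.
        for part in msg.walk():
            if part.get_content_maintype() == "text":
                return part
        return msg

    @staticmethod
    def _decode(part: Message) -> str:
        payload = part.get_payload(decode=True)  # None for multipart containers
        if payload is None:
            return ""
        charset = part.get_content_charset() or "utf-8"
        return payload.decode(charset, errors="replace")

    def _scrap_plain(self, part: Message) -> str:
        return self._decode(part)

    def _scrap_html(self, part: Message) -> str:
        extractor = _TextExtractor()
        extractor.feed(self._decode(part))
        extractor.close()
        return "".join(extractor.chunks)
```

Image-only emails would need their own handler (OCR or similar), which this sketch leaves out on purpose.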
I'm currently working on this. I'll probably have a PR soon-ish for small stuff, but the more involved bits might have to wait until next weekend. If someone else wants to knock it out in the meantime, more power to you. :)
What are our example use cases for handling these differently? Anything less straightforward than something like this?
1. email_body
2. email_header
3. (attachment_path, attachment_headers)
4. attachments flagged as unverified until validation succeeds
I'm glossing over the actual parsing; I just want to make sure I'm not overlooking something.
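For discussion, the pieces listed above could be carried in a structure like the following. This is a rough sketch, assuming Python and the standard library `email` package; `ParsedEmail`, `Attachment`, `parse_email`, and the `attachment-N.bin` naming are all hypothetical names, not anything that exists in the repo. The only behavior it commits to is that every attachment starts out unverified.

```python
from dataclasses import dataclass, field
from email import message_from_bytes
from pathlib import Path


@dataclass
class Attachment:
    path: Path
    headers: dict
    verified: bool = False  # stays False until some validation step succeeds


@dataclass
class ParsedEmail:
    body: str
    headers: dict
    attachments: list = field(default_factory=list)


def parse_email(raw: bytes, attachment_dir: Path) -> ParsedEmail:
    """Split a raw email into body text, headers, and saved attachments."""
    msg = message_from_bytes(raw)
    body_parts = []
    attachments = []
    for index, part in enumerate(msg.walk()):
        if part.is_multipart():
            continue  # containers carry no payload of their own
        payload = part.get_payload(decode=True) or b""
        if part.get_content_disposition() == "attachment":
            # Do not reuse the sender-supplied filename for the on-disk path.
            path = attachment_dir / f"attachment-{index}.bin"
            path.write_bytes(payload)
            attachments.append(Attachment(path, dict(part.items())))
        elif part.get_content_maintype() == "text":
            charset = part.get_content_charset() or "utf-8"
            body_parts.append(payload.decode(charset, errors="replace"))
    return ParsedEmail("\n".join(body_parts), dict(msg.items()), attachments)
```

Nothing here validates or scans the attachment; it only records that validation has not happened yet, so downstream code can refuse to forward anything with `verified` still `False`.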
Regarding number 4: what are our plans to ensure safe handling and storage of attachments? To avoid forwarding malicious attachments? I'm always happy to learn more about security, but my appsec-fu is weak here, and I'd rather not accidentally forward FinSpy to a bunch of activists.
Similarly, we need to escape JS embedded in the email body. This should be implemented sooner rather than later.
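As a starting point for the escaping, here is a minimal sketch, assuming Python and that the scraped body is later displayed as text rather than rendered as HTML. `escape_email_body` is a hypothetical name; `html.escape` is from the standard library. It neutralizes all markup, not just scripts, so it would not fit if we ever want to preserve formatting.

```python
import html


def escape_email_body(body: str) -> str:
    """Escape &, <, >, and quotes so embedded markup or JS cannot execute when displayed."""
    return html.escape(body, quote=True)
```

If formatting has to survive, an allowlist-based HTML sanitizer would be the safer direction than trying to strip script tags by hand.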
Background
Action sites send emails to subscribers with action/event information. These emails come in different formats. This task is to identify the best way to scrape the email body and save it as text so that further analysis can be done.
Acceptance Criteria
Make a call to Scraper.scrape for a given email and save content as raw text.