rando305 closed 11 months ago
Unfortunately extracting text from HTML emails just isn't that obvious. Saving the HTML part as an attachment turned out to be the best balance of accuracy and convenience when I first wrote this a few years back, and it can probably be improved today.
Searching for eg <body> will be insufficient, as every email client has its own bastardised version of HTML for outgoing messages. Some don't have a <body>, some do but have hundreds of <div>s, some just put <br> after each line, and there are all sorts of other dodgy behaviours.
The complexity of html2text shows how difficult this is to get right. Perhaps we can utilise html2text if it's installed? I know that some will have issues with it being released under the GPL, but giving people the option to use it (ie use if installed, do nothing if not installed) should be reasonable... thoughts?
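The "use if installed, do nothing if not installed" idea could look roughly like the sketch below. It is only an illustration: the function name html_part_to_description and the fallback argument are hypothetical and not part of the existing code. The only library call assumed is html2text.html2text(), the module-level convenience function of the html2text package.

```python
# Optional dependency: use html2text when it is available, otherwise
# keep the current behaviour (the caller's fallback text).
try:
    import html2text  # third-party and GPL-licensed, so never required
except ImportError:
    html2text = None


def html_part_to_description(html, fallback):
    """Return a plain-text rendering of `html`, or `fallback` when
    html2text is not installed.

    The function name and the `fallback` argument are hypothetical.
    """
    if html2text is None:
        # Do nothing special: the HTML part is still saved as an attachment.
        return fallback
    return html2text.html2text(html)
```

Because the import is wrapped in try/except, nothing changes for installations without html2text, and nobody is forced to pull in a GPL-licensed dependency.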
Other options might include django.utils.html.strip_tags (which just pulls out all <> tags, which might not lead to a clean outcome) or bleach.
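To illustrate why simply removing tags might not lead to a clean outcome, here is a standard-library sketch of naive tag stripping. It is not Django's strip_tags implementation, only a stand-in with similar spirit. The names TagStripper and naive_strip_tags are made up for this example.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect every piece of character data and ignore all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags, including the contents of
        # <style> and <script> elements.
        self.chunks.append(data)


def naive_strip_tags(html):
    """Return `html` with the tags dropped and nothing else changed."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


sample = "<style>p{color:red}</style><p>Hi<br>there</p>"
print(naive_strip_tags(sample))
```

In the sample, the CSS rule survives as if it were message text, and the words on either side of the <br> run together because no line break is inserted in its place. Both problems are common in HTML email.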
It seems like it would be difficult to get working, but at the same time getting just 80% of the cases right would be better than nothing. I haven't used the ticket system "in anger" yet, but I can see it might save a lot of time to show something, eg:
On the other hand, it doesn't look like many people are asking for it?
Is there any reason we couldn't use beautifulsoup to do this? It would be an extra dependency, but maybe a solution. A tutorial says print(soup.get_text()) is enough to get text out of HTML. Anyone have experience with it?
Yes, I both have experience with Beautiful Soup and agree that it would mostly work for this, enough that it would be worth using. BS is more forgiving of broken HTML than anything in the standard library, and broken HTML has, historically at least, been particularly prevalent in HTML email.
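A minimal sketch of the Beautiful Soup approach discussed above, assuming the beautifulsoup4 package is available. The function name html_email_to_text is hypothetical. Dropping <script> and <style> first is an extra step not mentioned in the thread, added so their contents don't end up in the ticket text.

```python
try:
    from bs4 import BeautifulSoup  # third-party: beautifulsoup4
except ImportError:
    BeautifulSoup = None


def html_email_to_text(html):
    """Return the visible text of an HTML email body, or None when
    Beautiful Soup is not installed. (Hypothetical helper name.)"""
    if BeautifulSoup is None:
        return None
    # "html.parser" is the parser bundled with Python, so no further
    # dependency (such as lxml) is assumed here.
    soup = BeautifulSoup(html, "html.parser")
    # Extra step (an assumption, not from the thread): remove elements
    # whose contents are never meant to be read.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # One text fragment per line, with surrounding whitespace trimmed.
    return soup.get_text(separator="\n", strip=True)
```

Passing a separator keeps text from adjacent elements from running together, which matters for messages that rely on <br> or nested <div>s for their line breaks.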
Adding this dependency could also be a step towards evaluating a pluggable app that would replace a larger component of code that includes this functionality. I'm on my mobile at the moment and don't have access to the relevant bookmark right now, but I'll add a link to an established third party package for Django inbound email processing for your consideration once I get back.
From issue #504 add CC: @willstott101
The email parsing code has been re-written since this bug was submitted.
In the ticket description, an HTML message gets submitted as an attachment, with a note telling the reader to see the attachment.
Would it be possible to simply search the HTML for <body, then search to <p, then to >, and take all the text until the </p>?
That would be quite easy - and quite useful - keep the blurb about the attached html, but at least make the description more useful without having to get into the attachments.
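The search described above could be sketched with a single regular expression, as below. The function name first_paragraph_text is hypothetical. One small deviation from the suggestion: the pattern requires a word boundary after <p, so that <pre> is not mistaken for a paragraph.

```python
import re

# <body ... then the first <p ...> after it, then everything up to </p>.
_FIRST_PARAGRAPH = re.compile(
    r"<body\b.*?<p\b[^>]*>(.*?)</p>",
    re.IGNORECASE | re.DOTALL,
)


def first_paragraph_text(html):
    """Return the contents of the first <p> inside <body>, or None
    when the message has no <body> or no closed <p>."""
    match = _FIRST_PARAGRAPH.search(html)
    if match is None:
        return None
    return match.group(1).strip()


message = (
    "<html><body class='x'><div>"
    "<p style='a'>Please reset my password.</p><p>Thanks</p>"
    "</div></body></html>"
)
print(first_paragraph_text(message))
print(first_paragraph_text("<div>Hi<br>there</div>"))
```

The second call shows the limitation raised earlier in the thread: a message with no <body> or no <p> yields nothing, so the attachment blurb would still be needed as a fallback. Any tags nested inside the paragraph are also returned as-is.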
Just a suggestion - Thanks for the development! I'm going to use it as an internal process flow that our external partners can use and monitor the status of their requests. I think it is easier for me to teach them how to email requests than to log into the system. :-)