no plain text when HTML-only email (no plain alternative)

django-helpdesk / django-helpdesk

A Django application to manage tickets for an internal helpdesk. Formerly known as Jutda Helpdesk.

BSD 3-Clause "New" or "Revised" License


no plain text when HTML-only email (no plain alternative) #304

Closed: rando305 closed this issue 11 months ago

rando305 commented 9 years ago

In the ticket description, an HTML message gets submitted as an attachment, and the description itself only contains a note telling you to see the attachment.

Would it be possible to simply search the HTML for <body, then search to <p, then to >, and take all the text until the </p>?

That would be quite easy - and quite useful - keep the blurb about the attached html, but at least make the description more useful without having to get into the attachments.

Just a suggestion - Thanks for the development! I'm going to use it as an internal process flow that our external partners can use and monitor the status of their requests. I think it is easier for me to teach them how to email requests than to log into the system. :-)

rossp commented 9 years ago

Unfortunately extracting text from HTML emails just isn't that obvious. Saving the HTML part as an attachment turned out to be the best balance of accuracy and convenience when I first wrote this a few years back, and it can probably be improved today.

Doing a search for e.g. <body> will be insufficient, as every email client has its own bastardised version of HTML for outgoing messages. Some don't have a <body>, some do but have hundreds of <div>s, some just put <br> after each line, and there are all sorts of other dodgy behaviours.
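To make the objection concrete, here is a minimal sketch of the "find <body, then the first <p>" idea using only the standard library's `re` module. The function name `first_paragraph` and the sample messages are made up for illustration; this is not code from django-helpdesk.

```python
import re


def first_paragraph(html):
    """Naive extraction: text of the first <p>...</p> after <body, else None."""
    body = re.search(r"<body\b", html, re.IGNORECASE)
    if body is None:
        return None  # some clients send no <body> at all
    para = re.search(
        r"<p\b[^>]*>(.*?)</p>",
        html[body.end():],
        re.IGNORECASE | re.DOTALL,
    )
    if para is None:
        return None  # e.g. clients that only use <div> or <br>
    return para.group(1).strip()


# Hypothetical sample messages, one per behaviour described above.
tidy = "<html><body><p>Printer is down.</p></body></html>"
no_body = "<div>Printer is down.</div>"
br_only = "<body>Printer is down.<br>Second floor.<br></body>"

for sample in (tidy, no_body, br_only):
    print(repr(first_paragraph(sample)))
```

Only the tidy message yields any text; the other two fall through to `None`, which is the failure mode being described.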

The complexity of html2text shows how difficult this is to get right. Perhaps we can utilise html2text if it's installed? I know that some will have issues with it being released under the GPL, but giving people the option to use it (i.e. use it if installed, do nothing if not installed) should be reasonable... thoughts?
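The "use if installed, do nothing if not installed" idea could look roughly like the sketch below. The helper name `html_to_plain` is hypothetical; the only html2text call assumed is the module-level `html2text.html2text()` function.

```python
try:
    import html2text  # optional, GPL-licensed third-party package
except ImportError:
    html2text = None


def html_to_plain(html):
    """Return a plain-text rendering of html, or None if html2text is missing."""
    if html2text is None:
        return None  # caller keeps the existing "see attachment" behaviour
    return html2text.html2text(html)


body = html_to_plain("<p>Printer is <b>down</b>.</p>")
print(body if body is not None else "html2text not installed; HTML left as attachment")
```

With this shape, installing html2text is the opt-in, and nothing changes for sites that leave it out.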

rossp commented 9 years ago

Other options might include django.utils.html.strip_tags (which just pulls out all <> tags, which might not lead to a clean outcome) or bleach.

ssadler commented 9 years ago

Seems like it would be difficult to get it working, but at the same time just getting 80% of the cases right would be better than nothing. I haven't used the ticket system "in anger" yet, but I can see it might save a lot of time to show something, e.g.:

- A fairly simple algorithm which doesn't try too hard to make sense of more complicated cases
- A maximum length in lines, beyond which you have to click the attachment to see the rest
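One possible reading of those two points, sketched with the standard library's `html.parser`. The names `TextCollector` and `preview`, the default of 10 lines, and the wording of the truncation marker are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}
    BREAKS = {"br", "p", "div"}  # tags treated as line breaks

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BREAKS:
            self.chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def preview(html, max_lines=10):
    """Return up to max_lines non-empty text lines, plus a marker if cut."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    lines = [ln.strip() for ln in "".join(parser.chunks).splitlines()]
    lines = [ln for ln in lines if ln]
    if len(lines) > max_lines:
        lines = lines[:max_lines] + ["[... see attached HTML for the rest]"]
    return "\n".join(lines)


sample = "<div>Line 1<br>Line 2<br>Line 3</div><style>p{}</style>"
print(preview(sample, max_lines=2))
```

It makes no attempt at tables, quoted replies or nested layouts, which is the point of the "doesn't try too hard" approach: the attachment stays the authoritative copy.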

On the other hand, it doesn't look like many people are asking for it?

gwasser commented 7 years ago

Is there any reason we couldn't use Beautiful Soup to do this? It would be an extra dependency, but it may be a solution: one tutorial says print(soup.get_text()) is enough to get text out of HTML. Does anyone have experience with it?
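For reference, the `get_text()` call mentioned above might be used along these lines. This is a sketch only: the helper name `soup_text` is made up, and the import is guarded here so the snippet still runs where beautifulsoup4 is not installed.

```python
try:
    from bs4 import BeautifulSoup  # third-party package: beautifulsoup4
except ImportError:
    BeautifulSoup = None


def soup_text(html):
    """Return the text content of html, or None if Beautiful Soup is missing."""
    if BeautifulSoup is None:
        return None
    soup = BeautifulSoup(html, "html.parser")
    # One line per text node, with surrounding whitespace stripped.
    return soup.get_text(separator="\n", strip=True)


# Deliberately unclosed <div>, as some mail clients produce.
text = soup_text("<div>Printer is down.<br>Second floor.")
print(text)
```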

reduxionist commented 7 years ago

Yes, I both have experience with Beautiful Soup and agree that it would mostly work for this, enough that it would be worth using. BS is more forgiving of broken HTML than anything in the standard library, and broken HTML has, historically at least, been particularly prevalent in HTML email.

Adding this dependency could also be a step towards evaluating a pluggable app that would replace a larger component of code that includes this functionality. I'm on my mobile at the moment and don't have access to the relevant bookmark right now, but I'll add a link to an established third party package for Django inbound email processing for your consideration once I get back.

reduxionist commented 7 years ago

From issue #504, adding CC: @willstott101

timthelion commented 11 months ago

The email parsing code has been re-written since this bug was submitted.