Adding basic email headers breaks HTML

andrewferrier / email2pdf

Script to convert emails to PDF from the command-line, as well as detach recognized attachments. Helps to process incoming emails and assist automatically with a non-paper paperwork workflow. Designed to work in tandem with getmail to convert forwarded emails to PDF automatically.

MIT License

Adding basic email headers breaks HTML #123

Open broth-itk opened 4 years ago

broth-itk commented 4 years ago

The payload is valid HTML, but the header info is prepended in front of the <html> start tag, which makes the resulting document invalid HTML:

if args.headers:
    header_info = get_formatted_header_info(input_email)
    logger.info("Header info is: " + header_info)
    payload = header_info + payload

broth-itk commented 4 years ago

Proposal:

if args.headers:
    header_info = get_formatted_header_info(input_email)
    logger.info("Header info is: " + header_info)
    soup = BeautifulSoup(payload, "html.parser")
    soup.body.insert(1, BeautifulSoup(header_info, "html.parser"))
    payload = str(soup)

andrewferrier commented 4 years ago

@broth-itk I think I get the general intent here, but what specific problem are you trying to solve? In practice, the current behavior generally works for me. I'm cautious about running the email body through the BS parser unless there's a compelling reason to do so.

broth-itk commented 4 years ago

@andrewferrier: Thanks for your feedback! I just wanted to point out that adding the headers makes the HTML invalid. wkhtmltopdf does seem to handle that fine, but IMHO we should feed it proper HTML. In my case I modified the script to add yet another header to the resulting HTML, which might get me into trouble.

I'm not a fan of sending everything through the BS parser either. Maybe it's simpler to search the string for the <body> tag and insert the header code right after it. That would have the same effect IMHO.

Just my 2 cents
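
For illustration, the string-search idea could look roughly like the sketch below. It is not code from email2pdf: the helper name insert_header_info and the regular expression are assumptions made for this example, and the fallback simply keeps the existing prepend behavior when no <body> tag is found.

```python
import re

# Matches the opening <body> tag, with or without attributes, in any letter case.
BODY_OPEN_RE = re.compile(r"<body\b[^>]*>", re.IGNORECASE)


def insert_header_info(payload: str, header_info: str) -> str:
    """Hypothetical helper: place header_info right after the opening <body> tag."""
    match = BODY_OPEN_RE.search(payload)
    if match is None:
        # No <body> tag found: fall back to prepending, as the script does today.
        return header_info + payload
    end = match.end()
    return payload[:end] + header_info + payload[end:]


html = "<html><body class='mail'><p>Hello</p></body></html>"
headers = "<table><tr><td>Subject</td><td>Test</td></tr></table>"
result = insert_header_info(html, headers)
```

A plain regex like this avoids re-serializing the email body, but it can be fooled by a ">" inside an attribute value or by "<body" appearing inside an HTML comment, so it is a trade-off rather than a full replacement for a parser.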