MyIntervals / emogrifier

Converts CSS styles into inline style attributes in your HTML code.
https://www.myintervals.com/emogrifier.php
MIT License
906 stars 153 forks source link

HTML5 parsed as HTML4 #1051

Open bytestream opened 3 years ago

bytestream commented 3 years ago

Given valid HTML5 (according to https://validator.nu/) CSSInliner converts <a><table><tr><td>hello world</td></tr></table></a> to <a></a><table><tr><td>hello world</td></tr></table>

<?php

$html = '<!DOCTYPE html><html lang="en"><head><title>hello world</title></head><body><a><table><tr><td>hello world</td></tr></table></a></body></html>';

echo CssInliner::fromHtml($html)->inlineCss()->render();
/*
<!DOCTYPE html>
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>hello world</title>
</head>
<body>
<a></a><table><tr><td>hello world</td></tr></table>
</body>
</html>
*/

Relates to #828

JakeQZ commented 3 years ago

Hi Kieran,

Good find.

I created a branch with some tests for the transparent content model (including <ins> and <del> as well as <a> for the transparent element, and a few other examples of flow content for it to contain). See https://github.com/MyIntervals/emogrifier/commit/19103dffe505de4ddb363c2197d128f14628e6a1.

It seems only the specific case of <table> being a child of <a> fails. If the <table> is wrapped in a <div> (which I would suggest as a workaround), it parses OK.

Note that it also seems to fail with PHP 8 (though only ran with lowest dependencies before aborting due to other failures), which was supposed to have some improvements to DOM handling. Behaviour is the same on my Windows system as the (presumably Linux) build server.

I think this bug should be reported to those concerned. If it's with PHP or an extension (i.e. DOM), there's bugs.php.net; if it's with libxml, I don't know what the reportage lines are but am sure some exist. (Sounds like it shouldn't be with libxml by name but I don't know how much HTML handling is intertwined with it.)

In the meantime, this does add some gravitas to getting #828 resolved...

You will have hopefully seen #994. As part of that, we would like to decouple the HTML parsing/rendering from the processing. Doing so would make it easier to resolve #828 more cleanly. You would be more than welcome to contribute to the discussion there, particularly on the aspect of decoupling the HTML parsing/rendering.

Cheers, Jake

bytestream commented 3 years ago

Hey Jake!

Thanks for doing some work on this.

It seems only the specific case of <table> being a child of fails

I think there were a few other nuisances identified. I'd have to check but given libxml2 lacks HTML5 (read on) I think such issues are to be expected.

which was supposed to have some improvements to DOM handling

As far as I'm aware the changes in PHP 8 just implement a couple of new methods. The changes don't actually affect the parser

I think this bug should be reported to those concerned

libxml2 lacks HTML5 support (https://gitlab.gnome.org/GNOME/libxml2/-/issues/211) so it's more of a feature request than a bug. It would be faster to use a user land implementation than wait for that issue to resolve.

In the meantime, this does add some gravitas to getting #828 resolved...

Yeah sorry I should have been clearer with my motivation when opening #828 . I figured I'd post this to try and bring some life back into that issue.

JakeQZ commented 3 years ago

libxml2 lacks HTML5 support (https://gitlab.gnome.org/GNOME/libxml2/-/issues/211) so it's more of a feature request than a bug.

That's if this specific issue (and similar) stem from libxml2 rather than ext-dom. I don't see any mention of the transparent content model in that issue report. I am not sure (it is hard to find specific outdated specs without a bit of research which I'm not going to do right now), but my instinct is that this was valid HTML4 as well.

It would be faster to use a user land implementation than wait for that issue to resolve.

I am inclined to agree, but I think the official avenue(s) should be pushed as well, simultaneously. In the long run, it would be better not to have to use "user land" solutions, but they can be a good stop gap in the meantime.

Yeah sorry I should have been clearer with my motivation when opening #828

I/we understand the issues. For example, see #653; there was a relatively simple workaround in that case.

I think there were a few other nuisances identified.

If there were, and you can re-find them, they would be really useful for testing.

JakeQZ commented 3 years ago

I think there were a few other nuisances identified.

I just re-found some relating to optional tag omission. The "broken" comments in the test data providers refer to cases where DOMDocument does not handle tag omission as it should. I think I added these tests out of concern that the Masterminds parser would not handle tag omission properly. I seem to remember it had problems with <html>, <head> and <body> tag omission, but dealt much better than DOMDocument with other tag omissions.

bytestream commented 3 years ago

I think there were a few other nuisances identified.

There's several reports in both Symfony and Roundcube which is why they added optionally parsing via masterminds/html5. I couldn't find any more specific cases though from a quick look.

I seem to remember it had problems with , and tag omission

Yeah correct. It's fairly easy to workaround this specific issue though.