Closed sh-at-cs closed 8 months ago
I tracked it down to the changes made in https://github.com/FelixSchwarz/mjml-python/commit/84c495da20a91640a1ca551ace17df7f3be644aa (https://github.com/FelixSchwarz/mjml-python/pull/45), where the formatter
argument to BeautifulSoup's decode_contents
was explicitly set to None
. If it is set to "minimal"
instead (the default and prior value), HTML escape sequences are left as they are.
But it seems like (part of) the whole point of https://github.com/FelixSchwarz/mjml-python/pull/45 was to not leave HTML entities as they are so as to fix https://github.com/FelixSchwarz/mjml-python/issues/44...
@caseyjhol Care to comment? :slightly_smiling_face:
Ah I could've sworn I tested and accounted for this scenario, but clearly I missed the mark. I think the better approach then might be to limit the scope to mjStyle
rather than all HTML.
@caseyjhol Thanks for the quick response! You're right, you did add an example of a very similar situation: https://github.com/FelixSchwarz/mjml-python/blob/713d0aad96837d12f83627736eaa35e18ff3b9ec/tests/testdata/css-inlining.mjml#L22 https://github.com/FelixSchwarz/mjml-python/blob/713d0aad96837d12f83627736eaa35e18ff3b9ec/tests/testdata/css-inlining-expected.html#L127
But as it turns out, the entities are in fact "un-escaped" during rendering of the test MJML file as described in this issue - the reason the test passes anyway is that the space between <
and Hello
makes HTMLCompare consider them equal to < Hello
(the rendered output). Minimal example:
# no exception:
htmlcompare.assert_same_html("< script >", "< script >")
# exception:
htmlcompare.assert_same_html("<script >", "<script >")
I haven't checked HTMLCompare's code, but I think this is probably because putting a space between <
and the tag name makes it stop being an HTML tag, so the <
is in fact equivalent to <
in this context.
But that is exactly the case that people don't need to guard against with sanitization/escaping. So the fix for this issue should add another test case (or extend this one) with an escaped HTML tag like <script>
.
@sh-at-cs Thank you for reporting this issue, including the detailed analysis. I think this is a serious issue which we need to fix. I'll try to spend some time on this later - either on reviewing a solution or trying to fix this myself. When we merged #45 somehow the security implications completely escaped my attention.
Unfortunately all could do yesterday was to add the test case in a new branch fix-escaped-html-tags.
How should we fix this issue? I see two approaches (just brainstorming here):
<mj-style>
should be kept as-isWould that solve the issue if we also remove the formatter=None
? As far as I could see, css_inline
will keep the escaped sequences (as it should).
Going to try to dig into this today a bit.
I pushed some additional code in the branch fix-escaped-html-tags. Basically the idea is to do the unescape only for contents of <mj-style>
.
I also added CSS parsing using tinycss2. My idea is that this would blow up if the contents would be invalid (e.g. other HTML code) but I did not test that. Not sure if it is worth the additional dependency because arbitrary CSS can influence the displayed contents...
Update: It seems like tinycss happily parses even completely invalid HTML but I think css_inline would remove that. Should we just assume that the <mj-style>
does not contain untrusted user input? (and maybe add a section to the readme?)
Also maybe you can also check the security advisory draft I created for this issue. Feel free to suggest additions and please check if you agree with the severity classification.
Consider:
In the resulting HTML output, the formerly HTML-escaped
<
(<
) and>
(>
) are "un-escaped", so the rendered HTML actually containsPretty unsafe: <script>
.Why does this happen? This reverses the user's safety measures and can be dangerous.
The MJML reference implementation doesn't do this and correctly keeps such escape sequences untouched: https://mjml.io/try-it-live/fvvhZhdu9V