Escaped HTML tags are "un-escaped" when rendering HTML

sh-at-cs commented 9 months ago

Consider:

from mjml import mjml_to_html

result = mjml_to_html("""
<mjml>
  <mj-body>
    <mj-text>
      Pretty unsafe: &lt;script&gt;
    </mj-text>
  </mj-body>
</mjml>""")

print(result["html"])

In the resulting HTML output, the formerly HTML-escaped < (<) and > (&gt) are "un-escaped", so the rendered HTML actually contains Pretty unsafe: <script>.

Why does this happen? This reverses the user's safety measures and can be dangerous.

The MJML reference implementation doesn't do this and correctly keeps such escape sequences untouched: https://mjml.io/try-it-live/fvvhZhdu9V

sh-at-cs commented 9 months ago

I tracked it down to the changes made in https://github.com/FelixSchwarz/mjml-python/commit/84c495da20a91640a1ca551ace17df7f3be644aa (https://github.com/FelixSchwarz/mjml-python/pull/45), where the formatter argument to BeautifulSoup's decode_contents was explicitly set to None. If it is set to "minimal" instead (the default and prior value), HTML escape sequences are left as they are.

But it seems like (part of) the whole point of https://github.com/FelixSchwarz/mjml-python/pull/45 was to not leave HTML entities as they are so as to fix https://github.com/FelixSchwarz/mjml-python/issues/44...

@caseyjhol Care to comment? :slightly_smiling_face:

caseyjhol commented 9 months ago

Ah I could've sworn I tested and accounted for this scenario, but clearly I missed the mark. I think the better approach then might be to limit the scope to mjStyle rather than all HTML.

sh-at-cs commented 9 months ago

@caseyjhol Thanks for the quick response! You're right, you did add an example of a very similar situation: https://github.com/FelixSchwarz/mjml-python/blob/713d0aad96837d12f83627736eaa35e18ff3b9ec/tests/testdata/css-inlining.mjml#L22 https://github.com/FelixSchwarz/mjml-python/blob/713d0aad96837d12f83627736eaa35e18ff3b9ec/tests/testdata/css-inlining-expected.html#L127

But as it turns out, the entities are in fact "un-escaped" during rendering of the test MJML file as described in this issue - the reason the test passes anyway is that the space between < and Hello makes HTMLCompare consider them equal to < Hello (the rendered output). Minimal example:

# no exception:
htmlcompare.assert_same_html("&lt; script &gt;", "< script >")
# exception:
htmlcompare.assert_same_html("&lt;script &gt;", "<script >")

I haven't checked HTMLCompare's code, but I think this is probably because putting a space between < and the tag name makes it stop being an HTML tag, so the < is in fact equivalent to < in this context.

But that is exactly the case that people don't need to guard against with sanitization/escaping. So the fix for this issue should add another test case (or extend this one) with an escaped HTML tag like <script>.

FelixSchwarz commented 9 months ago

@sh-at-cs Thank you for reporting this issue, including the detailed analysis. I think this is a serious issue which we need to fix. I'll try to spend some time on this later - either on reviewing a solution or trying to fix this myself. When we merged #45 somehow the security implications completely escaped my attention.

FelixSchwarz commented 8 months ago

Unfortunately all could do yesterday was to add the test case in a new branch fix-escaped-html-tags.

How should we fix this issue? I see two approaches (just brainstorming here):

contents of <mj-style> should be kept as-is
possibly parse the contents using tinycss2 to ensure this is really just CSS

Would that solve the issue if we also remove the formatter=None? As far as I could see, css_inline will keep the escaped sequences (as it should).

caseyjhol commented 8 months ago

Going to try to dig into this today a bit.

FelixSchwarz commented 8 months ago

I pushed some additional code in the branch fix-escaped-html-tags. Basically the idea is to do the unescape only for contents of <mj-style>.

I also added CSS parsing using tinycss2. My idea is that this would blow up if the contents would be invalid (e.g. other HTML code) but I did not test that. Not sure if it is worth the additional dependency because arbitrary CSS can influence the displayed contents...

Update: It seems like tinycss happily parses even completely invalid HTML but I think css_inline would remove that. Should we just assume that the <mj-style> does not contain untrusted user input? (and maybe add a section to the readme?)

FelixSchwarz commented 8 months ago

Also maybe you can also check the security advisory draft I created for this issue. Feel free to suggest additions and please check if you agree with the severity classification.

FelixSchwarz / mjml-python

Escaped HTML tags are "un-escaped" when rendering HTML #52