OWASP / java-html-sanitizer

Takes third-party HTML and produces HTML that is safe to embed in your web application. Fast and easy to configure.
Other
847 stars 213 forks source link

SVG path markup bug and Changing default escapingMode on the HtmlStreamRenderer #122

Open dpena opened 7 years ago

dpena commented 7 years ago

The problem I'm trying to solve

I was trying to use this tool for the stripping all dangerous markup. One thing I'm trying to do though is allow the user to input custom {{ icon }} which I then use Handlebars to replace with something else. So the workflow is:

  1. User input
  2. Convert icon to svg markup using handlebars
  3. Pass the full input through the sanitizer
  4. Output to page.

Bug 1?

However I noticed that when allowingElement svg and path to the Policy the markup for paths break.

The sanitizer turns an svg and its path like this:

<svg xmlns="http://www.w3.org/2000/svg" width="24" height="24" viewBox="0 0 24 24" class="someclass">
    <path id="bounds" opacity="0" d="M0 0h24v24H0z"/>
    <path d="[svg content1]"/>
    <path d="[svg content2]"/>
</svg>

into this:

<path d="M 100 100 L 300 100 L 200 300 z"
        fill="red" stroke="blue" stroke-width="3" ></path>

<svg xmlns="http://www.w3.org/2000/svg" width="24" height="24" viewbox="0 0 24 24" class="htc-icon htc-icon--checkmark_circle test">
--
<svg xmlns="http://www.w3.org/2000/svg" width="24" height="24" viewbox="0 0 24 24" class="someclass">
 <path id="bounds" opacity="0" d="M0 0h24v24H0z"> 
   <path d="[svg content1]"> 
      <path d="[svg content2]"></path>
    </path>
  </path>
</svg>

However, that isn't valid. Note how the paths get nested.

My code

I'm using the SlashdotPolicy as an example and made some changes:

new HtmlPolicyBuilder()
      .allowStandardUrlProtocols()
      .allowCommonInlineFormattingElements()
      // Allow title="..." on any element.
      .allowAttributes("title").globally()
      // Allow href="..." on <a> elements.
      .allowAttributes("href").onElements("a")
      // Defeat link spammers.
      .requireRelNofollowOnLinks()
      // Allow lang= with an alphabetic value on any element.
      .allowAttributes("lang").matching(Pattern.compile("[a-zA-Z]{2,20}"))
      .globally()
      // The align attribute on <p> elements can have any value below.
      .allowAttributes("align")
      .matching(true, "center", "left", "right", "justify", "char")
      .onElements("p")
      // These elements are allowed.
      .allowElements(
          "a", "p", "div", "i", "b", "em", "blockquote", "tt", "strong",
          "br", "ul", "ol", "li")
      // Allow the SVGs we're using
      .allowElements("svg")
      // Allow the svg Attributes
      .allowAttributes("xmlns", "class", "width", "height", "viewbox")
      .onElements("svg")
      .allowElements("path")
      // Allow the path Attributes
      .allowAttributes("id", "opacity", "d").onElements("path")
      .toFactory();

Bug 2

So because of how this was happening I swapped the order of my workflow:

  1. User input
  2. Pass the full input through the sanitizer
  3. Convert icon to svg markup using handlebars
  4. Output to page.

When i tried to sanitize first and convert the handlebars to my icons, the markup would change:

"{{" ---> "{<!-- -->{"

when looking at the tests, it looks like it is intentional. (https://github.com/OWASP/java-html-sanitizer/blob/34425c15adb7bd32609cefc01e8f8d4534a3cf34/src/test/java/org/owasp/html/EncodingTest.java#L231) That's ok, but it would be nice to change the default HtmlStreamRenderer.escapingMode of text.

My current solution

In my case I am using the following workflow:

  1. User input
  2. Pass the full input through the sanitizer
  3. Run a replace all on the sanitizer output
    • output.toString().replaceAll("\\{\\<!-- +--\\>\\{", "{{");
  4. Convert icon to svg markup using handlebars
  5. Output to page.

This isn't really ideal. But it will work for now.

jmanico commented 7 years ago

I was trying to use this tool for the stripping all dangerous markup. One thing I'm trying to do though is allow the user to input custom {{ icon }} which I then use Handlebars to replace with something else.

I'm sure Mike will jump in when he has time, but my take is this is not the purpose of the tool. The HTML sanitizer is only meant to accept snippets of HTML markup for sanitization from tools like TinyMCE or from web features that allows users to include some HTML tags in content. This tool is not intended to sanitize per-rendered template markup that is not standard HTML. The HTML sanitizer also takes a fairly aggressive stance on parsing which focused primarily on safety. When it sees things that are not standard markup it delints and cleans up. It also has a fairly aggressive and safe perspective on handling HTML comments.

The SVG issue looks like a potential legit bug....

Anyhow, please give Mike some time to respond, he'll do so as soon as he can.

Aloha,

Jim

On 8/15/17 6:08 PM, dpena wrote:

The problem I'm trying to solve

I was trying to use this tool for the stripping all dangerous markup. One thing I'm trying to do though is allow the user to input custom {{ icon }} which I then use Handlebars to replace with something else. So the workflow is:

  1. User input
  2. Convert icon to svg markup using handlebars
  3. Pass the full input through the sanitizer
  4. Output to page.

    Bug 1?

However I noticed that when allowingElement svg and path to the Policy the markup for paths break.

The sanitizer turns an svg and its path like this:

|<svg xmlns="http://www.w3.org/2000/svg" width="24" height="24" viewBox="0 0 24 24" class="someclass"> <path id="bounds" opacity="0" d="M0 0h24v24H0z"/> |

into this:

|<path d="M 100 100 L 300 100 L 200 300 z" fill="red" stroke="blue" stroke-width="3" > <svg xmlns="http://www.w3.org/2000/svg" width="24" height="24" viewbox="0 0 24 24" class="htc-icon htc-icon--checkmark_circle test"> -- <svg xmlns="http://www.w3.org/2000/svg" width="24" height="24" viewbox="0 0 24 24" class="someclass"> |

However, that isn't valid. Note how the paths get nested.

My code

I'm using the SlashdotPolicy as an example and made some changes:

|new HtmlPolicyBuilder() .allowStandardUrlProtocols() .allowCommonInlineFormattingElements() // Allow title="..." on any element. .allowAttributes("title").globally() // Allow href="..." on elements. .allowAttributes("href").onElements("a") // Defeat link spammers. .requireRelNofollowOnLinks() // Allow lang= with an alphabetic value on any element. .allowAttributes("lang").matching(Pattern.compile("[a-zA-Z]{2,20}")) .globally() // The align attribute on

elements can have any value below. .allowAttributes("align") .matching(true, "center", "left", "right", "justify", "char") .onElements("p") // These elements are allowed. .allowElements( "a", "p", "div", "i", "b", "em", "blockquote", "tt", "strong", "br", "ul", "ol", "li") // Allow the SVGs we're using .allowElements("svg") // Allow the svg Attributes .allowAttributes("xmlns", "class", "width", "height", "viewbox") .onElements("svg") .allowElements("path") // Allow the path Attributes .allowAttributes("id", "opacity", "d").onElements("path") .toFactory(); |

Bug 2

So because of how this was happening I swapped the order of my workflow:

  1. User input
  2. Pass the full input through the sanitizer
  3. Convert icon to svg markup using handlebars
  4. Output to page.

When i tried to sanitize first and convert the handlebars to my icons, the markup would change:

"{{" ---> "{{"

when looking at the tests, it looks like it is intentional. (https://github.com/OWASP/java-html-sanitizer/blob/34425c15adb7bd32609cefc01e8f8d4534a3cf34/src/test/java/org/owasp/html/EncodingTest.java#L231) That's ok, but it would be nice to change the default HtmlStreamRenderer.escapingMode of text.

My current solution

In my case I am using the following workflow:

  1. User input

  2. Pass the full input through the sanitizer

  3. Run a replace all on the sanitizer output

    • output.toString().replaceAll("{\<!-- +-->{", "{{");
  4. Convert icon to svg markup using handlebars

  5. Output to page.

This isn't really ideal. But it will work for now.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/OWASP/java-html-sanitizer/issues/122, or mute the thread https://github.com/notifications/unsubscribe-auth/AAgcCaVv-HGVsB22P2eZf8kgkSollKPDks5sYhbFgaJpZM4O4LTO.

-- Jim Manico Manicode Security https://www.manicode.com

mikesamuel commented 6 years ago

Right now, there's no handling of foreign XML contexts.

I think the simplest fix would be to add SVG to the table of element and attribute definitions.

I think MathML in HTML has been deprecated (see support), so I think I can

  1. add elements in https://www.w3.org/TR/SVG/eltindex.html to containment-checks and regenerate the tables.
  2. change the element stack to allow svg elements when an <svg> element is on the stack
  3. change the renderer to remap tag names to element names when in an SVG context.
    Tag Name Element Name
    altglyph altGlyph
    ... ...

Hopefully this should fix bug 1 and obviate bug 2.

wookie41 commented 5 years ago

@mikesamuel Hi. I've ran into the same problem (bug 1) that @dpena , so I assume, that you didn't regenerate the tables. I've tried doing it myself - I added a new element to the containment-checks table and ran rebuild.sh, but didn't notice any change. Can you provide with a step by step guide for regenerating the tables, so people can add custom markup?

mikesamuel commented 5 years ago

@wookie41, I haven't run that in a while. What did you try? I'd probably start by adding SVG elements like svg, path, etc. to

https://github.com/OWASP/java-html-sanitizer/blob/0c04e4be08a25ebc396784dce72c65a2b8f6665f/empiricism/html-containment.html#L45-L68

wookie41 commented 5 years ago

@mikesamuel I added the elements to html-containment.html and ran rebuild.sh. It said that the scripts has finished successfully and that it copied the generated class to ../src/main/java/org/owasp/html/HtmlElementTablesCanned.java, but I didn't see the elements that I added in it.

wookie41 commented 5 years ago

@mikesamuel up?

jonathan-potter commented 3 years ago

I recently attempted to use this to sanitize svgs. I also ran into bug 1.

shasait commented 1 year ago

Hi - I have the same issue for sanitizing SVG - and I took some deeper look into bug 1. For me it looks like empty-element-syntax (e.g. <polygon ... />) is not correctly processed by this code: https://github.com/OWASP/java-html-sanitizer/blob/main/src/main/java/org/owasp/html/HtmlSanitizer.java#L186 If the TAGEND-token content at this point is /> and not only >, then a new flag needs to be set: directlyCloseTagAfterOpenTag = true. Then after https://github.com/OWASP/java-html-sanitizer/blob/main/src/main/java/org/owasp/html/HtmlSanitizer.java#L194 an if-block is needed checking the flag - if true call directly receiver.closeTag(...) and reset the flag. This way empty-element-syntax always leads to openTag and closeTag called - otherwise only openTag is called and this TagBalancing logic will call closeTag at incorrect point.