Vereyon / HtmlRuleSanitizer

A rule based HTML sanitizer built on top of the HTML Agility pack
MIT License
63 stars 19 forks source link

Invalid HTML output when using void tags #37

Open NatalGi opened 9 months ago

NatalGi commented 9 months ago

In case of cleaning HTML as following, sanitizer generates invalid HTML output [closing tags are missing]. Perhaps this is a problem with void tags. I used HtmlSanitizer.SimpleHtml5Sanitizer() in this example.

INPUT

<p><img src="./x.jpg"></p>
<p><img src="./y.jpg"></p>
<p><img src="./z.jpg"></p>
<p>Tekst<br></p>
<p><svg viewBox="0 0 120 120" xmlns="http://www.w3.org/2000/svg"><rect x="10" y="10" width="100" height="100" rx="15"/></svg></p>
<p><input type="text"></p>

OUTPUT

<p><p><p><p>Tekst<br></p><p><p>

Link to .NETFiddle with this case -> https://dotnetfiddle.net/mvVdbQ

cakkermans commented 9 months ago

Hi @NatalGi, I was a bit puzzled by your question; it looks wrong, but the resulting HTML seems to work fine when I test it in a browser and inspect the DOM. This seemed odd to me. So I checked the HTML5 specifications, and found the following in section 13.1.2.4 Optional tags:

A p element's end tag may be omitted if the p element is immediately followed by an address, article, aside, blockquote, details, div, dl, fieldset, figcaption, figure, footer, form, h1, h2, h3, h4, h5, h6, header, hgroup, hr, main, menu, nav, ol, p, pre, search, section, table, or ul element, or if there is no more content in the parent element and the parent element is an HTML element that is not an a, audio, del, ins, map, noscript, or video element, or an autonomous custom element.

So it seems like the output is perfectly fine from a technical point of view: you really can have an empty <p> element by just writing <p> and not closing it.

I haven't really made up my mind about this, because as I said I think it doesn't look nice.

NatalGi commented 9 months ago

The resulting HTML indeed work fine in a browser and generated DOM looks as expected.

cakkermans commented 9 months ago

I did find an option HtmlDocument.OptionWriteEmptyNodes on the backing HTML parser, which when set to true will make the output of your sample looks as follows:

<p /><p /><p /><p>Tekst<br /></p><p /><p />

This output looks less odd to me.

NatalGi commented 9 months ago

@cakkermans You answered arlier that p element's end tag can be omitted if it is followed by one of the tags you mentioned, and in my previous example the DOM generated by the browser was correct, but I have another example.

This example also works in the browser, but now, the nesting of tags in the DOM is different than in input HTML. I put svg after p tag, that is not in the mentioned list, and the sanitizer also omits p end tag before it.

INPUT

<p><img src="./x.jpg"></p>
<p><img src="./y.jpg"></p>
<svg viewBox="0 0 120 120" xmlns="http://www.w3.org/2000/svg">
    <rect x="10" y="10" width="100" height="100" rx="15"/>
</svg>
<p><img src="./z.jpg"></p>
<p>Tekst<br></p>        

OUTPUT

<p><p><svg xmlns="http://www.w3.org/2000/svg"><rect x="10" y="10" width="100" height="100" rx="15"></rect></svg><p><p>Tekst<br></p>

DOM image

Link to .NETFiddle with this case -> https://dotnetfiddle.net/d86PlL

cakkermans commented 8 months ago

Hi @NatalGi, I looked a bit further into the issue you reported. I suspect that this might be an issue with the HTML parser I used for the sanitizer. I have reduced the issue to sanitizing the following HTML code:

<p><svg />

With HTML Agility Pack 1.7.1 the following HTML it output:

<p><svg />

With HTML Agility Pack 1.11.57 the following HTML is output:

<p><svg></svg></p>

This is a lot more in the direction of what would be expected. This leads me to conclude that most likely the HTML Agility Pack had a bug in correctly dealing with optional end tags. I am going to check in which HTML Agility Pack version this is fixed and bump the version required by HTMLRuleSanitizer.