OWASP / java-html-sanitizer

Takes third-party HTML and produces HTML that is safe to embed in your web application. Fast and easy to configure.
Other
843 stars 213 forks source link

Nested lists get sanitised incorrectly #209

Open pcelentano opened 3 years ago

pcelentano commented 3 years ago

An invalid list, still parsed correctly by browsers and email clients of the sort:

<ul>
 <li>Topic 1</li>
 <ul>
  <li>Item 1</li>
  <li>Item 2</li>
  <ul>
   <li>Sub 1 </li>
   <li>Sub 2</li>
   <li>Sub 3</li>
  </ul>
  <li>Item 3</li>
 </ul>
</ul>

Gets parsed as:

<ul>
  <li>Topic 1</li>
  <li>
    <ul>
      <li>Item 1</li>
      <li>Item 2</li>
      <li>
        <ul>
          <li>Sub 1 </li>
          <li>Sub 2</li>
          <li>Sub 3</li>
        </ul>
      </li>
      <li>Item 3</li>
    </ul>
  </li>
</ul>

which is definitely not the intention of the original html. The proper desired output would be:

<ul>
  <li>Topic 1
    <ul>
      <li>Item 1</li>
      <li>Item 2
        <ul>
          <li>Sub 1 </li>
          <li>Sub 2</li>
          <li>Sub 3</li>
        </ul>
      </li>
      <li>Item 3</li>
    </ul>
  </li>
</ul>

Another proper desired solution would be to allow a change in this tag balancing behaviour. I see HtmlElementTables.impliedElements handles this logic and there is even this comment

    // If we require above that all <li>s appear in a <ul> or <ol> then
    // for symmetry, we require here that all content of a <ul> or <ol> appear
    // nested in a <li>.
    // This does not have the same security implications as the above, but is
    // symmetric.

But there is no way of overriding this behaviour

mikesamuel commented 3 years ago

If the sanitizer produces structurally invalid markup, there's a greater risk of HTML parsers getting confused so I can't do that.

If I allow overriding umpteen behaviors, then testing with a handful of configurations doesn't give confidence in most of the possible configurations.

If I were to add a class to elements synthesized in this way, then you could try to fix it up with some CSS.

<style>
  li.synthesized { list-style: none }
</style>

<ul>
 <li>Topic 1</li>
 <li class="synthesized"><ul>
  <li>Item 1</li>
  <li>Item 2</li>
  <li class="synthesized"><ul>
   <li>Sub 1 </li>
   <li>Sub 2</li>
   <li>Sub 3</li>
  </ul></li>
  <li>Item 3</li>
 </ul></li>
</ul>

Unfortunately, my CSS isn't strong enough to know whether there's a way to disable list-style when the only child of an <li> is a <ul> or <ol>. Something like the below would work, but isn't supported in CSS IIRC:

li:has(> ul:only-child), li:has(> ol:only-child) {
  list-style: none
}
simon-greatrix commented 3 years ago

It seems to me that arguing that an invalid meaning has a precise meaning is twisting the meaning of "invalid". Within a <ul> or <ol> all non-whitespace content has to be inside a <li> tag. Upon encountering content outside of a <li> there are four possibilities:

1) Discard it 2) Surround it in a <li> 3) Add it to the previous <li> if there is one, otherwise do what? 4) Add it to the succeeding <li> if there is one, otherwise do what?

Obviously none of these are definitively correct, because the input is invalid. Furthermore, assuming a tag may extends past its explicit end as @pcelentano argues for seems a perverse treatment of a valid tag.

In this case option 2 is repeatable everywhere and does not discard potentially important content. I believe the current implementation provides the best solution possible.