JuliaWeb / Gumbo.jl

Julia wrapper around Google's gumbo C library for parsing HTML

Strange behaviour with autoclosing tags #84

Closed BenjaminGalliot closed 4 years ago

BenjaminGalliot commented 4 years ago

Hello!

I noticed some strange behaviour, which can sometimes be useful but is rather dangerous in many cases: when a tag that must not be self-closing (such as <a>) is written in self-closing form anyway, it is treated as a plain opening tag and stays open until the parent's closing tag, which changes the tree structure.

using Gumbo

test = """<p>A simple <em>paragraph</em> with <br/> a <b>bad</b> <a href="ref"/>link <em>(which does not exist)</em>!</p>"""
doc = parsehtml(test, preserve_whitespace=true)
HTML Document:
<!DOCTYPE >
<HTML>
  <head></head>
  <body>
    <p>
      A simple
      <em>
        paragraph
      </em>
      with
      <br></br>
      a
      <b>
        bad
      </b>

      <a href="ref">
        link
        <em>
          (which does not exist)
        </em>
        !
      </a>
    </p>
  </body>
</HTML>

I think it would be safer to be more conservative and place the closing tag immediately after the opening one, without enclosing the text that follows.

I think the following result (which I wrote by hand as an example) would have been more consistent:

…
      with
      <br></br>
      a
      <b>
        bad
      </b>

      <a href="ref"></a>
      link
      <em>
        (which does not exist)
      </em>
      !
…

Another example, more visible:

test = """<p>A simple <em>paragraph</em> with <br/> a <b/>bad bold and a bad <a href="ref"/>link <em>(which does not exist)</em>!</p>"""
doc = parsehtml(test, preserve_whitespace=true)
HTML Document:
<!DOCTYPE >
<HTML>
  <head></head>
  <body>
    <p>
      A simple
      <em>
        paragraph
      </em>
      with
      <br></br>
      a
      <b>
        bad bold and a bad
        <a href="ref">
          link
          <em>
            (which does not exist)
          </em>
          !
        </a>
      </b>
    </p>
  </body>
</HTML>

I don’t know whether this is a bug or a feature, but if it is a feature, an optional argument to change this behaviour at will would be nice.

Thank you for your work, anyway!

aviks commented 4 years ago

Is this something we can control? This looks like something that the underlying gumbo C library does, in which case there is not much we can do about it, I suppose.

BenjaminGalliot commented 4 years ago

I found this old topic. It seems we cannot do much at our level, sadly… And in my case, the malformed HTML also comes from ebooks!
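
Since the parser follows the HTML5 rule that a trailing slash on a non-void element is ignored, one possible workaround is to preprocess the string before handing it to parsehtml. The following is only a minimal sketch: expand_selfclosing and VOID_ELEMENTS are hypothetical names, not part of Gumbo.jl, and the regex is an approximation that assumes attribute values contain neither "<" nor ">" and that the input has no comments, scripts or CDATA sections containing tag-like text.

```julia
# Void elements of the HTML standard: they never take a closing tag,
# so a trailing slash on them is harmless and is left untouched.
const VOID_ELEMENTS = Set(["area", "base", "br", "col", "embed", "hr", "img",
                           "input", "link", "meta", "source", "track", "wbr"])

# Matches a self-closing start tag such as <a href="ref"/>.
# Capture 1 is the tag name, capture 2 the attributes (possibly empty).
const SELFCLOSING = r"<([A-Za-z][A-Za-z0-9]*)((?:\s[^<>]*?)?)\s*/>"

# Hypothetical helper: rewrite <tag .../> as <tag ...></tag> for every
# non-void tag, so that an HTML5 parser sees an explicitly closed, empty
# element instead of an element that stays open.
function expand_selfclosing(html::AbstractString)
    return replace(html, SELFCLOSING => function (tag)
        # replace() passes the matched substring, so match again to get captures.
        name, attrs = match(SELFCLOSING, tag).captures
        lowercase(name) in VOID_ELEMENTS ? tag : "<$name$attrs></$name>"
    end)
end

test = """<p>A simple <em>paragraph</em> with <br/> a <b/>bad bold and a bad <a href="ref"/>link!</p>"""
println(expand_selfclosing(test))
# With Gumbo loaded, the cleaned string could then be parsed as before:
# doc = parsehtml(expand_selfclosing(test), preserve_whitespace=true)
```

With this preprocessing, <b/> and <a href="ref"/> become empty elements followed by sibling text, while <br/> is kept as it is. A regex is a blunt tool for HTML, so this is only reasonable for input whose quirks are known in advance, such as the ebook markup mentioned above.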

aviks commented 4 years ago

Yeah, I read the linked issue, and the maintainers are pretty clear in their stance. I will close this as "out of scope". Thanks for raising the issue and researching it -- it's good to have this documented.