HTML document parsing should handle malformed input

I agree, and it's a trivial change to make the parser ignore the end tag name; however, that can result in potentially unexpected behaviour.

Take for example this HTML:

<body>
  <div>
    lorem
    <span>ipsum <!-- missing </span> -->
    dolor
  </div>
  sit
</body>

By ignoring the mismatched end tag, 'ipsum', 'dolor', and 'sit' all end up in the <span> because the entire document is nested one too many times. A different approach might be to assume an extra matching end tag once a mismatch is encountered, like so:

<body>
  <div>
    lorem
    <span>ipsum <!-- missing </span> -->
    dolor
    </span> <!-- Added by the parser when </div> was encountered, since it doesn't match <span> -->
  </div>
  sit
</body>

With the second approach, 'ipsum' and 'dolor' are both still in the span, but the rest of the document is now at the correct nesting level, and 'sit' is once again in the <body> tag.
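
To make the difference concrete, here is a rough sketch of both approaches run over the example above. The `Token` and `Recovery` types and the `text_parents` function are made up for illustration and are not Skyscraper's actual API. "Ignoring" is taken to mean dropping an end tag that doesn't match the innermost open element, which is an assumption about how the first approach would behave.

```rust
/// Hypothetical token type for this sketch; not Skyscraper's actual API.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Open(&'static str),
    Close(&'static str),
    Text(&'static str),
}

/// The two recovery approaches discussed above (names are assumptions).
#[derive(Debug, Clone, Copy)]
enum Recovery {
    /// Drop an end tag that does not match the innermost open element.
    IgnoreMismatched,
    /// Assume the missing end tags and close down to the matching element.
    InsertMissing,
}

/// Pairs each text node with the tag of the element that ends up containing it.
fn text_parents(tokens: &[Token], recovery: Recovery) -> Vec<(&'static str, &'static str)> {
    let mut open: Vec<&'static str> = Vec::new();
    let mut out = Vec::new();
    for token in tokens {
        match token {
            Token::Open(name) => open.push(*name),
            Token::Text(text) => {
                out.push((*text, open.last().copied().unwrap_or("(root)")));
            }
            Token::Close(name) => {
                if open.last() == Some(name) {
                    // Well-formed case: the end tag closes the innermost element.
                    open.pop();
                } else if let Recovery::InsertMissing = recovery {
                    // Close everything down to and including the matching element.
                    // If nothing matches, the stray end tag is dropped.
                    if let Some(pos) = open.iter().rposition(|tag| tag == name) {
                        open.truncate(pos);
                    }
                }
                // Recovery::IgnoreMismatched: the end tag is simply skipped.
            }
        }
    }
    out
}

/// The example document from this comment, with the </span> missing.
fn example() -> Vec<Token> {
    use Token::*;
    vec![
        Open("body"), Open("div"), Text("lorem"),
        Open("span"), Text("ipsum"), Text("dolor"),
        Close("div"), Text("sit"), Close("body"),
    ]
}
```

Calling `text_parents(&example(), Recovery::IgnoreMismatched)` reports 'sit' as a child of the span, while `Recovery::InsertMissing` reports it as a child of the body. 'ipsum' and 'dolor' stay in the span either way.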

One approach might produce better results in the average case over random pages on the internet, but I don't think either is technically "more correct".

So I think that means malformed input handling needs to be extensible somehow, so different types of handling can be selected by users and added to this project over time.
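
One possible shape for that extension point, purely as a sketch: a trait that the parser consults whenever an end tag doesn't match the innermost open element. Everything here (`MismatchHandler`, `Action`, `handle_end_tag`, and the two handler structs) is a hypothetical name invented for this comment, not something that exists in the project today.

```rust
/// What the parser should do with a mismatched end tag.
/// Hypothetical type for this sketch; not part of the current API.
#[derive(Debug, PartialEq)]
enum Action {
    /// Drop the end tag and leave the open elements untouched.
    SkipEndTag,
    /// Close this many open elements, innermost first.
    Close(usize),
}

/// Extension point: each malformed-input policy is one implementation.
trait MismatchHandler {
    /// `open` lists the currently open elements, outermost first.
    fn on_mismatch(&self, open: &[&str], end_tag: &str) -> Action;
}

/// First approach: ignore the mismatched end tag.
struct IgnoreMismatched;

impl MismatchHandler for IgnoreMismatched {
    fn on_mismatch(&self, _open: &[&str], _end_tag: &str) -> Action {
        Action::SkipEndTag
    }
}

/// Second approach: assume the missing end tags were present.
struct InsertMissing;

impl MismatchHandler for InsertMissing {
    fn on_mismatch(&self, open: &[&str], end_tag: &str) -> Action {
        match open.iter().rposition(|tag| *tag == end_tag) {
            // Close the matching element and everything nested inside it.
            Some(pos) => Action::Close(open.len() - pos),
            // No open element matches at all, so drop the stray end tag.
            None => Action::SkipEndTag,
        }
    }
}

/// Parser-side hook: applies whichever handler the user selected.
fn handle_end_tag(open: &mut Vec<&str>, end_tag: &str, handler: &dyn MismatchHandler) {
    if open.last().copied() == Some(end_tag) {
        // Well-formed case: no handler needed.
        open.pop();
        return;
    }
    let action = handler.on_mismatch(open.as_slice(), end_tag);
    if let Action::Close(count) = action {
        let keep = open.len().saturating_sub(count);
        open.truncate(keep);
    }
}
```

With something like this, a user would pick a policy by passing `&IgnoreMismatched` or `&InsertMissing` to the parser, and new policies could be added later as further implementations of the trait without touching the parser itself.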

I'll try to put up a quicker fix for the first approach since it should be trivial even if it's not my favourite.

James-LG / Skyscraper

HTML document parsing should handle malformed input #13