James-LG / Skyscraper

Rust library for scraping HTML using XPath expressions
MIT License

HTML document parsing should handle malformed input #13

Closed Mordinel closed 1 year ago

Mordinel commented 1 year ago

As it works right now, feeding the parser a malformed or incomplete HTML document produces an error such as EndTagMismatch.

If this is to be used for real-world web scraping, it needs to be able to parse imperfect HTML structures, given that HTML encountered on the web is often malformed or incomplete in various ways.

To handle these scenarios, the parser should fail gracefully, in a way that still lets its users access and extract valuable information from partially parsed documents.

James-LG commented 1 year ago

I agree, and it's a trivial change to make the parser ignore the end tag name. However, that can result in unexpected behaviour.

Take for example this HTML:

<body>
  <div>
    lorem
    <span>ipsum <!-- missing </span> -->
    dolor
  </div>
  sit
</body>

If the parser ignores the mismatched end tag, 'ipsum', 'dolor', and 'sit' all end up in the <span>, because the rest of the document is nested one level too deep. A different approach might be to assume an extra matching end tag once a mismatch is encountered, like so:

<body>
  <div>
    lorem
    <span>ipsum <!-- missing </span> -->
    dolor
    </span> <!-- Added by the parser when </div> was encountered, since it doesn't match <span> -->
  </div>
  sit
</body>

With the second approach, 'ipsum' and 'dolor' are both still in the <span>, but the rest of the document should now be at the correct nesting level, and 'sit' is once again in the <body> tag.
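
To make the difference concrete, here is a toy Rust sketch of the two behaviours on the example above. It is not Skyscraper's parser: `Token`, `Recovery`, `text_parents`, and `example_tokens` are hypothetical names made up for illustration, the input is pre-tokenized by hand, and "ignoring" a mismatched end tag is assumed to mean dropping it while leaving the innermost element open.

```rust
/// Hand-written token stream for a toy parser; real HTML tokenization is out of scope.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Token {
    Open(&'static str),
    Close(&'static str),
    Text(&'static str),
}

/// Hypothetical recovery strategies for an end tag that does not match
/// the innermost open element.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Recovery {
    /// First approach (as assumed here): drop the mismatched end tag.
    SkipEndTag,
    /// Second approach: assume one missing end tag for the innermost
    /// element, then apply the end tag that was actually seen.
    ImplyMissingEndTag,
}

/// Returns each text node paired with the name of the element that contains it.
fn text_parents(tokens: &[Token], recovery: Recovery) -> Vec<(&'static str, &'static str)> {
    let mut open: Vec<&'static str> = Vec::new();
    let mut out = Vec::new();
    for token in tokens {
        match *token {
            Token::Open(name) => open.push(name),
            Token::Text(text) => out.push((text, open.last().copied().unwrap_or("#root"))),
            Token::Close(name) => {
                if open.last() == Some(&name) {
                    open.pop();
                } else if recovery == Recovery::ImplyMissingEndTag {
                    // Implied end tag for the unclosed innermost element.
                    open.pop();
                    // Now apply the real end tag if it matches.
                    if open.last() == Some(&name) {
                        open.pop();
                    }
                }
                // Recovery::SkipEndTag: the mismatched end tag is simply dropped.
            }
        }
    }
    out
}

/// The example document from this thread, with the </span> missing.
fn example_tokens() -> Vec<Token> {
    vec![
        Token::Open("body"),
        Token::Open("div"),
        Token::Text("lorem"),
        Token::Open("span"),
        Token::Text("ipsum"),
        Token::Text("dolor"),
        Token::Close("div"),
        Token::Text("sit"),
        Token::Close("body"),
    ]
}
```

Calling `text_parents(&example_tokens(), Recovery::SkipEndTag)` reports 'sit' inside `span`, while `Recovery::ImplyMissingEndTag` reports it inside `body`. In both cases 'ipsum' and 'dolor' stay in `span`.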

One approach might produce better results in the average case over random pages on the internet, but I don't think either is technically "more correct".


So I think that means malformed input handling needs to be extensible somehow, so different types of handling can be selected by users and added to this project over time.
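
One possible shape for that extension point, sketched in Rust under loud assumptions: none of these names (`MismatchAction`, `MismatchHandler`, `Strict`, `SkipMismatched`, `ImplyOneEndTag`, `apply_end_tag`) exist in Skyscraper. They only illustrate how a user-selectable handler could decide what happens when an end tag does not match the innermost open element.

```rust
/// What to do with a mismatched end tag. Hypothetical, not Skyscraper's API.
#[derive(Debug, Clone, Copy, PartialEq)]
enum MismatchAction {
    /// Abort parsing with an error.
    Fail,
    /// Drop the end tag and leave the open elements untouched.
    SkipEndTag,
    /// Close the innermost element, then apply the end tag if it now matches.
    ImplyEndTag,
}

/// Users would pick (or write) an implementation of this trait.
trait MismatchHandler {
    fn on_mismatch(&self, open_elements: &[&str], end_tag: &str) -> MismatchAction;
}

struct Strict;
struct SkipMismatched;
struct ImplyOneEndTag;

impl MismatchHandler for Strict {
    fn on_mismatch(&self, _open: &[&str], _end_tag: &str) -> MismatchAction {
        MismatchAction::Fail
    }
}

impl MismatchHandler for SkipMismatched {
    fn on_mismatch(&self, _open: &[&str], _end_tag: &str) -> MismatchAction {
        MismatchAction::SkipEndTag
    }
}

impl MismatchHandler for ImplyOneEndTag {
    fn on_mismatch(&self, _open: &[&str], _end_tag: &str) -> MismatchAction {
        MismatchAction::ImplyEndTag
    }
}

/// Toy driver: applies one end tag to the stack of open element names,
/// consulting the handler only when the tag does not match the innermost one.
fn apply_end_tag(
    open: &mut Vec<&'static str>,
    end_tag: &str,
    handler: &dyn MismatchHandler,
) -> Result<(), String> {
    if open.last().map_or(false, |top| *top == end_tag) {
        open.pop();
        return Ok(());
    }
    match handler.on_mismatch(open.as_slice(), end_tag) {
        MismatchAction::Fail => Err(format!("end tag mismatch: </{}>", end_tag)),
        MismatchAction::SkipEndTag => Ok(()),
        MismatchAction::ImplyEndTag => {
            open.pop(); // implied end tag for the innermost element
            if open.last().map_or(false, |top| *top == end_tag) {
                open.pop();
            }
            Ok(())
        }
    }
}
```

Because the handler receives the open elements and the offending tag, new strategies could be added later without touching the driver; whether a trait object, a generic parameter, or a plain enum fits the real parser best is an open design question.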

I'll try to put up a quick fix using the first approach, since it should be trivial, even if it's not my favourite.