fb55 / htmlparser2

The fast & forgiving HTML and XML parser
https://feedic.com/htmlparser2
MIT License
4.45k stars, 377 forks

How should `<1` be parsed? #703

Closed: georgecrawford closed this issue 3 years ago

georgecrawford commented 3 years ago

Hi,

Forgive my ignorance, but I'm using this library (via `sanitize-html`, but that's irrelevant) and I was surprised to see that the string `<1 people affected` was 'sanitized' to an empty string. As far as I can see, the issue lies in `htmlparser2`: no events are fired except `onend` when parsing this string, so no text can be captured.

If I set the HTML of a document to `<1 people affected`, browsers treat it as invalid HTML and display the same string as plain text. What is the expected behaviour of `htmlparser2` in this case? If it's not designed to work with invalid HTML, is there a way that I can determine that the HTML is invalid, or in some other way reproduce what browsers tend to do?
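For context on the browser behaviour described above: in the HTML tokenizer's "tag open" state, a `<` only begins markup when the next character is an ASCII letter, `/`, `!` or `?`; anything else is a parse error and the `<` is emitted as text. The snippet below is a rough sketch of that one rule, not how `htmlparser2` or `sanitize-html` work internally. The function names `isMarkupStart` and `literalLessThanPositions` are made up for illustration, and the sketch ignores context such as comments, `<script>` content and attribute values.

```javascript
// Sketch of the HTML "tag open" rule: "<" starts markup only when followed
// by an ASCII letter, "/", "!" or "?". Hypothetical helper, not a library API.
function isMarkupStart(input, index) {
  if (input[index] !== "<") return false;
  const next = input[index + 1];
  // A "<" at the very end of the input is also treated as text.
  return next !== undefined && /[A-Za-z\/!?]/.test(next);
}

// Hypothetical helper: positions of "<" characters that would be treated
// as literal text under the rule above (context-free approximation).
function literalLessThanPositions(input) {
  const positions = [];
  for (let i = 0; i < input.length; i++) {
    if (input[i] === "<" && !isMarkupStart(input, i)) {
      positions.push(i);
    }
  }
  return positions;
}

console.log(literalLessThanPositions("<1 people affected")); // [ 0 ]
console.log(literalLessThanPositions("<b>bold</b> and 2 <3"));
```

A check like this could flag input where a stray `<` is likely to be text rather than a tag, but it is only an approximation of what a full tokenizer does.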

Since this is not a dangerous string to display in a browser, I would like `sanitize-html` to return the original string with no changes, but it can't do that if `htmlparser2` doesn't fire any events.

I'm very willing to admit that I've missed the point of one or more of these libraries, so please feel free to help me understand!

georgecrawford commented 3 years ago

Aah, I've just found https://github.com/apostrophecms/sanitize-html/issues/79 and https://github.com/fb55/htmlparser2/issues/156#issuecomment-590067942, which together led me to upgrade `sanitize-html`, and this is now fixed. Sorry for not having tested with the latest version!