Unibeautify / sparser

A framework of various language parsers

Misleading HTML Unfinished Tags #84

Closed FlorianRappl closed 5 years ago

FlorianRappl commented 5 years ago

I came to your project via HN - quite impressive and nicely done, good job!

I tried playing around with HTML and was surprised that it seems to have some of the spec implemented correctly, e.g., <ul><li><li></ul> is handled correctly (apparently due to #68). However, trying the same with <p><p> did not work so nicely.

Consider the following fully legit HTML:

<!doctype html>
<html lang="en">
<title>Example</title>
    <p>
    <p>
</html>

There are two paragraphs in there, as paragraphs are also closed automatically according to the W3C spec. The W3C validator also sees no issues at all here.
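To illustrate the rule, here is a minimal sketch of how an implied </p> could be resolved. It is not sparser's implementation: the function name resolveImpliedEndTags and the event strings are hypothetical, and the logic is simplified to "a <p> start tag closes a <p> that is the current open element", which covers the example above but not the full set of rules in the spec.

```typescript
// Hypothetical helper, not part of sparser. Walks the tags of an HTML string
// and reports every end tag, marking the ones that had to be implied.
function resolveImpliedEndTags(html: string): string[] {
  const events: string[] = [];
  const open: string[] = []; // stack of currently open element names
  // Matches start and end tags only; <!doctype ...> and comments are skipped.
  const pattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>/g;
  let match: RegExpExecArray | null;

  while ((match = pattern.exec(html)) !== null) {
    const closing = match[1] === "/";
    const name = match[2].toLowerCase();

    if (!closing) {
      // Simplified rule: a new <p> ends a <p> that is still the current element.
      if (name === "p" && open[open.length - 1] === "p") {
        open.pop();
        events.push("</p> (implied)");
      }
      open.push(name);
      events.push(`<${name}>`);
    } else if (open.indexOf(name) !== -1) {
      // An explicit end tag also ends every element still open inside it.
      while (open.length > 0) {
        const top = open.pop() as string;
        if (top === name) {
          break;
        }
        events.push(`</${top}> (implied)`);
      }
      events.push(`</${name}>`);
    }
  }
  return events;
}
```

Run on the document above, this sketch reports two implied </p> events: the first when the second <p> starts, the second when </html> is reached.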

Note: There are many more such cases (li and p are just two of them; button, menuitem, ... the list is long and the rules are partially not so straightforward). I am not sure if you want to include them all (speaking from my own experience, that is quite an endeavour) or if the solution would be to be more tolerant in that direction.

Keep up the good work!

prettydiff commented 5 years ago

Good catch. I will work on unfinished <p> tags. The p and li tags are unique because they can exist as start tags with child nodes without needing a closing tag. Other tags like <br>, <hr>, and so forth are different in that they either do not contain child nodes or, if they do, require a closing tag.

Specifically with menuitem, it was removed in HTML 5.2, so it will not get any of the special treatment provided by HTML 5.

I will go through the tag list and ensure I am supporting things correctly.

FlorianRappl commented 5 years ago

Yes, it could be that the closing tag can be omitted (<option> is another example of this behavior), but that depends on the context, and you can construct examples where it's not allowed to omit the closing tag.

On the other hand, elements like <br> are not allowed to have closing tags. If they are written like <br></br>, it's an actual validation error. The self-closing (XML) notation (e.g., <img/>) is only allowed (tolerated) on these self-closing tags, but should actually be avoided. Other tags, e.g., <div>, cannot be self-closed at all (so <div/> is an error as well).
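The three cases can be summarized in a small sketch. The names VOID_ELEMENTS, Verdict and checkTag are hypothetical and not taken from sparser; the element list is the set of void elements from the HTML Standard, and foreign content (SVG, MathML), where self-closing syntax is valid, is deliberately ignored here.

```typescript
// Void elements per the HTML Standard: no content and no end tag allowed.
const VOID_ELEMENTS: ReadonlyArray<string> = [
  "area", "base", "br", "col", "embed", "hr", "img",
  "input", "link", "meta", "source", "track", "wbr",
];

type Verdict = "ok" | "tolerated" | "error";

// Hypothetical helper: judges a single tag such as "<br>", "</br>" or "<div/>".
function checkTag(tag: string): Verdict {
  const match = /^<(\/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*?(\/?)>$/.exec(tag);
  if (match === null) {
    throw new Error("not a tag: " + tag);
  }
  const isEndTag = match[1] === "/";
  const isVoid = VOID_ELEMENTS.indexOf(match[2].toLowerCase()) !== -1;
  const hasTrailingSlash = match[3] === "/";

  if (isEndTag) {
    // </br> is an error; </div> is fine.
    return isVoid ? "error" : "ok";
  }
  if (hasTrailingSlash) {
    // <img/> is tolerated; <div/> is an error.
    return isVoid ? "tolerated" : "error";
  }
  return "ok";
}
```

With this sketch, checkTag("</br>") and checkTag("<div/>") both yield "error", while checkTag("<img/>") yields "tolerated".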

prettydiff commented 5 years ago

These are the tags I am supporting as elements possibly containing child nodes without end tags: https://github.com/Unibeautify/sparser/blob/master/lexers/markup.ts#L20-L36

These are the test cases that validate the resolution: