SWI-Prolog / packages-sgml

The SWI-Prolog SGML/XML/HTML parser

Parser does not correctly parse empty comments #25

Closed thetrime closed 5 years ago

thetrime commented 5 years ago

Based on my reading of https://www.w3.org/TR/xml/#sec-comments it would seem that this is a valid (but empty) comment:

<!---->

The grammar only requires that we read 0 or more characters that contain no two consecutive hyphens, followed by two hyphens and a closing angle bracket, so an empty body is allowed.
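To make that reading concrete, here is a minimal sketch in C of a checker for the Comment production in the XML spec, `'<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'`. The function name `is_xml_comment` is made up for illustration and is not part of packages-sgml. It only enforces the hyphen rule and ignores the `Char` range restriction.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper, not part of packages-sgml.
   Returns true if s is exactly one XML comment according to
     Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'
   Only the hyphen rule is checked; the Char range is not. */
bool is_xml_comment(const char *s)
{
    assert(s != NULL);
    size_t n = strlen(s);

    /* Shortest possible comment is "<!---->" (7 characters). */
    if (n < 7 || strncmp(s, "<!--", 4) != 0 || strcmp(s + n - 3, "-->") != 0)
        return false;

    /* The body lies between the delimiters and may be empty. */
    for (size_t i = 4; i < n - 3; i++) {
        if (s[i] == '-') {
            /* A hyphen in the body must be followed by a non-hyphen
               character that is still part of the body. */
            if (i + 1 >= n - 3 || s[i + 1] == '-')
                return false;
            i++; /* consume the pair '-' (Char - '-') */
        }
    }
    return true;
}
```

Under this sketch `<!---->` is accepted because the body loop runs zero times, while `<!-- a--b -->` and `<!--a--->` are rejected.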

However, the SGML parser doesn't seem to work this way - instead, when parsing a document like

<!----><foo></foo>

it seems to think the comment is not closed:

?- open_string("<!----><foo></foo>", S), load_structure(S, Term, []).
ERROR: SGML2PL(sgml): []:1: Unexpected end-of-file in comment
S = $stream_reference('<stream>(0x7fa9cab6c5a0)'),
Term = [].

This can be corrected pretty easily if we go from S_CMTO directly to S_CMT after consuming a hyphen. Currently we go into an intermediate state, S_CMT1, which then immediately moves into S_CMT after consuming any character. For <!----> that means the third hyphen is swallowed as comment content, so only one hyphen is left before the > and the comment never closes. I'm not sure if there's a motivation for S_CMT1 or if it's just a bug.
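The transition described above can be sketched as a small standalone state machine in C. This is not the code from the parser: the `ST_*` states and `scan_comment` are hypothetical names modelled on the ones mentioned in this issue, and the end-of-comment states are my assumption about how a typical scanner would handle `--` followed by `>`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical states, named after those mentioned in the issue.
   This is an illustration, not the parser's actual code. */
typedef enum {
    ST_CMTO,   /* seen "<!-", expecting the second '-' */
    ST_CMT1,   /* seen "<!--", intermediate state that swallows one char */
    ST_CMT,    /* inside the comment body */
    ST_CMTE0,  /* seen one '-' inside the body */
    ST_CMTE1   /* seen "--", expecting '>' */
} cmt_state;

/* Scans s, which must start right after "<!-". Returns the number of
   characters consumed up to and including the closing '>', or 0 if the
   input ends before the comment is closed.
   via_cmt1 = true  : ST_CMTO -> ST_CMT1 -> ST_CMT (behaviour described above)
   via_cmt1 = false : ST_CMTO -> ST_CMT directly   (the proposed change) */
size_t scan_comment(const char *s, bool via_cmt1)
{
    assert(s != NULL);
    cmt_state st = ST_CMTO;

    for (size_t i = 0; s[i] != '\0'; i++) {
        char c = s[i];
        switch (st) {
        case ST_CMTO:
            if (c != '-')
                return 0;                 /* not a comment opener */
            st = via_cmt1 ? ST_CMT1 : ST_CMT;
            break;
        case ST_CMT1:
            st = ST_CMT;                  /* consumes any character */
            break;
        case ST_CMT:
            if (c == '-')
                st = ST_CMTE0;
            break;
        case ST_CMTE0:
            st = (c == '-') ? ST_CMTE1 : ST_CMT;
            break;
        case ST_CMTE1:
            if (c == '>')
                return i + 1;             /* comment closed */
            /* Lenient in this sketch: keep waiting for '>' after "--". */
            st = (c == '-') ? ST_CMTE1 : ST_CMT;
            break;
        }
    }
    return 0;                             /* end of input inside comment */
}
```

With the input `---><foo></foo>` (what follows `<!-` in the failing document), the `via_cmt1` variant runs to the end of the input without closing the comment, which corresponds to the "Unexpected end-of-file in comment" error, while the direct variant closes it after four characters. A non-empty comment such as `<!-- a -->` is closed at the same position by both variants.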

I'll provide a pull request shortly.