aantron / markup.ml

Error-recovering streaming HTML5 and XML parsers
http://aantron.github.io/markup.ml
MIT License
146 stars 16 forks source link

Added failing test for <rb> tag without surrounding <ruby> #56

Closed sagotch closed 3 years ago

sagotch commented 3 years ago

I have no idea of why this one raises an error, but I am pretty sure it should not.

No idea if the test is put in the right place either, feel free to rewrite it as you wish.

If you have an idea of where to look at to fix this, I can try, but I am completely lost right now.

aantron commented 3 years ago

The parser case for <rb> elements has a bug. It does not call its continuation mode' if a <ruby> element is not in scope:

https://github.com/aantron/markup.ml/blob/7e2adca8389543b8a7335a12c57bff38a797640a/src/html_parser.ml#L1972-L1981

It's basically a missing else, but I think the solution will be a bit more convoluted than that because of the spec and the use of CPS.

The result of not calling the continuation is that the asynchronous parser "hangs." The synchronous wrapper fails to see the parser terminate, and raises Markup.Synchronous.Not_synchronous.

I will add a fix tomorrow. The case for <rt> elements below this one also has the bug.

It also looks like the spec has changed so that <rp> should be processed like <rt> rather than like <rb> since I wrote the code, or else that had always been a bug in Markup.ml, too.

aantron commented 3 years ago

The linked commit makes the parser not hang, and behave according to the spec, as I understood it.

Note that <rb>a<rt>b is parsed as <rb>a<rt>b</rt></rb> rather than <ruby><rb>a</rb><rt>b</rt></ruby>. I don't know if the first parse makes any kind of sense, but it seems that the spec doesn't direct the parser to insert a <ruby> element if one is not present in the input at all, which then doesn't trigger generating implied end tags, which would close the <rb> tag before the <rt> start tag.

If you'd like to double check that, click here and use Ctrl/Mac+F to search for the text "rb". The case of interest is labeled A start tag whose tag name is one of: "rb", "rtc".

aantron commented 3 years ago

The rules for parsing fragments also don't seem to assume that the fragment's context is <ruby> if the fragment begins with <rb>:

sagotch commented 3 years ago

Thank you!