Broken HTML parsing - Githubissues

commonmark / commonmark-spec

CommonMark spec, with reference implementations in C and JavaScript

http://commonmark.org

Broken HTML parsing #597

Open fabiospampinato opened 5 years ago

fabiospampinato commented 5 years ago

I have the following 2 snippets:

<script>
    Foo

    Bar
</script>

<div>
  <script>
    Foo

    Bar
  </script>
</div>

Which are rendered respectively as:

<script>
    Foo

    Bar
</script>

<div>
  <script>
    Foo
<pre><code>Bar
</code></pre>
  </script>
</div>

On how to parse these, the spec says:

Start condition: line begins with the string <script, <pre, or <style (case-insensitive), followed by whitespace, the string >, or the end of the line. End condition: line contains an end tag </script>, </pre>, or </style> (case-insensitive; it need not match the start tag).

This makes sense to me, pre, style and script tags often contain empty lines and should be parsed correctly.
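A minimal sketch of the start and end conditions quoted above, written in Python rather than taken from either reference implementation. The names START_1, END_1, and type1_block_end are made up for illustration, and indentation handling is left out, so it only covers a tag at the very start of a line.

```python
import re

# Start condition: "<script", "<pre", or "<style" at the start of the line,
# followed by whitespace, ">", or the end of the line (case-insensitive).
START_1 = re.compile(r"^<(?:script|pre|style)(?:\s|>|$)", re.IGNORECASE)

# End condition: the line contains one of the three end tags anywhere
# (case-insensitive; it need not match the start tag).
END_1 = re.compile(r"</(?:script|pre|style)>", re.IGNORECASE)


def type1_block_end(lines, start=0):
    """Return the index of the last line of the HTML block that starts at
    lines[start], or None if that line does not meet the start condition."""
    if not START_1.match(lines[start]):
        return None
    for i in range(start, len(lines)):
        # Blank lines are not special here; only an end tag closes the block.
        if END_1.search(lines[i]):
            return i
    # No end tag found: the block runs to the end of the input.
    return len(lines) - 1


first_snippet = ["<script>", "    Foo", "", "    Bar", "</script>"]
print(type1_block_end(first_snippet))  # -> 4, the whole snippet is one block
```

Because the blank line is never tested, the first snippet stays a single block, which matches the rendering shown for it.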

For the second snippet, though, the spec says that because the snippet starts with <div>, the end condition for the entire thing is a blank line.

How does this make any sense?

Why shouldn't script's rule take precedence for the lines wrapped between <script> and </script>?

jgm commented 5 years ago

This tracker is for bug reports. Discussion and questions like this should go to the forum at talk.commonmark.org. Feel free to open a topic there, after searching for existing relevant discussions. (But I think if you just read the whole section on HTML blocks in the spec, you'll find your question answered.)

jgm commented 5 years ago

Oh, I see you're not asking why in general a blank line ends an HTML block, but why the inner script tag's rule doesn't take precedence. Well, if you like you can bring this up in the forum. The rule is fairly simple; it won't automatically do what a human would expect in every case. We could talk about specific ways to change it (but please, nothing that requires unlimited backtracking).

fabiospampinato commented 5 years ago

@jgm How is this not a bug report since script tags aren't parsed correctly?

"The rule is fairly simple; it won't automatically do what a human would expect in every case."

I agree that it's pretty simple, but since no human would parse HTML in their heads that way I would consider it broken. Plus there's a specific rule for parsing script, pre and style tags containing empty lines, but it breaks down pretty quickly.

I'm not familiar enough with parsers to make a detailed proposal about this, but roughly I'd say the rule for script, pre and style tags should just take precedence inside their blocks.

Can we please reopen this?

aidantwoods commented 5 years ago

@fabiospampinato As you noted, the HTML block started by <div> requires a blank line to end it, and HTML blocks are leaf blocks (they cannot contain other blocks), so the line <script> never starts an HTML block.
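To make the behaviour described above concrete, here is a rough Python sketch of a blank-line end condition applied to the second snippet. The function name blank_line_block_end is hypothetical, and the sketch only finds where the block stops; it is not a full parser.

```python
def blank_line_block_end(lines, start=0):
    """Return the index of the last line of a block that begins at
    lines[start] and ends just before the next blank line."""
    i = start
    # Nothing inside the block is re-examined as a possible block start,
    # so the inner <script> line is just another line of the <div> block.
    while i + 1 < len(lines) and lines[i + 1].strip() != "":
        i += 1
    return i


second_snippet = [
    "<div>",
    "  <script>",
    "    Foo",
    "",
    "    Bar",
    "  </script>",
    "</div>",
]

end = blank_line_block_end(second_snippet)
html_block = second_snippet[: end + 1]  # "<div>" through "    Foo"
rest = second_snippet[end + 2 :]        # what follows the blank line
print(end)      # -> 2
print(rest[0])  # "    Bar": four spaces of indentation, so an indented code block
```

The block ends after "    Foo", and the next non-blank line is indented four spaces, which is why Bar shows up wrapped in <pre><code> in the reported output.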

@jgm & @fabiospampinato one idea for making this parse more "intuitively" might be to reassess the start condition on lines that don't meet the end condition and adjust the HTML block type accordingly in some order of precedence. E.g. if a subsequent line of a type 6 or 7 HTML block (which don't have, let's say "intuitive", end conditions in some cases) could start a type 1 through 5 HTML block then the HTML block will change to the respective type. Another way to phrase this might be that HTML blocks of type 1 through 5 may interrupt a type 6 or 7 HTML block.
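A rough Python sketch of the interruption idea, limited to type 1 (script, pre, style) interrupting a blank-line-terminated block. This is an illustration of the proposal under stated assumptions, not spec text or reference-implementation code: the name html_block_end and the allow_interrupt flag are hypothetical, types 2 through 5 are omitted, and allowing up to three spaces of indentation before the interrupting tag is an assumption.

```python
import re

# Type 1 start condition, here allowing up to three leading spaces (assumption).
START_1 = re.compile(r"^ {0,3}<(?:script|pre|style)(?:\s|>|$)", re.IGNORECASE)
# Type 1 end condition: an end tag anywhere on the line.
END_1 = re.compile(r"</(?:script|pre|style)>", re.IGNORECASE)


def html_block_end(lines, start=0, allow_interrupt=True):
    """Return the index of the last line of an HTML block that begins at
    lines[start] with a blank-line end condition (type 6 or 7).

    With allow_interrupt=True, a later line meeting the type 1 start
    condition switches the block to the type 1 end condition."""
    in_type1 = False
    for i in range(start, len(lines)):
        line = lines[i]
        if in_type1:
            # After the switch, only an end tag closes the block.
            if END_1.search(line):
                return i
            continue
        if i > start and line.strip() == "":
            return i - 1  # current rule: a blank line ends the block
        if allow_interrupt and START_1.match(line):
            in_type1 = True
            if END_1.search(line):
                return i  # start and end tag on the same line
    return len(lines) - 1


second_snippet = [
    "<div>",
    "  <script>",
    "    Foo",
    "",
    "    Bar",
    "  </script>",
    "</div>",
]

print(html_block_end(second_snippet, allow_interrupt=False))  # -> 2
print(html_block_end(second_snippet, allow_interrupt=True))   # -> 5
```

With the interruption enabled, the block runs through the </script> line, leaving </div> to start a new HTML block of its own. The sketch is a single forward pass over the lines, so it needs no backtracking.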

@fabiospampinato if you open something over on the forums, we can discuss the pros and cons of, and ideas for, an adjusted HTML block definition that bears this example in mind.

fabiospampinato commented 5 years ago

so the line