commonmark / commonmark-spec

CommonMark spec, with reference implementations in C and JavaScript
http://commonmark.org

Should the whitespaces before backslash hard line break be removed? #724

Open chenzhiguang opened 2 years ago

chenzhiguang commented 2 years ago

In other words, should the backslash and the preceding whitespace be parsed together as a single hard line break, or should the preceding whitespace be left alone and only the backslash be parsed as the hard line break?

For example, should the following be parsed

a  \
b

into

a<br />b

or

a <br />b

The hard line break made of two or more spaces is clear: all the whitespace before the line ending represents the hard line break. But I couldn't find any mention of how whitespace before a backslash hard line break should be handled.
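To make the two readings concrete, here is a minimal Python sketch of how a parser might locate the hard-break marker at the end of a single line. Nothing here comes from the spec or from any existing library: `hard_break_marker` and its `absorb_spaces` flag are hypothetical names, offsets are half-open, and the sketch ignores code spans, raw HTML, and the rule that a hard break cannot end a block.

```python
from typing import Optional, Tuple


def hard_break_marker(line: str, absorb_spaces: bool) -> Optional[Tuple[str, int, int]]:
    """Return (kind, start, end) for a hard-break marker ending `line`, or None.

    `line` is one line of a paragraph WITHOUT its line ending.
    `absorb_spaces` (hypothetical flag) selects the reading under discussion:
    whether spaces before a backslash count as part of the marker.
    """
    stripped = line.rstrip(" ")
    trailing_spaces = len(line) - len(stripped)

    # Two or more trailing spaces: every trailing space belongs to the marker.
    if trailing_spaces >= 2:
        return ("spaces", len(stripped), len(line))

    if trailing_spaces == 0 and line.endswith("\\"):
        # Only an odd run of backslashes ends in an unescaped backslash;
        # an even run is just a sequence of escaped backslashes.
        run = len(line) - len(line.rstrip("\\"))
        if run % 2 == 1:
            start = len(line) - 1
            if absorb_spaces:
                # Pull the marker start back over the spaces before the backslash.
                start = len(line[:start].rstrip(" "))
            return ("backslash", start, len(line))

    return None
```

With `absorb_spaces=False`, `hard_break_marker("a   \\", False)` yields `("backslash", 4, 5)`; with `absorb_spaces=True` it yields `("backslash", 1, 5)`, the same span that four trailing spaces would produce.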

wooorm commented 2 years ago

As far as I am aware this is not explained anywhere in the spec. I don’t think it needs to be. In my mind, it’s similar to putting anything else at the end of a line, such as:

a  &amp;
b

->

<p>a  &amp;
b</p>

Or:

a  b
c

->

<p>a  b
c</p>

chenzhiguang commented 2 years ago

But it still matters in some situations, for example:

  1. A hard line break made of two or more spaces, with 4 spaces before the line ending:
a    
b

Output to AST:

[
  {
    "type": "paragraph",
    "start": {
      "line": 0,
      "column": 0,
      "offset": 0
    },
    "end": {
      "line": 1,
      "column": 1,
      "offset": 7
    },
    "children": [
      {
        "text": "a",
        "start": {
          "line": 0,
          "column": 0,
          "offset": 0
        },
        "end": {
          "line": 0,
          "column": 1,
          "offset": 1
        }
      },
      {
        "type": "hardLineBreak",
        "start": {
          "line": 0,
          "column": 1,
          "offset": 1
        },
        "end": {
          "line": 0,
          "column": 5,
          "offset": 5
        },
        "markers": [
          {
            "start": {
              "line": 0,
              "column": 1,
              "offset": 1
            },
            "end": {
              "line": 0,
              "column": 5,
              "offset": 5
            },
            "text": "    "
          }
        ]
      },
      {
        "text": "b",
        "start": {
          "line": 1,
          "column": 0,
          "offset": 6
        },
        "end": {
          "line": 1,
          "column": 1,
          "offset": 7
        }
      }
    ]
  }
]

Offsets 1 through 4 all fall inside the hardLineBreak marker.

  2. A backslash hard line break, with the backslash preceded by 3 spaces:
    a   \
    b

    If we do not count these preceding spaces as part of the hard line break, the AST output will be:

[
  {
    "type": "paragraph",
    "start": {
      "line": 0,
      "column": 0,
      "offset": 0
    },
    "end": {
      "line": 1,
      "column": 1,
      "offset": 7
    },
    "children": [
      {
        "text": "a   ",
        "start": {
          "line": 0,
          "column": 0,
          "offset": 0
        },
        "end": {
          "line": 0,
          "column": 4,
          "offset": 4
        }
      },
      {
        "type": "hardLineBreak",
        "start": {
          "line": 0,
          "column": 4,
          "offset": 4
        },
        "end": {
          "line": 0,
          "column": 5,
          "offset": 5
        },
        "markers": [
          {
            "start": {
              "line": 0,
              "column": 4,
              "offset": 4
            },
            "end": {
              "line": 0,
              "column": 5,
              "offset": 5
            },
            "text": "\\"
          }
        ]
      },
      {
        "text": "b",
        "start": {
          "line": 1,
          "column": 0,
          "offset": 6
        },
        "end": {
          "line": 1,
          "column": 1,
          "offset": 7
        }
      }
    ]
  }
]

This way, only offset 4 (the backslash itself) falls inside the hard line break marker.

There might be no difference when rendered to HTML, but in a Markdown editor it might matter.
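The difference between the two outputs can be reproduced with a small Python sketch. It is deliberately simplified: it tracks only offsets (no line or column), handles only a two-line paragraph whose first line ends in a backslash, and `paragraph_children` and `absorb_spaces` are hypothetical names rather than part of any real parser.

```python
from typing import Dict, List


def paragraph_children(source: str, absorb_spaces: bool) -> List[Dict]:
    """Build text / hardLineBreak / text nodes for "<text>\\<newline><text>".

    Offsets are half-open. When `absorb_spaces` is true, the spaces before
    the backslash move from the text node into the hard-break marker.
    """
    first, second = source.split("\n", 1)
    if not first.endswith("\\"):
        raise ValueError("first line must end with a backslash")

    slash = len(first) - 1                      # offset of the backslash
    text_end = len(first[:slash].rstrip(" "))   # end of the non-space text
    break_start = text_end if absorb_spaces else slash
    second_start = len(first) + 1               # skip the line ending

    return [
        {"text": source[:break_start], "start": 0, "end": break_start},
        {
            "type": "hardLineBreak",
            "start": break_start,
            "end": len(first),
            "marker": source[break_start:len(first)],
        },
        {"text": second, "start": second_start, "end": second_start + len(second)},
    ]
```

For the input `"a   \\\nb"`, `absorb_spaces=False` gives a first text node `"a   "` ending at offset 4 and a one-character marker, matching the AST above; `absorb_spaces=True` gives the text `"a"` and a marker spanning offsets 1 to 5.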

wooorm commented 2 years ago

If you have a problem with an AST, this is not the place to report it. This spec does not define ASTs.

chenzhiguang commented 2 years ago

I meant that I do not have a clear specification to follow when parsing the backslash hard line break into an AST. Whether or not the spaces before the backslash are counted as part of the hard line break will produce different ASTs.

wooorm commented 2 years ago

I think this is a problem in your AST, and unrelated to this specification. a) I believe the tool generating the AST should remove the whitespace: the text should be 'a', not 'a '. b) The tool generating markdown from the AST should prefer hard breaks with a backslash too: they’re clearer and have a higher chance of working now that CommonMark is basically everywhere, as editor tooling will typically remove trailing whitespace.

I don’t believe there is anything that has to happen in this project.

If you’re interested in AST tools that do generate such an AST, and AST tools that do serialize with backslashes, you might find my projects mdast, mdast-util-from-markdown, and mdast-util-to-markdown useful.
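Suggestions (a) and (b) above could look roughly like this on the serializing side. This is a hedged Python sketch only: `to_markdown` here is a hypothetical helper over plain dicts, not the API of mdast-util-to-markdown, and it assumes text nodes carry a `"text"` key and breaks carry `"type": "hardLineBreak"`.

```python
from typing import Dict, List


def to_markdown(children: List[Dict]) -> str:
    """Serialize inline nodes, always writing hard breaks as a backslash."""
    out: List[str] = []
    for node in children:
        if node.get("type") == "hardLineBreak":
            # (a) whitespace before the break carries no meaning, so drop it
            if out:
                out[-1] = out[-1].rstrip(" ")
            # (b) backslash form survives editors that trim trailing spaces
            out.append("\\\n")
        else:
            out.append(node["text"])
    return "".join(out)
```

Serializing the nodes for `a`, a hard break, and `b` produces `a\` followed by a line ending and `b`, regardless of whether the source used trailing spaces or a backslash.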

chenzhiguang commented 2 years ago

Thanks a lot! Yep, the trailing whitespace should always be removed unless there is a specific reason to keep it.

This is my project dart_markdown, a Markdown to AST parser, which is definitely inspired by your mdast.