commonmark / commonmark-spec

CommonMark spec, with reference implementations in C and JavaScript
http://commonmark.org

Non-alphanumeric with format is not properly parsed when connected to an alphanumeric string #773

Open tomerlichtash opened 3 months ago

tomerlichtash commented 3 months ago

A string in which formatted non-alphanumeric content is immediately followed by an alphanumeric character is tokenized as plain text nodes instead of the expected series of format nodes.

Problem reproduced on CommonMark online demo (to reproduce just paste **@**A there and compare with **@** A).

Example: All of these samples are tokenized as expected:

**@**@ => formatted non-alphanumeric + non-alphanumeric
@**@** => non-alphanumeric + formatted non-alphanumeric
@**A** => non-alphanumeric + formatted alphanumeric
**A** @ => formatted alphanumeric + space + non-alphanumeric

This sample, however, is tokenized as plain text nodes and the emphasis is not parsed: **@**A (formatted non-alphanumeric + alphanumeric)

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE document SYSTEM "CommonMark.dtd">

<document xmlns="http://commonmark.org/xml/1.0">
  <paragraph>
    <text>**</text>
    <text>@</text>
    <text>**</text>
    <text>A</text>
  </paragraph>
</document>

Add a space between the formatted non-alphanumeric and the alphanumeric, and compare the tokenization for the string **@** A:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE document SYSTEM "CommonMark.dtd">

<document xmlns="http://commonmark.org/xml/1.0">
  <paragraph>
    <strong>
      <text>@</text>
    </strong>
    <text> A</text>
  </paragraph>
</document>
jgm commented 3 months ago

Are you claiming that the parser doesn't properly implement the spec, or are you suggesting a change to the spec? If the latter, please examine the current rules and be specific about the change you'd recommend, recognizing that any change that "fixes" this case may break other things.

Unfortunately, the way commonmark / Markdown is designed, it is difficult to avoid some "blind spots" like this. See my essay Beyond Markdown, item 1. My project https://djot.net attempts to fix some of these issues.
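
For readers who want to check the current rules against the samples above, here is a minimal Python sketch of the spec's left-flanking and right-flanking tests for a `*` delimiter run. It is an illustration, not the reference implementation: the helper names are hypothetical, `str.isspace` and the Unicode P/S categories only approximate the spec's definitions of whitespace and punctuation, and the `_` rules are left out.

```python
import unicodedata

def is_whitespace(ch):
    # The start and end of the line count as whitespace for flanking purposes.
    # str.isspace is an approximation of the spec's "Unicode whitespace".
    return ch is None or ch.isspace()

def is_punctuation(ch):
    # Approximation of the spec's "Unicode punctuation character":
    # general categories P* (punctuation) and S* (symbols).
    return ch is not None and unicodedata.category(ch)[0] in ("P", "S")

def left_flanking(before, after):
    # A run can open emphasis only if it is left-flanking.
    if is_whitespace(after):
        return False
    if not is_punctuation(after):
        return True
    return is_whitespace(before) or is_punctuation(before)

def right_flanking(before, after):
    # A run can close emphasis only if it is right-flanking.
    if is_whitespace(before):
        return False
    if not is_punctuation(before):
        return True
    return is_whitespace(after) or is_punctuation(after)

def neighbours(text, run_start, run_len=2):
    # Characters immediately before and after the delimiter run (None at the edges).
    before = text[run_start - 1] if run_start > 0 else None
    end = run_start + run_len
    after = text[end] if end < len(text) else None
    return before, after

def can_open(text, run_start, run_len=2):
    return left_flanking(*neighbours(text, run_start, run_len))

def can_close(text, run_start, run_len=2):
    return right_flanking(*neighbours(text, run_start, run_len))
```

Under these assumptions, the second `**` in `**@**A` sits between a punctuation character and a letter, so it is not right-flanking and cannot close the strong emphasis, whereas in `**@** A` and `**@**@` it is followed by whitespace or punctuation and can. That matches the output reported in the issue.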