Emphasis with CJK punctuation

ptmkenny commented 4 years ago

Hi, I encountered some strange behavior when using CJK full-width punctuation and trying to add emphasis.

Example punctuation that causes this issue:

。！？、

To my mind, all of these should work as emphasis, but some do and some don't:

**テスト。**テスト

**テスト**。テスト

**テスト、**テスト

**テスト**、テスト

**テスト？**テスト

**テスト**？テスト

I'm not sure if this is the spec as intended, but in Japanese, as a general rule there are no spaces in sentences, which leads to the following kind of problem when parsing emphasis.

In English, this is emphasized as expected:

This is **what I wanted to do.** So I am going to do it.

But the same sentence emphasized in the same way in Japanese fails:

これは**私のやりたかったこと。**だからするの。

tats-u commented 7 months ago

This and the above issues are caused by the change in #618. It is mixed in only v0.30 spec.

https://spec.commonmark.org/0.30/changes

A left-flanking delimiter run is a delimiter run that is (1) not followed by Unicode whitespace, and either (2a) not followed by a Unicode punctuation character, or (2b) followed by a Unicode punctuation character and preceded by Unicode whitespace or a Unicode punctuation character. For purposes of this definition, the beginning and the end of the line count as Unicode whitespace.

A right-flanking delimiter run is a delimiter run that is (1) not preceded by Unicode whitespace, and either (2a) not preceded by a Unicode punctuation character, or (2b) preceded by a Unicode punctuation character and followed by Unicode whitespace or a Unicode punctuation character. For purposes of this definition, the beginning and the end of the line count as Unicode whitespace.

A single * character can open emphasis iff (if and only if) it is part of a left-flanking delimiter run.

A single * character can close emphasis iff it is part of a right-flanking delimiter run.

A double ** can open strong emphasis iff it is part of a left-flanking delimiter run.

A double ** can close strong emphasis iff it is part of a right-flanking delimiter run.

The definition of left- and -right-franking emphasis for * and ** must use ASCII punctuation characters instead of Unicode ones.

https://v1.mdxjs.com/

does not cause such problem, so remark depended by MDX v2+ is affected.

wooorm commented 7 months ago

Again, there is no change in 618. That PR is just about words, terminology.

MDX 1 did not follow CM correctly and had other bugs.

Can you please read what I say, and please stop spamming, and actually contribute?

tats-u commented 7 months ago

MDX 1 did not follow CM correctly and had other bugs.

The extension by MDX is not the culprit.

https://codesandbox.io/s/remark-playground-wmfor?file=/package.json

As of remark-parse v7, this problem is not reproduced, either.

https://prettier.io/playground/#N4Igxg9gdgLgprEAuEAqVhT00DTmg8qMHYMgZtGBSKoOGmgQAzrEl6A7EYM2xZIANCBAA4wCW0AzsqAEMATkIgB3AArCEfFAIA2YgQE8+LAEZCBYANZwYAZQEBbOABlOUOMgBmCnnA1bd+g222WA5shhCAro4gDsacPv6BPF7ycACKfhDwtvaBAFY8AB4GUbHxiUh28g4sAI65cBKibLIgAjwAtFZwACbNzCC+ApzyXgDCEMbGAsg18vJtkVCe0QCCML6c6n7wEnBCFlZJhYEAFjDG8gDq25zwPO5gcAYyJ5wAbifKw2A8aiC3AQCSUC2wBmBCnA402+BhgymimyKIDYogcBy0bGGMLgDiEt2sLEsqJgFQEnkGkMC7iEqOGgyEOia4igbRhlhgB04TRg22QAA4AAwsIRwUqcHm4-FDfLJFgwATqRnM1lIABMLD8DgAKhLZAUoXBjOpmi0mmYBJM-Hi4AAxCBCQZzLzDARLCAgAC+DqAA

Not reproduced in The latest Prettier (uses remark-parse v8), either.

That PR is just about words, terminology.

This means that the credit for the change goes to the fact that it turns to be clear that this specification is a terrible one that should be revised. Old remark-parse were based on an older ambiguous specification and consequently avoided this problem.

tats-u commented 7 months ago

https://spec.commonmark.org/0.29/

A punctuation character is an ASCII punctuation character or anything in the general Unicode categories Pc, Pd, Pe, Pf, Pi, Po, or Ps.

You are right. I'm sorry. I will look for another version.

tats-u commented 7 months ago

I finally found that the current broken definition sentences were introduced in 0.14.

https://spec.commonmark.org/0.14/changes

https://spec.commonmark.org/0.13/

I will investigate why these are introduced.

tats-u commented 7 months ago

https://github.com/commonmark/commonmark-spec/blob/0.14/changelog.spec.txt

Improved rules for emphasis and strong emphasis. This improves parsing of emphasis around punctuation. For background see http://talk.commonmark.org/t/903/6. The basic idea of the change is that if the delimiter is part of a delimiter clump that has punctuation to the left and a normal character (non-space, non-punctuation) to the right, it can only be an opener. If it has punctuation to the right and a normal character (non-space, non-punctuation) to the left, it can only be a closer. This handles cases like
**Gomphocarpus (*Gomphocarpus physocarpus*, syn. *Asclepias physocarpa*)**
and
**foo "*bar*" foo**

http://talk.commonmark.org/t/903/6

There are some good ideas here 4. It looks hairy, but if I understand correctly, basic idea is fairly simple:

Strings of * or _ are divided into “left flanking” and “right flanking,” based on two things: the character immediately before them and the character immediately after.

Left-flanking delimiters can open emphasis, right flanking can close, and non-flanking delimiters are just regular text.

A delimiter is left-flanking if the character to the left has a lower rank than the character to the right, according to the following ranking: spaces and newlines are 0, punctuation (unicode categories Pc, Pd, Ps, Pe, Pi, Pf, Po, Sc, Sk, Sm or So) is 1, the rest 2. And similarly a delimiter is right-flanking if the character to the left has a higher rank than the character to the right.

[!NOTE] I replaced the link with a cache by the Wayback machine.

I conclude that this problem was caused by a lack of consideration for Chinese and Japanese by @jgm and the author of vfmd(@roop or possibly @akavel).

tats-u commented 7 months ago

I would like to ask them why they included non-ASCII punctuation characters and why only ASCII punctuation characters are not sufficient.

tats-u commented 7 months ago

I will blame https://github.com/vfmd/vfmd-spec/blob/gh-pages/specification.md later.

The test cases in vfmd considered only ASCII punctuation.

https://github.com/vfmd/vfmd-test/blob/f74cf615198f788a99f14975cc14a59b1cd3b8fe/tests/span_level/emphasis/with_punctuation.md

tats-u commented 7 months ago

I found the commit containing the initial definition in the spec of vfmd:

https://github.com/vfmd/vfmd-spec/commit/7b53f05fa3f54fe993e8c63cbece55eda2e810c9

@roop seems to live in India, and this may be because he added non-ASCII punctuation characters, but the trouble is that I do not know Hindi at all. I wonder if a space is always adjacent to punctuation characters in that language like European ones.

vassudanagunta commented 7 months ago

@tats-u dude, here and in your comments on #618 you come off as arrogant and very disrespectful. You make absolutist claims and then frequently correcting yourself because it turns out you didn't do your homework. You need to have the humility to realize that your perception that "something broke or is broken" might have to do with you not understanding one or more of the following (I don't have the time to figure out which ones, the responsibility is on you):

your specific perspective, which may not be universal, which may miss the forest for the single tree that you are most focused on
the problem, if there actually is one, might be downstream of CommonMark, in the tool you are using
if CommonMark is involved:
- the facts, the history, or the priorities of CommonMark
- the impossible expectation that CommonMark can be all things to all people.
- the difficulty in maintaining a spec where many users expect it to work how they want it without understanding

A more reasoned, respectful and helpful approach would be to have a discussion with other people who are affected by what you claim is broken, including the makers and other users of the downstream tool that you claim is now broken. Diagnose the problem with them, assuming they agree with you that there is a problem, before making a claim that the source of the problem is upstream in CommonMark.

If it turns out that you are alone in this, that should tell you something.

wooorm commented 7 months ago

@tats-u This issue is still open, so indeed it is looking for a solution. It is also something I have heard from others.

However, it is not easy to solve. Many languages do use whitespace. No languages use only ASCII. Not using unicode would harm many users, too.

There are also legitimate cases where you do want to use an asterisk or underscore but don’t want it to result in emphasis/strong. Also in East-Asian languages.

One idea I have, that could potentially help emphasis/strong, is the Unicode line breaking algorithm: https://unicode.org/reports/tr14/. It has to be researched, but it might come up with line breaking points that are better indicators than solely relying on whitespace/punctuation. It might also be worse.

tats-u commented 7 months ago

@vassudanagunta I had got too much angry at that time. I do think it was over the limit now. ~~I wish GitHub would provide the draft comment feature out of box, and I could post many things at once without editing or additional ones.~~

the problem, if there actually is one, might be downstream of CommonMark, in the tool you are using

Let me say there are never in each framework. This problem can be reproduced in the most major JS Markdown frameworks, remark (unified) and markdown-it. Remark-related issues that I have raised are closed immediately with the reason that they are on spec.

the impossible expectation that CommonMark can be all things to all people.

I never have. This is why I have looked into the background and the impact of my proposed changes now.

the difficulty in maintaining a spec where many users expect it to work how they want it without understanding

It looks like a lot of work to study the impact of breaking changes and decide whether or not to apply them.

many users expect it to work how they want it without understanding

Due to this problem, it became necessary for me (us) to tell all Japanese (and some Chinese) Markdown writers to refrain from surrounding whole sentences with **, to use JSX , or to compromise with adding an extra space after the full-width punctuation marks 。 and ． if they are going to continue additional sentences.

<!-- What would you feel if Markdown would not recognize ** here as <strong> if you remove 4 or 5 spaces?   -->
**Don't surround the whole sentence with the double-asterisk without adding extra spaces!**      The Foobar language which is spoken by most CommonMark maintainers use as many as 6 spaces to split sentences.

the facts, the history

This is what I have looked into by digging through rummaging through the Git history, change logs, and test cases now.

the priorities of CommonMark

It is not surprising that maintainers and you lower the priority of this problem, since it does not affect any European language family, which puts space next to punctuation or parentheses. I had got angry because I assumed that Japanese and Chinese were not even seen as third-class citizens in the Markdown world due to the background of this problem. (The change causing this problem assumes that all languages puts space next to punctuation or parentheses)

If it turns out that you are alone in this, that should tell you something.

I clearly doubt this. You had better know many of users of specific languages (and they are not minor ones!) are (going to be) suffered by this problem.

@wooorm I apologize again at this time for my anger and for being too militant in my remarks.

My humble suggestions and comments on them:

Revert the concept of left- and right-flanking to prior to 0.14 (0.14 itself is not included)
- Old remark v8 used in Prettier, which is said to violate CM 0.14+ spec, correctly parses the cases presented in the change log in CM v0.14.
- I would like to know and have to investigate the impact of this change because it is a breaking change
Left- and right-flanking + ASCII punctuation (Unicode punctuation can be used in other parts)
- In addition to the issues you mentioned, the combination with link **[製品ほげ](./product-foo)**と**[製品ふが](./product-bar)**をお試しください still cannot be parsed as expected. Compromised solution
Left- and right-flanking + exclude Chinese- and Japanese-related punctuation from list
- Some users use ( ) “ ” without adjacent space. Compromised solution

Many languages do use whitespace.

I know. It is the background of this problem.

There are also legitimate cases where you do want to use an asterisk or underscore but don’t want it to result in emphasis/strong. Also in East-Asian languages.

I have looked for ones and their frequency. Escaping them does not modify the rendered content itself, but I have been disgusted of having to modify the content by adding extra space or to depend on the inline raw JSX tag () to avoid this problem, which puts the shackles on Markdown's expressive power.

Unicode line breaking algorithm

I will look into it later. (I do not expect it either)

Crissov commented 7 months ago

Checking the general Unicode categories Pc, Pd, Pe, Pf, Pi, Po and Ps, U+3001 Ideographic Comma and U+3002 Ideographic Full Stop are of course included in what Commonmark considers punctuation marks, which are all treated alike.

For its definitions of flanking, CM could start to handle Open/Start Ps (e.g. () and Initial Pi (“) differently than Close/End Pe ()) and Final Pf (”), and both differently than the rest of Connector Pc (_), Dash Pd (-) and Other Po. However, this could only (somewhat) help with brackets and quotation marks or in contexts where they are present, since the characters in question are all part of that last category Po, which is the largest and most diverse by far.

Possibly affected Examples are, for instance: 363, 367+368, 371+372, 376 and 392–394.

tats-u commented 7 months ago

@Crissov

Possibly affected Examples are, for instance: 363, 367+368, 371+372, 376 and 392–394.

I checked the raised test cases. 367 is most affected in them. I wonder how many Markdown writers use nested  for casual documents suitable for Markdown and whether if we can ask users to combine * and _ or use the raw  powered by MDX if they want to nest . CJK languages do not use italic. They use https://en.wikipedia.org/wiki/Emphasis_mark, brackets (「」), or quotes (“”) for emphasizing words. Emphasizing parens in that case may less natural for humans but is a simpler specification and easier to expect the behavior. Japanese and Chinese do not use _-related syntax because it has too many restrictions, so 371 does not matter. You can keep the current behavior on _. Other raised cases are not affected.

However, there are some ones not raised but more important. I am not convinced in the test case 378 (a**"foo"**\n→as is). We may as well treat ** in it as . It is popular to make text bold even in Chinese and Japanese and ** is used much more frequently than *. MDN says that  can be nested but does not say that  is also nested. It would be appreciated if the behavior of ** would be changed first. It is the highest priority for Chinese and Japanese.

handle Open/Start Ps (e.g. () and Initial Pi (“) differently than Close/End Pe ()) and Final Pf (”), and both differently than the rest of Connector Pc (_), Dash Pd (-) and Other Po.

Does it not mean that ** in 単語と**[単語と](word-and)**単語 is going to be treated as  by that change, does it?

FYI, as of https://hypestat.com/info/github.com, one in six visitors in GitHub live in China or Japan. This percentage would not be able to be ignored or underestimated.

wooorm commented 7 months ago

CJK languages do not use italic.

 elements have a default styling in HTML (italic), but you can change that. You can add 「」 before/after if you want, with CSS. Markdown does not dictate italic.

MDN says that  can be nested but does not say that  is also nested.

The “Permitted content: Phrasing content” bit allow it for both.

This percentage would not be able to be ignored or underestimated.

I don’t think anybody is underestimating that. You can’t ignore all existing markdown users either, though, and break them.

Practically, this is also open source, which implies that somebody has to do the work for free here, probably because they think it’s fun or important to do. And then folks working on markdown parsers need to do it too. To illustrate, GitHub hasn’t really done anything in the last 3 years (just security vulnerabilities / new fancy footnote footnotes feature).

jgm commented 7 months ago

Getting emphasis right in markdown (especially nested emphasis) is very difficult. Changing the existing rules without messing up cases that currently work is highly nontrivial.

For what it's worth, my rationalized syntax djot has simpler rules for emphasis, gives you what you want in the above Japanese example, and allows you to use braces to clarify nesting in cases where it's unclear, e.g. {*foo{*bar*}*}. It might be worth a look.

tats-u commented 7 months ago

 elements have a default styling in HTML (italic), but you can change that. You can add 「」 before/after if you want, with CSS.

This is technically possible but not practical or necessary. It is much easier and faster to type "「" & "」" from the keyboard directly, and you cannot copy these brackets in ::before and ::after from the text.

Markdown does not dictate italic.

Almost all description on Markdown for newbies including the following say that * is for italic.

I do not know of SaaSes in Japan that customize the style of .

The current behavior of CommonMark forces newbies in China or Japan to try to decipher its spec. It is for developers of Markdown parsers, not for users except for experts.

CommonMark has now grown to the point where it can manipulate the largest Markdown implementations (remark, markdonw-it, goldmark (used by Hugo), commonmarker (possibly used by GitHub), and so on) from behind the scenes. We may well lobby to revise its specification. (unenforceable of course though!)

It would not be difficult to create a new specification of Markdown, but is difficult to give sufficient power to it.

These are why I had tried to stop the left- and right-flanking, but I have found a convincing plan to recently.

We have only to change by my plan:

The definitions of (2a) & (2b) in the left- and right-flanking delimiter run
Example 352 & 379, which should not occur in English and other many languages that are not suffered by this problem, because a space is mostly adjacent to punctuation in them.

Getting emphasis right in markdown (especially nested emphasis) is very difficult. Changing the existing rules without messing up cases that currently work is highly nontrivial.

We do not have to change the other. I hope most Chinese and Japanese can be convinced by it. Also, you can continue to nest  and  in other than Chinese or Japanese as you can do today. (We rarely need that feature in these languages) This will not break almost all existing documents written without abusing the details of the spec.

I don’t think anybody is underestimating that.

I am a little relieved to hear that. I apologize for the misunderstanding.

You can’t ignore all existing markdown users either, though, and break them.

It would affect too many documents if the left- & right-flanking rule were abolished. However, the new plan will not affect on most existing documents except for ones that abuse the details of the spec. Do you mean that they are also included in "all existing" ones? For the first place, this feature is just an Easter egg. A little modification of that could be accepted. I would be appreciated if you could provide me some links to famous sites that describe on Markdown for intermediate level people and that mention the  &  nesting if you have time. I could not find one.

I suggest new terms "punctuation run preceded by space" & "puncuation run followed by space".

"... preceded ..." means: a sequence of Unicode punctuation characters preceded by Unicode whitespace
"... followed ..." means: a sequence of Unicode punctuation characters followed by Unicode whitespace

(2a) and (2b) is going to be changed like the following:

A left-flanking delimiter run is a delimiter run that is (1) not followed by Unicode whitespace, and either (2a) preceded by a Unicode whitespace, or (2b) not the first characters in puncuation run followed by space. For purposes of this definition, the beginning and the end of the line count as Unicode whitespace.
A right-flanking delimiter run is a delimiter run that is (1) not preceded by Unicode whitespace, and either (2a) followd by a Unicode whitespace, or (2b) not the last characters in puncuation run preceded by space. For purposes of this definition, the beginning and the end of the line count as Unicode whitespace.

This change treats punctuation characters that are not adjacent to space as normal letters. To see if the "**" works as intended, one need only check the nearest whitespace and the punctuation characters around it. It make it possible to parse all of the followings:

**これは太字になりません。**ご注意ください。

カッコに注意**（太字にならない）**文が続く場合に要警戒。

**[リンク](https://example.com)**も注意。（画像も同様）

先頭の**`コード`も注意。**

**末尾の`コード`**も注意。

Also, we can parse even the following English as intended:

You should write “John**'s**” instead.

We do not concatenate too many punctuation characters, so we do not have to search more than ten and some (e.g. 16) punctuation characters for space from the previous or next of the target delimiter run.

To check if the delimiter run is "the last characters in punctuation run preceded by space" (without using cache):

flowchart TD
    Next{"Is the<br>next character<br>an Unicode punctuation<br>chracter?"}
    Next--> |YES| F["<code>return false</code>"]
    Next--> |NO| Init["<code>current =</code><br>(previous character)<br><code>n =</code><br>(Length of delimiter run)"]
    Init--> Exceed{"<code>n >= 16</code>?"}
    Exceed--> |YES| F
    Exceed --> |NO| Previous{"What type is <code>current</code>?"}
    Previous --> |Not punctuation or space| F
    Previous --> |Space| T["<code>return true</code>"]
    Previous --> |Unicode punctuation| Iter["<code>n++<br>current =</code><br>(previous character)"]
    Iter --> Exceed

In the current spec, to non-advanced users especially in China or Japan, "*" and "**" sometimes appear to be abandoning its duties. We must not let non-advanced users write Markdown in fear of this hidden feature.

Crissov commented 5 months ago

0.31 changes the wording slightly, but as far as I can tell this does not change flanking behavior at all.

A Unicode punctuation character is …

old:

an [ASCII punctuation character] or anything in the general Unicode categories Pc, Pd, Pe, Pf, Pi, Po, or Ps.
new:

a character in the Unicode P (puncuation) or S (symbol) general categories.

tats-u commented 5 months ago

The change made the situation even worse. The following sentences are now unable to be parsed properly.

税込**¥10,000**で入手できます。

正解は**④**です。

The few improvements are only that it is easier to explain the condition to beginners (we can now use the single word “symbols”) and more consistent with ASCII punctuation characters.

jgm commented 5 months ago

This particular change was not intended to address this issue; it was just intended to make things more consistent.

@tats-u I am sorry, I have not yet had time to give your proposal proper consideration.

tats-u commented 5 months ago

This particular change was not intended to address this issue; it was just intended to make things more consistent.

I guess it, but as a result it did cause a breaking change and break some documents (much less than ones affected by 0.14 though), which is a kind of regressions you have mostly feared and cared about. This change will be the basis for determining what kind of breaking changes will be acceptable in the future.

For the first place, we cannot easily access to convincing and practical examples that describe how legitimate controversial parts of specifications and changes are; we can easily find only ones that are designed only for testing and do not have meaning (e.g. *$*a. and *$*alpha.)

What is needed is like:

Price: **€**10 per month (note: you cannot pay in US$!)

I have not yet had time to give your proposal proper consideration.

FYI you do not have evaluate how optimize the algorithm in the above flowchart; it is too naive and can be optimized. All I want you to do first is to evaluate how acceptable breaking changes brought by my revision are. It might be better for me to make a PoC to make it easy to do it.

jgm commented 5 months ago

To be honest, I didn't anticipate these breaking changes, and I would have thought twice about the change if I had.

Having a parser to play with that implements your idea would make it easier to see what its consequences would be. (Ideally, a minimally altered cmark or commonmark.js.) It's also important to have a plan that can be implemented without significantly degrading the parser's performance. But my guess is that if it's just a check that has to be run once for each delimiter + punctuation run, it should be okay.

wooorm commented 5 months ago

it did cause a breaking change and break some documents

Do you have links? Or is this theoretical?

tats-u commented 5 months ago

Do you have links? Or is this theoretical?

Not yet but it will be difficult to find ones because most search engines ignore symbols. For the first place they will not account for a high percentage of those that have been affected since before.

I have to answer "theoretical" at least as of now. I can just raise some meaningful examples.

FYI it would be more desirable for meaningful but longer examples to appear in an content for advanced users rather than in the specifications.

tats-u commented 5 months ago

@jgm It looks like I can now suggest an optimized algorithm or a PoC without worry. I hope you will be patient.

jgm commented 5 months ago

One kind of text to think about is this:

Here is boldface text---**set off with em dashes**---and nested.

This is something we should certainly handle. Note that the punctuation --- does not end in a space. Ellipses are other examples of punctuation that doesn't always end in space, perhaps there are others.

jgm commented 5 months ago

I think the above text will work okay with @tats-u 's plan, but if you nest it, it won't:

*Here is some text---*and a nested emphasized text inside em dashes*---and the end of the text.*

But breaking this would be worth it if it's the only way to fix emphasis for CJK users.

tats-u commented 5 months ago

@jgm I forgot the existence of the em dash (3 or more consecutive hyphens or U+2014). However, I have a feeling that there are few other similar symbols that satisfy all of the following conditions:

Used in languages that splits words by a space and have a culture of nesting emphasis relatively frequently
Not adjacent to a space unlike other many punctuations
(Split a sentence into blocks that can be treated as equivalents to independent sentences)

I wonder if there is another one in English and at least other Indo-European languages.

yuin commented 5 months ago

goldmark author here. It seems impossible to perfect the emphasis notation where the start and end symbols are the same(I think this is one of the reasons why @jgm started the djot)

RestructuredText avoids this kind of issues by an 'escaped' space. This seems a neat way for markdown too.

太郎は\ **「こんにちわ」**\ といった

A half space preceding \ is not rendered, So above markdown will be rendered as

<p>太郎は<strong>「こんにちわ」</strong>と言った</p>

goldmark implements this markup as CJK extension(of course you can use this extension in Hugo).

https://yuin.github.io/goldmark/playground/?m=%25E5%25A4%25AA%25E9%2583%258E%25E3%2581%25AF%255C%2520**%25E3%2580%258C%25E3%2581%2593%25E3%2582%2593%25E3%2581%25AB%25E3%2581%25A1%25E3%2582%258F%25E3%2580%258D**%255C%2520%25E3%2581%25A8%25E8%25A8%2580%25E3%2581%25A3%25E3%2581%259F&o=128&v=v1.7.0

This markup is easy to implement in CommonMark parsers. You know, this will breaks compatibilities, but I suspect few.

jgm commented 5 months ago

~~The problem is that an escaped space already has a meaning in Markdown.~~ (EDIT: Sorry: it has a meaning in pandoc's Markdown, where it's used to write a nonbreaking space, but not in commonmark.)

One option for CJK, of course, is just to add a line break around the emphasized section:

太郎は
**「こんにちわ」**
といった

and then process with -f markdown+ignore_line_breaks. But I imagine that would be unpopular.

EDIT: Sorry, this is also pandoc-markdown specific.

jgm commented 5 months ago

Another approach that works already:

太郎は&#x200B;**「こんにちわ」**&#x200B;といった

200B is a zero-width space. Of course you can also just insert the unicode zero-width space instead of typing , and then the source will look clean:

太郎は**「こんにちわ」**といった

yuin commented 5 months ago

Zero-width space approach is well known workaround for this kind of issues. But It is too ugly as a Markdown(that is aimed at achieving readability as a plaintext).

An escaped space is also ugly, but readable enough as a plaintext IMHO.

rxliuli commented 5 months ago

Zero-width space approach is well known workaround for this kind of issues. But It is too ugly as a Markdown(that is aimed at achieving readability as a plaintext).

An escaped space is also ugly, but readable enough as a plaintext IMHO.

Yes, zero-width space is completely unacceptable as an existing alternative solution, even worse than extra whitespace.

tats-u commented 5 months ago

One option for CJK, of course, is just to add a line break around the emphasized section:

You should not at least as of now. Browsers other than Firefox have a bug to insert a extra space per newline like there, which is incompatible with web-platform-tests.

You may want to check the test cases for those Chrome & Safari fail in https://wpt.fyi/results/css/css-text/line-breaking?label=master&label=experimental&aligned&q=segment-break-transformation-rules-

200B is a zero-width space. Of course you can also just insert the unicode zero-width space instead of typing , and then the source will look clean:

It is longer than  and embeds extra characters in generated HTML. Zero-width characters are difficult to input or to be checked whether inserted if directly input.

the source will look clean:

Zero-width characters are really dirty. The fact that someone just looking at the source cannot notice their existence shows that it is a bad way. They are not indented for input without combined with other characters by non-advanced users. For the first place, why are non-advanced users forced to remember such random numbers?

Although ZWSP is suitable in many cases in Japanese and Chinese, but in Koren, Thai, and other languages, U+2060 WJ can be more appropriate in some cases. You must not force users to choose either properly. Most users never know how to choose proper one. Korean uses a space, but it is for clauses, not words (e.g. **C#**을 추천합니다.; this sentence is translated using DeepL). There is no all-in-one appropriate invisible character for this purpose.

\ * is shorter than , and \ ** is much shorter than . They allow every people to add and even nest  &  everywhere intended without raw tags or JSX. We will not have to rely on or look into the current relatively complex rule deeply when trying to nest  anymore.

\ can be retained as-is unless adjacent to * (or _) for better backward compatibility.

I suspect few.

I agree with you. I doubt if there were one other than very mean test cases.

wooorm commented 5 months ago

Of course you can also just insert the unicode zero-width space instead of typing , and then the source will look clean: — @jgm https://github.com/commonmark/commonmark-spec/issues/650#issuecomment-1934597399

You can’t on the dingus / github currently. It doesn’t form strong. I think that behavior is per-spec. The punctuation characters for the character reference (& and ;) are important.

You should not at least as of now. Browsers other than Firefox have a bug to insert a extra space per newline like there, which is incompatible with web-platform-tests. — @tats-u https://github.com/commonmark/commonmark-spec/issues/650#issuecomment-1935830649

I think you missed the sentence after, where @jgm said and use -f markdown+ignore_line_breaks. I don’t particularly like this either though, as this would not work on GitHub or similar places that show markdown.

An escaped space is also ugly, but readable enough as a plaintext IMHO. — @yuin https://github.com/commonmark/commonmark-spec/issues/650#issuecomment-1935547776

The downside of it is that I think it’s quite likely markdown already exists that contains \ in the wild.

\ can be retained as-is unless adjacent to * (or _) for better backward compatibility.

— @tats-u https://github.com/commonmark/commonmark-spec/issues/650#issuecomment-1935830649

While being more compatible with existing markdown is nice, this is hard to implement in parsers the way they are set up currently, and I think it’s a bit difficult to explain that you can only do it around * and _.

At that point, perhaps we could make whether fences can open or close explicit with [ and ]? Such as a[**[*b*]**]c -> abc.

yuin commented 5 months ago

At that point, perhaps we could make whether fences can open or close explicit with [ and ]? Such as a[**[*b*]**]c -> abc .

I felt awkward a little for this approach:

It is hard to distinguish an emphasis from a link. e.g. [**[*b*]**](http://www.example.com)
This approach seems like the djot. To me, it looks like another markup language rather than Markdown.

The reason I wrote "an escaped space is a neat way" is:

It is very easy to implement in CommonMark parsers. We simply skip a character if the half-width space is escaped, just like any other escapeable character. It means we do not have to modify emphasis part(that is one of the most complicated spec).
It is just a workaround. We should use this notation only if the rendering does not work as expected. It is still Markdown feeling remaining IMHO.

The downside of it is that I think it’s quite likely markdown already exists that contains \ in the wild.

Absolutely exists. but I suspect few(I'm not sure). For most context, wild \ is a code and it should be an inline code like `\` means following character is escaped. Of course wild \ not surrounding by ` exists.... but I suspect few.

tats-u commented 4 months ago

I think you missed the sentence after, where @jgm said and use -f markdown+ignore_line_breaks.

Sorry I overlooked that condition.

I don’t particularly like this either though, as this would not work on GitHub or similar places that show markdown.

I agree with you completely including your reason.

a[**[*b*]**]c

We have to deal with a link whose label is long and wrapped completely with emphasis.

We're here. Let's see if `[` was the beginning of link or extended emphasis.
  ↓   ---------------------------------------------------------------↓ Left paren appeared! It was the beginning of a link!
[**sooooo looooong link label much longer than Picasso's full name**](link-address)

this is hard to implement in parsers the way they are set up currently,

*\ looks more difficult than \ * for optimized parsers because the former requires a peek operation of 2 characters (checking flankingness requires one). I regret to say that I cannot say a more detailed opinion because I have not been an expert of Markdown state machines.

you can only do it around * and _.

We can add ~~ for GFM, too. (If GFM authors approve) I think it is easier than to explain the flankingness. We can explain this \ as just an escape pod when *, _, or ~ do not work as intended.

The downside of it is that I think it’s quite likely markdown already exists that contains \ in the wild.

I wonder if there is one other than those that should be included in `...` or ```...``` ,or ASCII arts (ones similar as ¯\_(ツ)_/¯). And I believe the number of them is less than those affected by 0.31's change.

jgm commented 4 months ago

At that point, perhaps we could make whether fences can open or close explicit with [ and ]? Such as a[**[*b*]**]c -> abc.

Well, that's something like what I do in djot, but using { and }.

As for \, what would make me reluctant is that \ already means "nonbreaking space" in pandoc and perhaps other markdown variants. This is long-established and widespread, and it seems a bad idea to introduce an incompatible use. Otherwise the idea is a good one.

Here's another thought: if spaces aren't used in CJK, then we could simply ignore spaces adjacent to CJK characters. So you could write

太郎は **「こんにちわ」** といった

and it would turn into

太郎は<strong>「こんにちわ」</strong>といった

wooorm commented 4 months ago

*\ looks more difficult than \ * for optimized parsers because the former requires a peek operation of 2 characters

It would means looking at arbitrary characters: escapes can be escaped too: \\ *, \\\ *, \\\\ *, etc.

We can add ~~ for GFM, too. (If GFM authors approve)

That’s not something we need to choose here.

I think it is easier than to explain the flankingness.

The \ technique still has to do with flankingness, that doesn’t disappear. Nowhere else in CM are there characters that disappear in certain cases but not others.

wonder if there is one other than those that should be included in `...` or ```...```,or ASCII arts (ones similar as ¯\_(ツ)_/¯).

Whether users should wrap things in code is unrelated to CM. CM doesn’t tell you what to put into code. It does currently say that \ follows by non-ascii-punctuation is just a \.

CJK

If we’re detecting CJK characters, could we change something in whether runs can open/close based on whitespace/punctuation/CJK? Make it super permissible if we see CJK?

wooorm commented 4 months ago

What should be classified as CJK? What about https://linguistics.stackexchange.com/questions/6131/is-there-a-long-list-of-languages-whose-writing-systems-dont-use-spaces

wooorm commented 4 months ago

OK, this works well for the OP examples. And it passed all my 2k+ tests in micromark, including CM’s tests.

CJK:

2e80-2eff // CJK Radicals Supplement
2f00-2fdf // Kangxi Radicals
3040-309f // Hiragana
30a0-30ff // Katakana
3100-312f // Bopomofo
3200-32ff // Enclosed CJK Letters and Months
3400-4dbf // CJK Unified Ideographs Extension A
4e00-9fff // CJK Unified Ideographs
f900-faff // CJK Compatibility Ideographs
3000-303f // https://en.wikipedia.org/wiki/CJK_Symbols_and_Punctuation
FF00-FFEE //  Halfwidth and Fullwidth Forms

Runs can open if the character before them is CJK and close if the character after is CJK.

But it would be nice to have more examples of from the CJK-speaking community of how things should work!

wooorm commented 4 months ago

If folks want to play with this, see the commit references above. Checkout the cjk branch on micromark/micromark, on a modern Node (16+) run npm install && npm test. Then change example.js if you want and run node example.js I added the examples in this thread in example.js and they all form strong. Which I think is the goal.

@yuin @jgm how viable is this in your parsers and tests and experience? Works well for me.

yuin commented 4 months ago

@jgm @wooorm

Here's another thought: if spaces aren't used in CJK, then we could simply ignore spaces adjacent to CJK characters. So you could write

Unfortunately... CJK is not so simple:

We Japanese uses English alphabets in sentenses: 日本語のテキストにEnglish text（英語のテキスト）を埋め込む。(note that parenthesis is '（'(full width), not '(' half width. But we also uses ( it depends)
- This assumption(Japanese is supposed to use spaces around English alphabets) is causing a promblem in CSS-text-3 segmentation break specification: https://github.com/w3c/csswg-drafts/issues/5086
Korean uses spaces: https://www.90daykorean.com/korean-spacing/

As for \ , what would make me reluctant is that \ already means "nonbreaking space" in pandoc and perhaps other markdown variants. This is long-established and widespread, and it seems a bad idea to introduce an incompatible use.

I felt this opinion contains a little bias as a pandoc author. For now, I believe CommonMark is most widely used markdown specifications(as we aimed to). In CommonMark, \ has no special meanings.

e.g. : https://forum.obsidian.md/t/escaped-space-as-non-breaking-space-in-viewing-mode/61254
- \ as nonbreaking space requested, but it is refused by This is not part of the commonmark spec

I've tested your opinion with Babelmark3: Result

a\ \ b is rendered as a\ \ b: 28 implementations
a\ \ b is rendered as a  b : 1 implementation(pandoc)

So your opinion seems biased.

tats-u commented 4 months ago

It would means looking at arbitrary characters: escapes can be escaped too: \\ *, \\\ *, \\\\ *, etc.

Have even numbers of \'s already been accepted as pure sequences of escaped \'s as of `? \ and will be treated as the same way by a followingwhen considering's flankingness. If follows an not-yet-escaped`, it is now allowed to check if it is followed by * like the way * checks its flankingness.

CJK:

We might to have to treat Korean as

Here's another thought: if spaces aren't used in CJK, then we could simply ignore spaces adjacent to CJK characters. So you could write

This should be done by independent plugins, not by CommonMark HQ itself. We sometime use (, ) (note: ASCII), “, or ” (note: ambiguous with; i.e. full width in CJ(K?) fonts) without a surrounding space.

jgm commented 4 months ago

@woorm's suggestion is nice and simple.

The only objection that has been raised is that CJK writing sometimes contains Latin characters too. Can somebody give us a real-world test case where @woorm's proposal would fail because of that, so we can think about it more concretely?

tats-u commented 4 months ago

[*...*] can be interpreted as a combination of a footnote label [...] and an emphasis. Some Markdown extensions e.g. remark-directive (:::warning[Admonition Title]) and GitHub-style admonition are built under the footnote label. I wonder if someone (not a few people) has surround the entire text of a label with * or _.

jgm commented 4 months ago

Can we focus on @woorm's suggestion? "Runs can open if the character before them is CJK and close if the character after is CJK." This is an extremely simple and intuitive solution (unlike solutions involving escaped spaces, brackets, braces, and the like). It has the large advantage that there is nothing new users will have to learn; it will just behave the way they intuitively expect. If there are good arguments against it, I'd like to see them clearly stated.

ptmkenny commented 4 months ago

I like @wooorm's solution.

In this post, I will collect all the examples in this thread to make it easier for people who are doing testing.

My original examples:

**テスト。**テスト

**テスト**。テスト

**テスト、**テスト

**テスト**、テスト

**テスト？**テスト

**テスト**？テスト

Examples from @tats-u:

**これは太字になりません。**ご注意ください。

カッコに注意**（太字にならない）**文が続く場合に要警戒。

**[リンク](https://example.com)**も注意。（画像も同様）

先頭の**`コード`も注意。**

**末尾の`コード`**も注意。

Example from @yuin:

太郎は**「こんにちわ」**といった

Additional examples from me to test common cases related to mixing English and Japanese:

// Test single-width punctuation mixed with double-byte characters.

太郎は**"こんにちわ"**といった

太郎は**こんにちわ**といった

// Ensure single English words in Japanese can be emphasized.

太郎は**「Hello」**といった

太郎は**"Hello"**といった

太郎は**Hello**といった

// Ensure English phrases in Japanese can be emphasized

太郎は**「Oh my god」**といった

太郎は**"Oh my god"**といった

太郎は**Oh my god**といった

tats-u commented 4 months ago

The most famous Japanese Markdown resource is Qiita, and you can get raw Markdown by adding .md after the URL of articles. However, it would be difficult to search the suggested cases because it drops symbols on search.

"Runs can open if the character before them is CJK and close if the character after is CJK."

Sorry I overlooked it. It seems difficult to find exceptions (i.e. ones that cannot be covered by this draft revision), So it is reasonable for a springboard. Possibly we might have to check the next next characters, but I can never be sure of it.

I prepared for an additional test case, but it is covered by that draft too:

**C#**や**F#**は**「.NET」**というプラットフォーム上で動作します。

Also we will have to add characters in some other Asian languages (e.g. Thai) later. It should be noted that the number of native GitHub users of those languages is smaller than that of CJK, so it is difficult for requests to add them to the list (CJK chracters) to come to the surface. We might have to come up with a name that covers more than CJK instead of CJK.

Should open- and close-only emphasis syntax like \ *...*\ be discussed in another independent issue?

tats-u commented 4 months ago

It might be better if we can add number digits to "CJK characters", too.

Call the number **(000)**0000-0000.
The price is ***€*100** now! (note: not US$ or £!)

commonmark / commonmark-spec

Emphasis with CJK punctuation #650