Emphasis with CJK punctuation

ptmkenny commented 4 years ago

Hi, I encountered some strange behavior when using CJK full-width punctuation and trying to add emphasis.

Example punctuation that causes this issue:

。！？、

To my mind, all of these should work as emphasis, but some do and some don't:

**テスト。**テスト

**テスト**。テスト

**テスト、**テスト

**テスト**、テスト

**テスト？**テスト

**テスト**？テスト

I'm not sure if this is the spec as intended, but in Japanese, as a general rule there are no spaces in sentences, which leads to the following kind of problem when parsing emphasis.

In English, this is emphasized as expected:

This is **what I wanted to do.** So I am going to do it.

But the same sentence emphasized in the same way in Japanese fails:

これは**私のやりたかったこと。**だからするの。

rxliuli commented 7 months ago

It might be better if we can add number digits to "CJK characters", too.
Call the number **(000)**0000-0000.
The price is ***€*100** now! (note: not US$ or £!)

Maybe another issue? This problem seems to be encountered much less often.

tats-u commented 7 months ago

@rxliuli

This problem seems to be encountered much less often.

Correct, this is just "It would be better if we add them together".

ptmkenny commented 7 months ago

@rxliuli I think it's reasonable to aim to support numbers in CJK to the same extent they are supported in English.

It's important to test both single-byte and double-byte numbers. For example, these should be handled:

IDが**001**になります。
IDが**００１**になります。
IDが**001号**になります。
IDが**００１号**になります。

tats-u commented 7 months ago

It's important to test both single-byte and double-byte numbers.

Note: double-width numbers are in FF00-FFEE // Halfwidth and Fullwidth Forms.

We have not added some minor ideographs and hentaigana (super very minor even in daily life; I have only seen those on billboards or hanafuda cards (akayoroshi).);

U+1B000–U+1B16F (Kana Supplement & Kana Extended-A & Small Kana Extended) (lower priority)
U+20000–U+3FFFF (All characters in Supplementary Ideographic Plane & Tertiary Ideographic Plane)
U+E0100–U+E01EF (Ideographic Variation Sequence) (forgot to add)

rxliuli commented 7 months ago

@rxliuli

This problem seems to be encountered much less often.

Correct, this is just "It would be better if we add them together".

Of course, I'm just worried that this issue requires too much support and will fall into stagnation again. After all, this problem has existed for too long.

yuin commented 7 months ago

@wooorm 's approach may not work for like Go**「初心者」**を対象とした記事です。 .

As I've wrote

It seems impossible to perfect the emphasis notation where the start and end symbols are the same

It is not a perfect, but makes better for CJK.
And it is not 'This is an extremely simple' for CommonMark parser authors because a character definition that should ignore surrounding space is complicated.

However, as we saw it looks like CommonMark owners will not to added new notation like an escaped space, wooorm's approach seems the only possible CommonMark spec solution for CJK .

goldmark already has a character definition which should ignore surrounding spaces: https://github.com/yuin/goldmark/blob/master/util/util_cjk.go#L192

This definition is based on W3C 'Space-Discarding Unicode Characters', so it should work to some extent for non-CJK languages that have same trouble with CJK (I'm not sure). Note that this definition most likely inperfect(There's probably an ongoing discussion somewhere).

By the way, If this kind of approach is acceptable, we may have a chance to resolve 'unnecessary space' problem in CJK. CJK users often see 'unnecessary spaces' because newline is rendered as a space. goldmark CJK extension removes a newline based on above 'Space-Discarding Unicode Characters' list.

jgm commented 7 months ago

we may have a chance to resolve 'unnecessary space' problem in CJK.

Discussion of that should probably go in a separate issue. Pandoc also has an option to ignore newlines, as well as an option to ignore space adjacent to CJK characters. But these things don't require a change to the parser; they can be implemented as renderer options.

yuin commented 7 months ago

Discussion of that should probably go in a separate issue.

I agree. So I avoid discussing the details, 'they can be implemented as renderer options' but it is not in the CommonMark spec, so many library does not implement this. If this in the CommonMark spec, many library will implement it. In addition, pandoc implementation is too naive that has a lot of missing cases.

tats-u commented 7 months ago

Go**「初心者」**を対象とした記事です。

This it it.

Runs can open if the character before them is CJK

~~Runs can open if the character at least one of before them, after them (=punctuation itself), or after after them (=after puctuation) is CJK~~

Runs can open if the character at least one of before them, at least one punctuation of the punctuation sequence (=punctuation itself), or after the punctuation sequence (=after puctuation) is CJK

↑Changing to this increases the coverage range, but parsers will be more complex because they are required to be able to remember multiple previous chracters and peek multiple characters after.

wooorm commented 7 months ago

Replies to questions from earlier proposals:

Have even numbers of \'s already been accepted […] — @tats-u

No, escapes or not do not affect the emphasis/strong algo currently. You can see this with character references such as   (which would be a space) or a (which would be an a): those “would be” character are not used. The & and ; instead matter. This behavior may sound useless, but such behavior is what allows users to work around this very issue, for languages that do not use whitespace for example.

[*...*] can be interpreted as a combination of a footnote label [...] and an emphasis. — @tats-u

Those are not footnote labels. GFM, which includes footnotes, does not use display the value in footnote labels, which are written as [^a]. Anything after ^ and before ] is dropped. And it cannot include whitespace. And yes, extensions would have to deal with that, but I don’t worry about directives or footnotes!

Re numbers:

Correct, this is just "It would be better if we add them together". — @tats-u

The downside is that there is a reason for all this detection: people also use _ and * in words and want to see them. I get that theoretically someone might want **(000)**0000-0000. and ***€*100** to be bold. But also like... I can also come up with cases where people want * in a word (n*2=4 or so) We don’t want to go too far and see everything as a boundary.

I think it's reasonable to aim to support numbers in CJK to the same extent they are supported in English. @ptmkenny

These examples of numbers are supported by my proposal, as they are surrounded by CJK. They’re at boundaries already.

Re space-discarding characters:

This definition is based on W3C 'Space-Discarding Unicode Characters', so it should work to some extent for non-CJK languages that have same trouble with CJK (I'm not sure). Note that this definition most likely inperfect(There's probably an ongoing discussion somewhere). @yuin

This definitions has been removed from CSS. Do you have insights on why it was removed, and what replaced it?

Runs can open if the character at least one of before them, at least one punctuation of the punctuation sequence (=punctuation itself), or after the punctuation sequence (=after puctuation) is CJK — @tats-u

I don’t understand this sentence. I do not know what you mean. Can you please rephrase it?

wooorm commented 7 months ago

And re CJK around line endings: I don’t think this needs to be discussed here, now. And, if it was discussed, I probably think it has no place in CM. Proof of that is that CSS is solving it. Because in HTML too people use line endings to hard wrap text. It doesn’t require a change to the HTML spec. But to CSS.

yuin commented 7 months ago

@wooorm

(This is the last reply about line endings. If necessary, let's set up a separate issue for discussion)

If I understand current situation correctly, you misunderstood a little. CSS do not resolve CJK line ending issues for now. This kind of Segment Break Transformation Rules is not implemented in chronium and most browsers.

Ideally, we hope many browsers implement it. But we need a practical solution. From you westerners point of view, "Please be patient, it will be implemented someday", but often "someday" about this kind of non-westerner problems is the distant future or never coming(like this emphasis issue).

But I do not expect that this kind of things defined in CommonMark. So I've implemented this as an extension for my library. I only meant to touch up this topic lightly.

Sorry for the confusion :pray: We got sidetracked, so let's get back to the main subject.

This definitions has been removed from CSS. Do you have insights on why it was removed, and what replaced it?

I do not know details about this. But considering above situation, I think that they seems almost gave up this sort of approach. There are many threads left open about this on github. e.g. : https://github.com/w3c/csswg-drafts/issues/5017

How big of a world we live in, so many languages exist, 150,000 characters defined in Unicode, this kind of 'if XXX character is before...' approach seems thorny path.

However, in the discussions so far this kind of approach seems the only possible CommonMark spec solution for CJK . We CJK people would be happy if this kind of solution become CommonMark spec rather than nothing happen about this issue.

I obviously don't know all language in the world, but I think goldmark's character list will be helpful for this kind of approach.

jgm commented 7 months ago

@tats-u - I assume this was meant to be a counterexample to the original proposal?

Go**「初心者」**を対象とした記事です。

What about this?

CJK characters, including letters, digits, and symbols and punctuation, are all treated as equivalent to letters for purposes of determining flankingness.

With that change your example should be handled fine. You'd only run into problems if you used non-CJK punctuation, e.g.

Go**"初心者"**を対象とした記事です。

wooorm commented 7 months ago

I feel like you misread my comment. The work happens in CSS and not HTML. That’s my point. For the rest of your comments, I join you in wanting to burn the imperialist west down.

I do not know details about this

I tracked it down and feedback from CJK users showed that the algorithm was not good.

OK, made a table of characters that are currently being discussed.

basic list	w3c	Unicode Range	Description	Current classification
x	x	2e80-2eff	CJK Radicals Supplement	punctuation
x	x	2f00-2fdf	Kangxi Radicals	punctuation
	x	2FF0-2FFF	eographic Description Characters	punctuation
x	x	3000-303F	JK Symbols and Punctuation	mixed punctuation and other
x	x	3040-309f	Hiragana	mostly other; some punctuation
x	x	30a0-30ff	Katakana	other
x		3100-312f	Bopomofo	other
	x	3130-318F	Kanbun	other
	x	3190-319F	Kanbun	mostly other; some punctuation
	x	31C0-31EF	CJK Strokes	punctuation
	x	31F0-31FF	Katakana Phonetic Extensions	other
x		3200-32ff	Enclosed CJK Letters and Months	mostly punctuation; some other
	x	3300-33FF	CJK Compatibility	punctuation
x	x	3400-4dbf	CJK Unified Ideographs Extension A	other
x	x	4e00-9fff	CJK Unified Ideographs	other
	x	A000-A48F	Yi Syllables	other
	x	A490-A4CF	Yi Radicals	punctuation
x	x	f900-faff	CJK Compatibility Ideographs	other
	x	FE10-FE1F	Vertical Forms	punctuation
	x	FE30-FE4F	CJK Compatibility Forms	punctuation
	x	FE50-FE6F	Small Form Variants	punctuation
x	x	FF00-FFEE	Halfwidth and Fullwidth Forms	mixed punctuation and other
	x	1B000-1B0FF	Kana Supplement	other
	x	1B100-1B12F	Kana Extended-A	other
	x	1B130-1B16F	Small Kana Extension	other
	x	20000-2A6DF	CJK Unified Ideographs Extension B	mostly other; some punctuation
	x	2A700-2B73F	CJK Unified Ideographs Extension C	mostly other; some punctuation
	x	2B740-2B81F	CJK Unified Ideographs Extension D	other
	x	2B820-2CEAF	CJK Unified Ideographs Extension E	other
	x	2CEB0-2EBEF	CJK Unified Ideographs Extension F	other
	x	2F800-2FA1F	CJK Compatibility Ideographs Supplement	other
	x	30000-3134F	CJK Unified Ideographs Extension G	mostly other; some punctuation

wooorm commented 7 months ago

And a table of examples.

Question for CJK speakers: can you come up with real sentences that contain these characters, and figure out a test case that currently does not work with https://spec.commonmark.org/dingus/

Unicode Range	Description	Example	Solved with proposal?
2e80-2eff	CJK Radicals Supplement	`テスト。テスト`	Yes
2f00-2fdf	Kangxi Radicals
2FF0-2FFF	Ideographic Description Characters
3000-303F	JK Symbols and Punctuation	`太郎は「こんにちわ」といった`	Yes
3040-309f	Hiragana	`[リンク](https://example.com)も注意。`, `先頭の`コード`も注意。`	Yes
30a0-30ff	Katakana
3100-312f	Bopomofo
3130-318F	Kanbun
3190-319F	Kanbun
31C0-31EF	CJK Strokes
31F0-31FF	Katakana Phonetic Extensions
3200-32ff	Enclosed CJK Letters and Months
3300-33FF	CJK Compatibility	`これは100㌍です。`	No
3400-4dbf	CJK Unified Ideographs Extension A
4e00-9fff	CJK Unified Ideographs
A000-A48F	Yi Syllables
A490-A4CF	Yi Radicals
f900-faff	CJK Compatibility Ideographs
FE10-FE1F	Vertical Forms
FE30-FE4F	CJK Compatibility Forms
FE50-FE6F	Small Form Variants
FF00-FFEE	Halfwidth and Fullwidth Forms	`テスト？テスト`, `カッコに注意（太字にならない）文が続く場合に要警戒。`	Yes
1B000-1B0FF	Kana Supplement
1B100-1B12F	Kana Extended-A
1B130-1B16F	Small Kana Extension
20000-2A6DF	CJK Unified Ideographs Extension B
2A700-2B73F	CJK Unified Ideographs Extension C
2B740-2B81F	CJK Unified Ideographs Extension D
2B820-2CEAF	CJK Unified Ideographs Extension E
2CEB0-2EBEF	CJK Unified Ideographs Extension F
2F800-2FA1F	CJK Compatibility Ideographs Supplement
30000-3134F	CJK Unified Ideographs Extension G

yuin commented 7 months ago

Great wrap-up.

@jgm 's example Go**"初心者"**を対象とした記事です。 is not working in commonmark dingus. Where should be it in your table?

jgm commented 7 months ago

@yuin is this a realistic example? Would people use " in this context? I suppose another similar case would use a yen symbol after the **.

wooorm commented 7 months ago

No that example doesn’t get solved by this approach or CM currently. Yes, as far as I understand it’s a real use case: folks should use CJK quotes but from what I read in Unicode more and more people are sometimes a bit more lax and just use easy to type characters.

reiterating: I am looking for more examples. Please, CJK users, look at the above table and see if you can come up with sentences that contain those characters and things that could be emphasized :)

yuin commented 7 months ago

@jgm Yes. We also use half-width symbols without spaces(sciences, tech/technical people prefer halfwidth symbols). However, I think halfwidth symbols with emphasis is a rare case, so I think it can be ignored rather than nothing happens for this issue.

yuin commented 7 months ago

Usually, we use words without character codes in mind, so it's quite difficult...

CJK Radicals Supplement: Seems invalid(not CJK Radicals Supplement). **⻲田太郎**と申します(⻲ is it)
Hiragana: Also contains Katakana: コード is a katakana
Vertical Forms: Can not use in CommonMark(I believe). We sometimes write in 'vertical', but CommonMark only consider 'horizontal' forms.
Katakana Phonetic Extensions: almost never use it
Enclosed CJK Letters and Months: ・**㋐**:選択肢１つ目 ㋐ is it
CJK Compatibility: これは100**㌍**です。 ㌍ is it.
- ㌍ is a shorthand for the word カロリー. 3 characters in 1 character.
Small Form Variants: almost never use it(I think)
Halfwidth and Fullwidth Forms: Seems invalid. It does not contain Halfwidth. これは**ﾃｽﾄ**です。 ﾃｽﾄ is halfwidth.
Kana Supplement, Kana Extended-A, Small Kana Extension: It used somewhere, but I Nothing's coming to mind.

It probably works for noticed(already listed) categories(at least Japanese). The Hard parts may be

mixed with other language alphabets and symbols
Other languages we haven't noticed (e.g. Thai?)

As I mentioned earlier, It would be happy if this kind of solution become CommonMark spec for Japanese! For hard(and rare) parts, we can use HTML tags.

rxliuli commented 7 months ago

The main issue with Chinese text is the occurrence of symbols, such as the common symbols in Chinese listed below. If they appear before the right side of bold or italic text, it causes rendering errors.

example	render
`真，她`	真，她
`真。她`	真。她
`真、她`	真、她
`真；她`	真；她
`真：她`	真：她
`真？她`	真？她
`真！她`	真！她
`真“她`	真“她
`真”她`	真”她
`真‘她`	真‘她
`真’她`	真’她
`真（她`	真（她
`真）她`	真）她
`真【她`	真【她
`真】她`	真】她
`真《她`	真《她
`真》她`	真》她
`真—她`	真—她
`真～她`	真～她
`真…她`	真…她
`真·她`	真·她
`真〃她`	真〃她
`真-她`	真-她
`真々她`	真々她
`真她`	真她

There are some workarounds, but they are not intuitive, and without checking the rendering result, it would not be possible to know.

example	render
`真，她`	真，她
`真，她`	真，她
`真，&ZeroWidthSpace;她`	真，她

Comma (，): Used for a brief pause within a sentence. Period (。): Used to end a sentence. Enumeration comma (、): Used between parallel terms. Semicolon (；): Used to separate clauses within a complex sentence, indicating a pause longer than a comma. Colon (：): Used to introduce explanations, reasons, results, etc. Question mark (？): Used at the end of interrogative sentences. Exclamation mark (！): Used to express strong emotions. Quotation marks (“” and ‘’): Used to directly quote someone's words or indicate specific nouns. Parentheses (（） and 【】): Used to insert supplementary explanations or comments. Book title marks (《》): Used for the titles of books, articles, etc. Dash (——): Used for emphasis or explanation. Tilde (～): Indicates range or fluctuation. Ellipsis (……): Indicates interruption or omission of speech. Interpunct (·): Used between names or to separate parts. Emphasis mark (〃): Used to emphasize a certain word or phrase. Hyphen (-): Used to connect words. Ditto mark (々): Indicates the immediate repetition of the preceding character or word.

yuin commented 7 months ago

@rxliuli

All your example seems resolved by the propsal :) .

My Note:

My thoughts about nesting. Some nesting still works in different way:

this is **foo* bar.* baz

これは**ほげ*ふが。*です

<p>this is <em><em>foo</em> bar.</em> baz</p>
<p>これは**ほげ<em>ふが。</em>です</p>

I tried many different ways(inserting spaces anywhere), It is hard to make same emphasis as English. But honestly say, I have never seen nested emphasis in production. So I don't think this is a problem.

wooorm commented 7 months ago

@yuin

we use words without character codes in mind, so it's quite difficult...

If you look up the label on Wikipedia, so “CJK Radicals Supplement” or so, you can see the character there.

Can you come up with:

a non-punctuation character before **⻲ in **⻲田太郎**と申します (CJK radicals)
a non-punctuation character instead of ・ in ・**㋐**:選択肢１つ目 (Enclosed CJK Letters and Months) Then it currently doesn’t form, but will form.

Vertical Forms: Can not use in CommonMark(I believe). We sometimes write in 'vertical', but CommonMark only consider 'horizontal' forms.

Why not? I read LTR and RTL languages and those are just bytes, why would things that finally end up being displayed vertically not work?

The Hard parts may be

mixed with other language alphabets and symbols Other languages we haven't noticed (e.g. Thai?)

Yep, I’ll search for Burmese / Myanmar, Dzongkha, Javanese, Khmer, Lao, Tai Lue, Thai, and Tibetan speakers next :)

As I mentioned earlier, It would be happy if this kind of solution become CommonMark spec for Japanese! For hard(and rare) parts, we can use HTML tags.

Yes, working on it! But first I want to get some good examples. So that we know what doesn’t work but should realisticly and reasonably work.

wooorm commented 7 months ago

@rxliuli Thanks! Some of these are indeed covered by this proposal. And have examples already (3000-303F: JK Symbols and Punctuation and FF00-FFEE: Halfwidth and Fullwidth Forms). Some others, such as — U+2018 (‘), U+2019 (’), U+201c (“), U+201d (”) are not covered. Because they are not specific to CJK. As far as I understand they shouldn’t (much) be used in CJK languages, as there are better alternatives? Some of these examples look theoretical to my eye. For **真“**她 for example, that looks like someone opened a quote but didn’t close it. Why would someone put emphasis around a broken quote? I am really looking for well-formed sentences where the emphasis makes sense: where the emphasis is around a “whole unit”.

rxliuli commented 7 months ago

Some others, such as — U+2018 (‘), U+2019 (’), U+201c (“), U+201d (”) are not covered. Because they are not specific to CJK. As far as I understand they shouldn’t (much) be used in CJK languages, as there are better alternatives?

From what I understand, these symbols are often used in Chinese, and there really aren't alternative symbols. As mentioned above, they are used to express spoken content or quote others, which is very common.

Some of these examples look theoretical to my eye. For 真“她 for example, that looks like someone opened a quote but didn’t close it. Why would someone put emphasis around a broken quote? I am really looking for well-formed sentences where the emphasis makes sense: where the emphasis is around a “whole unit”.

Yes, "真“她" is not a good example; it would generally be "真”她", but it's just mentioned in passing, after all, they should be some of the symbols adjacent to each other in Unicode, right?

If you're looking for real-world examples, then a Chinese novel I previously maintained has 2711 occurrences of this, which is very common. The solution I'm currently using is to add extra spaces. But as mentioned above, this is not intuitive for maintainers.

Regex Search \*\*.*?[，。、；：？！“”‘’（）【】《》—～…·〃-々]\*\*

https://github.com/search?q=repo%3Aliuli-moe%2Fto-the-stars+%2F%5C*%5C*.*%3F%5B%EF%BC%8C%E3%80%82%E3%80%81%EF%BC%9B%EF%BC%9A%EF%BC%9F%EF%BC%81%E2%80%9C%E2%80%9D%E2%80%98%E2%80%99%EF%BC%88%EF%BC%89%E3%80%90%E3%80%91%E3%80%8A%E3%80%8B%E2%80%94%EF%BD%9E%E2%80%A6%C2%B7%E3%80%83-%E3%80%85%5D%5C*%5C*+%2F&type=code

yuin commented 7 months ago

@wooorm

Can you come up with: a non-punctuation character before ⻲ in ⻲田太郎と申します (CJK radicals) a non-punctuation character instead of ・ in ・㋐**:選択肢１つ目 (Enclosed CJK Letters and Months) Then it currently doesn’t form, but will form.

私は**⻲田太郎**と申します (CJK radicals)
選択肢**㋐**: 1つ目の選択肢 (Enclosed CJK Letters and Months)

Why not? I read LTR and RTL languages and those are just bytes, why would things that finally end up being displayed vertically not work?

That makes sense. I tried it and it works! But it can not be rendered correctly on Github, I've used the dingus: Link

**さようなら︙**と太郎はいった。 : ︙ is a Vertical Forms( of …).

ArcticLampyrid commented 7 months ago

Runs can open if the character before them is CJK and close if the character after is CJK.

Originally posted by @wooorm in https://github.com/commonmark/commonmark-spec/issues/650#issuecomment-1937082248

I approve this proposal. This will resolve a lot of problems for Chinese users.

tats-u commented 7 months ago

I don’t understand this sentence. I do not know what you mean. Can you please rephrase it?

description

However, parsers will become more complex and the explainability of the algorithm for those other than library authors will be decreased.

Also, this still cannot fix the following example:

.NET**（.NET Frameworkは不可）**では、

n*2=4

This made me give up for numbers. I thought they should escape * though.

No, escapes or not do not affect the emphasis/strong algo currently.

I confirmed in dingus. Theoretically we can extend HTML escape like &noop; and replace it with an empty string, which looks as bad as ZWSP though.

Would people use " in this context?

“ & ” (U+201C–201D) are much commoner than ". If we type " key (Shift + 2 in a Japanese keyboard) in keyboard when a Japanese IME is on, we get ” (U+201D).

Kana Supplement

Here is Kana Supplement example (requires IPA mj Mincho to show):

あ𛀙**（か）**よろし

hanafuda-akayoroshi

↑Image source: https://www.tengudo.jp/blog/karuta-news/323.html (Japanese)

As far as I understand they shouldn’t (much) be used in CJK languages,

These have caused much trouble in macOS. I have created a Rust library to handle them in HFS+. Most WIndows and Linux users do not have the opportunity to handle them.

CJK Radicals Supplement

I produced the following example that was likely to contain them using Google Japanese Input:

「禰󠄀」の偏は示ではなく**礻**です。

However, 示 & 礻 here are normal kanjis (CJK Unified Ideograph). I wonder how normal Japanese input them from romaji.

https://www.benricho.org/moji_conv/14-kanji_koukibusyucheck/ (Japanese)

As far as I understand they shouldn’t (much) be used in CJK languages, as there are better alternatives?

Sometimes appear in books unrelated to computer. No alternatives. FYI I prefer 「」 to “” and 『』 to ‘’, but someone tell “” from 「」.

I approve this proposal. This will resolve a lot of problems for Chinese users.

Some text does not work as intended:

Git**（注：不是GitHub）**

It seems impossible to fix this issue completely in a simple way. Depends on how much we can compromise.

**⻲田太郎**と申します

亀 (U+4E80) is so much commoner here. I doubt that someone use U+2EF2.

tats-u commented 7 months ago

**さようなら︙**と太郎はいった。 : ︙ is ...

We can convert from … to that character thanks to CSS in vertical writing mode:

https://codepen.io/tats-u/pen/oNVmNvY

I think we do not have to consider such characters dedicated to vertical writing while we use CSS to render text.

shepherdwind commented 5 months ago

I have tracked this issue all the way, and many people have raised it. I encountered it when using outline as well. I understand that the fundamental reason for this unexpected rendering is that the logic for determining whether ** can be closed is incorrect.

For example, in the case of **aa+**, it is recognized as a strong tag, but in the case of **aa+**a, it directly outputs **. The difference lies in whether there is whitespace or a character at the end. If there's whitespace, then it can be closed.

The fundamental problem is that the parsing process does not consider the previous state. I think a more reasonable approach would be: if there's an open `before, then encountering another` later should always allow closing. Based on this principle, I also implemented a markdown-it plugin at https://github.com/shepherdwind/match-pairs . With this plugin, most similar issues can be resolved and all old markdown-it test cases pass.

ArcticLampyrid commented 5 months ago

I understand that the fundamental reason for this unexpected rendering is that the logic for determining whether ** can be closed is incorrect.

No, the fundamental reason is that the spec is designed to supports nested emphasis, and then introduces a adaptive method to detect whether ** is used for opening tag or closing tag.

Take a look at **aa+**a****, it should render as aa+a. Yes, a  tag in another  tag.

I don't think this is an important feature, but the specification does.

if there's an open before, then encountering another later should always allow closing.

Then nested emphasis won't be supported any more.

shepherdwind commented 5 months ago

Thank you for the reminder. I checked the spec document and indeed it supports nesting. I think we can adjust this part of the logic as follows:

If there's an open ** before, then encountering another ** later, we need to check if there is a closable tag after this **. If so, keep it unchanged; otherwise, this ** should be closable.

Just add another condition to ensure compatibility with the old logic. I think this should work. I'll try it out and add some commonmark test cases.

yuin commented 5 months ago

No, the fundamental reason is that the spec is designed to supports nested emphasis, and then introduces a adaptive method to detect whether ** is used for opening tag or closing tag.

Let me add to that "adaptive" method is designed as a heuristic (currently) only based on some space separated western languages. This method consists of 17 rules. It is very hard to realize these rules for writers when their markup is not rendered as expected. Take the matter further, original Markdown converts

**aa+**a -> aa+a
太郎は**「こんにちわ」**といった。 -> 太郎は「こんにちわ」といった。
**test*aaa*** -> testaaa : original Markdown supports nested emphasis in some cases

This means emphasis in CommonMark is incompatible with original Markdown, and this issue was introduced by CommonMark.

ArcticLampyrid commented 5 months ago

It is very hard to realize these rules for writers when their markup is not rendered as expected.

I agree. It's worse that CJK did not use space or any other char to split words.

If only some special uses are required to handle, I would advice to introduce escape sequences.

But if most CJK users must use escape sequences in almost every time for emphasis, it's unacceptable. For this issue, it's required to make changes other than introducing escape sequences.

For me, dropping support of nested emphasis is more acceptable than I must use escape sequences or HTML tags aaa every time.

tats-u commented 5 months ago

the fundamental reason is that the spec is designed to supports nested emphasis

This is why this problem can't be completely solved. We have to have to compromise to some extent.

I decided to prioritize the candidates of solutions; the easiest and safest one is to check only if either of the nearest characters to ** (or *) are CJK.

adjacent-chars adjacent-chars2

↑ We don't have to check other characters than these two.

I'll put other plans/ideas on ice because they're less efficient.

We can treat all characters in U+20000–U+3FFFF as CJK because both planes are reserved for minor han characters.

Here is an example for Ideographs Extension G: 𰻞𰻞**（ビャンビャン）**麺. However H is also available now so I'm concerned about such work may be endless.

Block	Example	Remarks
Katakana	`ハイパーテキストコーヒーポット制御プロトコル(HTCPCP)`
CJK Compatibility Ideographs	`﨑(崎)`	"﨑" is an unofficial form but sometimes used in last names instead of "崎"
CJK Unified Ideographs	`国際規格[ECMA-262](https://tc39.es/ecma262/)`
CJK Unified Ideograph Extensions A	`㐧(第の俗字)`	Unofficial form
CJK Unified Ideograph Extensions B	`𠮟(こちらが正式表記)`
CJK Unified Ideograph Extensions C	`𪜈(トモの合略仮名)`	Mixed-in uncommon joined katakana
CJK Unified Ideograph Extensions D	`𫠉(馬の俗字)`	Unofficial form
CJK Unified Ideograph Extensions E	`谺𬤲(こだま)石神社`	shrine
CJK Unified Ideograph Extensions F	`石𮧟(いしただら)`	address

jgm commented 5 months ago

This is a very complex discussion. I think something along the lines of what @tats-u suggests is a practical and minimalist solution. As I understand it, the idea would be that a delimiter run, e.g. **, would be treated as both left- and right-flanking if either of the surrounding characters is CJK. We'd need to specify which code points count as CJK.

Is that correct @tats-u ? If so, let me know what a sensible CJK code point range is and I could try implementing this in cmark to see how it works.

wooorm commented 5 months ago

Posting https://github.com/commonmark/commonmark-spec/issues/650#issuecomment-1939190815 again. I was still waiting for more examples from speakers of various languages.

tats-u commented 5 months ago

@jgm

both left- and right-flanking if either of the surrounding characters is CJK. We'd need to specify which code points count as CJK.

Your recognition is correct. We may be able to enlarge the range to *, too.

We may have to add more complex rules to deal with cases that can't be handled with this rule in the future, but we haven't known their severity until now.

tats-u commented 5 months ago

Some Unicode blocks are too minor to be used on a daily basis, so we can use tentative examples using characters codepoints like 𠮟**(U+20B9F)**. (Parenthesis will be better if ASCII)

Kana Supplement - not used in normal Japanese
CJK Unified Ideograph Extensions D - only unofficial forms
(Kana Extended-A) - only billboard or hanafuda
Small Kana Extension - I wonder where they're used

Yi characters aren't used by Mandarin. It'll be hard to find people who use them on a daily basis in GitHub.

jgm commented 5 months ago

Can you give me the code point ranges you think we should be treating as CJK? (the actual numbers, not just names)

rxliuli commented 5 months ago

Can you give me the code point ranges you think we should be treating as CJK? (the actual numbers, not just names)

CJK Unified Ideographs Basic Block: U+4E00 - U+9FFF
CJK Unified Ideographs Extension A: U+3400 - U+4DBF
CJK Unified Ideographs Extension B: U+20000 - U+2A6DF
CJK Unified Ideographs Extension C: U+2A700 - U+2B73F
CJK Unified Ideographs Extension D: U+2B740 - U+2B81F
CJK Compatibility Ideographs: U+F900 - U+FAFF
CJK Compatibility Ideographs Supplement: U+2F800 - U+2FA1F

I generally use /[\u4E00-\u9FFF]/ to determine if it's a CJK character, but let's hear what @tats-u has to say about it.

ArcticLampyrid commented 5 months ago

BTW, I wonder which is better to maintain, allowlist or denylist?

Enable the adaptive rule only if it's around western languages, or just disable it only if it's around CJK chars?

ArcticLampyrid commented 5 months ago

IMO, I prefer "Enable the adaptive rule only if it's around western languages", bacause:

Most languages that split words by space have a fixed alphabet, it's easy to write non-outdated rules.
As I know, nested emphasis is used very rarely. So it's better to have intention to disable it unless it's sure that we are in expected environment.
Some languages may not use spaces to split words too, but they cannot make such volume like CJK users.

jgm commented 5 months ago

Implemented this in cmark to test the idea:

  left_flanking = numdelims > 0 && !cmark_utf8proc_is_space(after_char) &&
                  ((!cmark_utf8proc_is_punctuation_or_symbol(after_char) ||
                   cmark_utf8proc_is_space(before_char) ||
                    cmark_utf8proc_is_punctuation_or_symbol(before_char)) ||
                   (cmark_utf8proc_is_CJK(before_char) ||
                    cmark_utf8proc_is_CJK(after_char)));
  right_flanking = numdelims > 0 && !cmark_utf8proc_is_space(before_char) &&
                   ((!cmark_utf8proc_is_punctuation_or_symbol(before_char)
                     || cmark_utf8proc_is_space(after_char) ||
                     cmark_utf8proc_is_punctuation_or_symbol(after_char)) ||
                    (cmark_utf8proc_is_CJK(before_char) ||
                     cmark_utf8proc_is_CJK(after_char)));

I used the ranges specified by @rxliuli above for is_CJK.

All current tests pass. I used the following new tests, based on the @tats-u 's comment above:

1. ハイパーテキストコーヒーポット制御プロトコル**（HTCPCP）**
2. 﨑**（崎）**
3. 国際規格**[ECMA-262](https://tc39.es/ecma262/)**
4. 㐧**（第の俗字）** (Unofficial form)
5. 𠮟**（こちらが正式表記）**
6. 𪜈**（トモの合略仮名）** (Mixed-in uncommon joined katakana)
7. 𫠉**（馬の俗字）** (Unofficial form)
8. 谺𬤲**（こだま）**石神社 (shrine)
9. 石𮧟**（いしただら）** (address)

And I got these results:

<ol>
<li>ハイパーテキストコーヒーポット制御プロトコル**（HTCPCP）**</li>
<li>﨑<strong>（崎）</strong></li>
<li>国際規格<strong><a href="https://tc39.es/ecma262/">ECMA-262</a></strong></li>
<li>㐧**（第の俗字）** (Unofficial form)</li>
<li>𠮟<strong>（こちらが正式表記）</strong></li>
<li>𪜈<strong>（トモの合略仮名）</strong> (Mixed-in uncommon joined katakana)</li>
<li>𫠉<strong>（馬の俗字）</strong> (Unofficial form)</li>
<li>谺𬤲**（こだま）**石神社 (shrine)</li>
<li>石𮧟**（いしただら）** (address)</li>
</ol>

1, 4, 8, and 9 are not emphasized because the wide left parenthesis (U+FF08) is not in the ranges @rxliuli specified above. Maybe those ranges are not sufficient?

tats-u commented 5 months ago

@jgm Could you adopt only strings in the inline codes? And you can change the parentheses (U+FF08 & U+FF09) to ASCII ones to take only target characters into account.

https://github.com/commonmark/commonmark-spec/issues/650#issuecomment-1939237484

All Unicode blocks in the table in the above comment at least must be treated as CJK.

I fixed the table.

Crissov commented 5 months ago

Is there no sufficient Unicode character property (combination) that could be used instead of listing blocks or ranges?

ArcticLampyrid commented 5 months ago

1, 4, 8, and 9 are not emphasized because the wide left parenthesis (U+FF08) is not in the ranges @rxliuli specified above.

Why? Take a look at ハイパーテキストコーヒーポット制御プロトコル**（HTCPCP）** The first ** is around CJK char ル, which should be left flanking. The second ** is followed by End of line, is this not a valid rignt flanking?

ArcticLampyrid commented 5 months ago

BTW, wide parenthesis should be handled, even if they are not used with CJK chars.

Maybe we should distinguish Open Punctuation and Close Punctuation (Fortunately, Unicode split them in different categories)

https://www.compart.com/en/unicode/category/Ps https://www.compart.com/en/unicode/category/Pe

rxliuli commented 5 months ago

1, 4, 8, and 9 are not emphasized because the wide left parenthesis (U+FF08) is not in the ranges @rxliuli specified above. Maybe those ranges are not sufficient?

@jgm I did work separately on some other symbols commonly used in Chinese.

export function isChineseOrSymbol(s: string): boolean {
  return /[\u4E00-\u9FFF]/.test(s) || '，。、；：？！“”‘’（）【】《》—～…·〃-々'.split('').includes(s)
}

Unit test

export const STRONG_CASE = [
  ['**真，**她', '<p><strong>真，</strong>她</p>'],
  ['**真。**她', '<p><strong>真。</strong>她</p>'],
  ['**真、**她', '<p><strong>真、</strong>她</p>'],
  ['**真；**她', '<p><strong>真；</strong>她</p>'],
  ['**真：**她', '<p><strong>真：</strong>她</p>'],
  ['**真？**她', '<p><strong>真？</strong>她</p>'],
  ['**真！**她', '<p><strong>真！</strong>她</p>'],
  ['**真“**她', '<p><strong>真“</strong>她</p>'],
  ['**真”**她', '<p><strong>真”</strong>她</p>'],
  ['**真‘**她', '<p><strong>真‘</strong>她</p>'],
  ['**真’**她', '<p><strong>真’</strong>她</p>'],
  ['**真（**她', '<p><strong>真（</strong>她</p>'],
  ['**真）**她', '<p><strong>真）</strong>她</p>'],
  ['**真【**她', '<p><strong>真【</strong>她</p>'],
  ['**真】**她', '<p><strong>真】</strong>她</p>'],
  ['**真《**她', '<p><strong>真《</strong>她</p>'],
  ['**真》**她', '<p><strong>真》</strong>她</p>'],
  ['**真—**她', '<p><strong>真—</strong>她</p>'],
  ['**真～**她', '<p><strong>真～</strong>她</p>'],
  ['**真…**她', '<p><strong>真…</strong>她</p>'],
  ['**真·**她', '<p><strong>真·</strong>她</p>'],
  ['**真〃**她', '<p><strong>真〃</strong>她</p>'],
  ['**真-**她', '<p><strong>真-</strong>她</p>'],
  ['**真々**她', '<p><strong>真々</strong>她</p>'],
  ['**真**她', '<p><strong>真</strong>她</p>'],
  ['她**真**', '<p>她<strong>真</strong></p>'],
]

ref: https://github.com/mark-magic/mark-magic/blob/2adb11412a7c7789a66cf572d38cf9b905a4765a/packages/mdast-util-cjk-space-clean/src/utils.ts#L1-L3

ArcticLampyrid commented 5 months ago

[`**真，** 她`, '<p><strong>真，</strong>她</p>']

~~I do not think these things are right.~~

~~Why you insert a space after comma (，, U+FF0C)? As a native Chinese speaker, I do not think it's idiomatic. And why this space are not expected to render? I think the expected output will be 真， 她 (the space before 她 is kept)~~

Note: @rxliuli correct the test cases on 2024-05-08T03:12:20Z.

commonmark / commonmark-spec

Emphasis with CJK punctuation #650