ptmkenny opened this issue 4 years ago
It might be better if we added numeric digits to "CJK characters", too.
Call the number **(000)**0000-0000. The price is ***€*100** now! (note: not US$ or £!)
Maybe another issue? This problem seems to be encountered much less often.
@rxliuli
This problem seems to be encountered much less often.
Correct, this is just "It would be better if we add them together".
@rxliuli I think it's reasonable to aim to support numbers in CJK to the same extent they are supported in English.
It's important to test both single-byte and double-byte numbers. For example, these should be handled:
IDが**001**になります。
IDが**001**になります。
IDが**001号**になります。
IDが**001号**になります。
It's important to test both single-byte and double-byte numbers.
Note: double-width numbers are in FF00-FFEE (Halfwidth and Fullwidth Forms).
We have not added some minor ideographs or hentaigana (extremely minor even in daily life; I have only seen them on billboards or hanafuda cards (akayoroshi)).
@rxliuli
This problem seems to be encountered much less often.
Correct, this is just "It would be better if we add them together".
Of course, I'm just worried that this issue requires too much support and will fall into stagnation again. After all, this problem has existed for too long.
@wooorm 's approach may not work for like Go**「初心者」**を対象とした記事です。
.
As I wrote:
It seems impossible to perfect the emphasis notation where the start and end symbols are the same
It is not perfect, but it makes things better for CJK.
And it is not "extremely simple" for CommonMark parser authors, because a character definition that should ignore surrounding spaces is complicated.
However, as we saw, it looks like the CommonMark owners will not add new notation like an escaped space, so wooorm's approach seems the only possible CommonMark spec solution for CJK.
goldmark already has a character definition which should ignore surrounding spaces: https://github.com/yuin/goldmark/blob/master/util/util_cjk.go#L192
This definition is based on the W3C 'Space-Discarding Unicode Characters' list, so it should work to some extent for non-CJK languages that have the same trouble as CJK (I'm not sure). Note that this definition is most likely imperfect (there's probably an ongoing discussion somewhere).
By the way, if this kind of approach is acceptable, we may have a chance to resolve the 'unnecessary space' problem in CJK. CJK users often see 'unnecessary spaces' because a newline is rendered as a space. The goldmark CJK extension removes a newline based on the above 'Space-Discarding Unicode Characters' list.
we may have a chance to resolve 'unnecessary space' problem in CJK.
Discussion of that should probably go in a separate issue. Pandoc also has an option to ignore newlines, as well as an option to ignore space adjacent to CJK characters. But these things don't require a change to the parser; they can be implemented as renderer options.
Discussion of that should probably go in a separate issue.
I agree, so I will avoid discussing the details. 'They can be implemented as renderer options', but this is not in the CommonMark spec, so many libraries do not implement it. If it were in the CommonMark spec, many libraries would implement it. In addition, the pandoc implementation is so naive that it has a lot of missing cases.
Go**「初心者」**を対象とした記事です。
This is it.
Runs can open if the character before them is CJK
Runs can open if at least one of the character before them, the character after them (= the punctuation itself), or the character after that (= after the punctuation) is CJK
Runs can open if at least one of the character before them, any punctuation in the punctuation sequence (= the punctuation itself), or the character after the punctuation sequence (= after the punctuation) is CJK
↑ Changing to this increases the coverage, but parsers will be more complex because they are required to remember multiple previous characters and peek at multiple characters ahead.
Replies to questions from earlier proposals:
Have even numbers of `\'`s already been accepted […] — @tats-u
No, escapes or not do not affect the emphasis/strong algo currently.
You can see this with character references such as `&nbsp;` (which would be a space) or `&#x61;` (which would be an `a`): those “would be” characters are not used. The `&` and `;` instead matter.
This behavior may sound useless, but such behavior is what allows users to work around this very issue, for languages that do not use whitespace for example.
`[*...*]` can be interpreted as a combination of a footnote label `[...]` and an emphasis. — @tats-u
Those are not footnote labels. GFM, which includes footnotes, does not display the value of footnote labels, which are written as `[^a]`. Anything after `^` and before `]` is dropped. And it cannot include whitespace.
And yes, extensions would have to deal with that, but I don’t worry about directives or footnotes!
Re numbers:
Correct, this is just "It would be better if we add them together". — @tats-u
The downside is that there is a reason for all this detection: people also use `_` and `*` in words and want to see them.
I get that theoretically someone might want `**(000)**0000-0000.` and `***€*100**` to be bold. But I can also come up with cases where people want `*` in a word (`n*2=4` or so). We don’t want to go too far and see everything as a boundary.
I think it's reasonable to aim to support numbers in CJK to the same extent they are supported in English. @ptmkenny
These examples of numbers are supported by my proposal, as they are surrounded by CJK. They’re at boundaries already.
Re space-discarding characters:
This definition is based on the W3C 'Space-Discarding Unicode Characters' list, so it should work to some extent for non-CJK languages that have the same trouble as CJK (I'm not sure). Note that this definition is most likely imperfect (there's probably an ongoing discussion somewhere). — @yuin
This definition has been removed from CSS. Do you have insights on why it was removed, and what replaced it?
Runs can open if at least one of the character before them, any punctuation in the punctuation sequence (= the punctuation itself), or the character after the punctuation sequence (= after the punctuation) is CJK — @tats-u
I don’t understand this sentence. I do not know what you mean. Can you please rephrase it?
And re CJK around line endings: I don’t think this needs to be discussed here, now. And, if it was discussed, I probably think it has no place in CM. Proof of that is that CSS is solving it. Because in HTML too people use line endings to hard wrap text. It doesn’t require a change to the HTML spec. But to CSS.
@wooorm
(This is the last reply about line endings. If necessary, let's set up a separate issue for discussion)
If I understand the current situation correctly, you misunderstood a little. CSS does not resolve CJK line-ending issues for now. This kind of Segment Break Transformation Rules is not implemented in Chromium and most browsers.
Ideally, we hope many browsers implement it. But we need a practical solution. From a Western point of view it may be "Please be patient, it will be implemented someday", but often "someday" for this kind of non-Western problem is the distant future, or never comes (like this emphasis issue).
But I do not expect this kind of thing to be defined in CommonMark, so I've implemented it as an extension for my library. I only meant to touch on this topic lightly.
Sorry for the confusion :pray: We got sidetracked, so let's get back to the main subject.
This definition has been removed from CSS. Do you have insights on why it was removed, and what replaced it?
I do not know the details about this. But considering the above situation, it seems they almost gave up on this sort of approach. There are many threads left open about this on GitHub, e.g.: https://github.com/w3c/csswg-drafts/issues/5017
What a big world we live in: so many languages exist, and about 150,000 characters are defined in Unicode, so this kind of "if character XXX comes before..." approach seems a thorny path.
However, in the discussions so far, this kind of approach seems the only possible CommonMark spec solution for CJK. We CJK people would be happy if this kind of solution became part of the CommonMark spec rather than nothing happening about this issue.
I obviously don't know all the languages in the world, but I think goldmark's character list will be helpful for this kind of approach.
@tats-u - I assume this was meant to be a counterexample to the original proposal?
Go**「初心者」**を対象とした記事です。
What about this?
CJK characters, including letters, digits, and symbols and punctuation, are all treated as equivalent to letters for purposes of determining flankingness.
With that change your example should be handled fine. You'd only run into problems if you used non-CJK punctuation, e.g.
Go**"初心者"**を対象とした記事です。
I feel like you misread my comment. The work happens in CSS and not HTML. That’s my point. For the rest of your comments, I join you in wanting to burn the imperialist west down.
I do not know details about this
I tracked it down and feedback from CJK users showed that the algorithm was not good.
OK, I made a table of the characters that are currently being discussed.
basic list | w3c | Unicode Range | Description | Current classification
---|---|---|---|---
x | x | 2E80-2EFF | CJK Radicals Supplement | punctuation
x | x | 2F00-2FDF | Kangxi Radicals | punctuation
x | | 2FF0-2FFF | Ideographic Description Characters | punctuation
x | x | 3000-303F | CJK Symbols and Punctuation | mixed punctuation and other
x | x | 3040-309F | Hiragana | mostly other; some punctuation
x | x | 30A0-30FF | Katakana | other
x | | 3100-312F | Bopomofo | other
x | | 3130-318F | Hangul Compatibility Jamo | other
x | | 3190-319F | Kanbun | mostly other; some punctuation
x | | 31C0-31EF | CJK Strokes | punctuation
x | | 31F0-31FF | Katakana Phonetic Extensions | other
x | | 3200-32FF | Enclosed CJK Letters and Months | mostly punctuation; some other
x | | 3300-33FF | CJK Compatibility | punctuation
x | x | 3400-4DBF | CJK Unified Ideographs Extension A | other
x | x | 4E00-9FFF | CJK Unified Ideographs | other
x | | A000-A48F | Yi Syllables | other
x | | A490-A4CF | Yi Radicals | punctuation
x | x | F900-FAFF | CJK Compatibility Ideographs | other
x | | FE10-FE1F | Vertical Forms | punctuation
x | | FE30-FE4F | CJK Compatibility Forms | punctuation
x | | FE50-FE6F | Small Form Variants | punctuation
x | x | FF00-FFEE | Halfwidth and Fullwidth Forms | mixed punctuation and other
x | | 1B000-1B0FF | Kana Supplement | other
x | | 1B100-1B12F | Kana Extended-A | other
x | | 1B130-1B16F | Small Kana Extension | other
x | | 20000-2A6DF | CJK Unified Ideographs Extension B | mostly other; some punctuation
x | | 2A700-2B73F | CJK Unified Ideographs Extension C | mostly other; some punctuation
x | | 2B740-2B81F | CJK Unified Ideographs Extension D | other
x | | 2B820-2CEAF | CJK Unified Ideographs Extension E | other
x | | 2CEB0-2EBEF | CJK Unified Ideographs Extension F | other
x | | 2F800-2FA1F | CJK Compatibility Ideographs Supplement | other
x | | 30000-3134F | CJK Unified Ideographs Extension G | mostly other; some punctuation
And a table of examples.
Question for CJK speakers: can you come up with real sentences that contain these characters, and figure out a test case that currently does not work with https://spec.commonmark.org/dingus/ ?
Unicode Range | Description | Example | Solved with proposal?
---|---|---|---
2E80-2EFF | CJK Radicals Supplement | **テスト。**テスト | Yes
2F00-2FDF | Kangxi Radicals | |
2FF0-2FFF | Ideographic Description Characters | |
3000-303F | CJK Symbols and Punctuation | 太郎は**「こんにちわ」**といった | Yes
3040-309F | Hiragana | **[リンク](https://example.com)**も注意。 , 先頭の** コードも注意。** | Yes
30A0-30FF | Katakana | |
3100-312F | Bopomofo | |
3130-318F | Hangul Compatibility Jamo | |
3190-319F | Kanbun | |
31C0-31EF | CJK Strokes | |
31F0-31FF | Katakana Phonetic Extensions | |
3200-32FF | Enclosed CJK Letters and Months | |
3300-33FF | CJK Compatibility | これは100**㌍**です。 | No
3400-4DBF | CJK Unified Ideographs Extension A | |
4E00-9FFF | CJK Unified Ideographs | |
A000-A48F | Yi Syllables | |
A490-A4CF | Yi Radicals | |
F900-FAFF | CJK Compatibility Ideographs | |
FE10-FE1F | Vertical Forms | |
FE30-FE4F | CJK Compatibility Forms | |
FE50-FE6F | Small Form Variants | |
FF00-FFEE | Halfwidth and Fullwidth Forms | **テスト?**テスト , カッコに注意**(太字にならない)**文が続く場合に要警戒。 | Yes
1B000-1B0FF | Kana Supplement | |
1B100-1B12F | Kana Extended-A | |
1B130-1B16F | Small Kana Extension | |
20000-2A6DF | CJK Unified Ideographs Extension B | |
2A700-2B73F | CJK Unified Ideographs Extension C | |
2B740-2B81F | CJK Unified Ideographs Extension D | |
2B820-2CEAF | CJK Unified Ideographs Extension E | |
2CEB0-2EBEF | CJK Unified Ideographs Extension F | |
2F800-2FA1F | CJK Compatibility Ideographs Supplement | |
30000-3134F | CJK Unified Ideographs Extension G | |
Great wrap-up.
@jgm 's example Go**"初心者"**を対象とした記事です。 is not working in the commonmark dingus. Where should it be in your table?
@yuin is this a realistic example? Would people use `"` in this context? I suppose another similar case would use a yen symbol after the `**`.
No, that example doesn’t get solved by this approach or by CM currently. Yes, as far as I understand it’s a real use case: folks should use CJK quotes, but from what I read in Unicode, more and more people are sometimes a bit more lax and just use easy-to-type characters.
Reiterating: I am looking for more examples. Please, CJK users, look at the above table and see if you can come up with sentences that contain those characters and things that could be emphasized :)
@jgm Yes. We also use half-width symbols without spaces (science and tech people prefer halfwidth symbols). However, I think halfwidth symbols with emphasis are a rare case, so I think they can be ignored rather than letting nothing happen for this issue.
Usually, we use words without character codes in mind, so it's quite difficult...
- `**⻲田太郎**と申します` (`⻲` is it)
- `コード` is katakana
- `・**㋐**:選択肢1つ目` (`㋐` is it)
- `これは100**㌍**です。` (`㌍` is it; `㌍` is a shorthand for the word `カロリー`: 3 characters in 1 character)
- `これは**テスト**です。` (`テスト` is halfwidth)

It probably works for the noticed (already listed) categories (at least for Japanese). The hard parts may be: mixed with other languages' alphabets and symbols, and other languages we haven't noticed (e.g. Thai?).
As I mentioned earlier, I would be happy if this kind of solution became part of the CommonMark spec for Japanese! For the hard (and rare) parts, we can use HTML tags.
The main issue with Chinese text is the occurrence of symbols, such as the common Chinese symbols listed below. If they appear just inside the closing delimiter of bold or italic text, they cause rendering errors.
example | render
---|---
**真,**她 | 真,她
**真。**她 | 真。她
**真、**她 | 真、她
**真;**她 | 真;她
**真:**她 | 真:她
**真?**她 | 真?她
**真!**她 | 真!她
**真“**她 | 真“她
**真”**她 | 真”她
**真‘**她 | 真‘她
**真’**她 | 真’她
**真(**她 | 真(她
**真)**她 | 真)她
**真【**她 | 真【她
**真】**她 | 真】她
**真《**她 | 真《她
**真》**她 | 真》她
**真—**她 | 真—她
**真~**她 | 真~她
**真…**她 | 真…她
**真·**她 | 真·她
**真〃**她 | 真〃她
**真-**她 | 真-她
**真々**她 | 真々她
**真**她 | 真她
There are some workarounds, but they are not intuitive, and without checking the rendering result, it would not be possible to know.
example | render
---|---
**真,** 她 | 真, 她
**真**,她 | 真,她
**真,**​她 | 真,​她
- Comma (,): used for a brief pause within a sentence.
- Period (。): used to end a sentence.
- Enumeration comma (、): used between parallel terms.
- Semicolon (;): used to separate clauses within a complex sentence, indicating a pause longer than a comma.
- Colon (:): used to introduce explanations, reasons, results, etc.
- Question mark (?): used at the end of interrogative sentences.
- Exclamation mark (!): used to express strong emotions.
- Quotation marks (“” and ‘’): used to directly quote someone's words or indicate specific nouns.
- Parentheses (() and 【】): used to insert supplementary explanations or comments.
- Book title marks (《》): used for the titles of books, articles, etc.
- Dash (——): used for emphasis or explanation.
- Tilde (~): indicates range or fluctuation.
- Ellipsis (……): indicates interruption or omission of speech.
- Interpunct (·): used between names or to separate parts.
- Emphasis mark (〃): used to emphasize a certain word or phrase.
- Hyphen (-): used to connect words.
- Ditto mark (々): indicates the immediate repetition of the preceding character or word.
@rxliuli
All your examples seem to be resolved by the proposal :).
My Note:
My thoughts about nesting. Some nesting still works, but in a different way:
this is **foo* bar.* baz
これは**ほげ*ふが。*です
<p>this is <em><em>foo</em> bar.</em> baz</p>
<p>これは**ほげ<em>ふが。</em>です</p>
I tried many different ways (inserting spaces anywhere); it is hard to make emphasis work the same as in English. But honestly speaking, I have never seen nested emphasis in production, so I don't think this is a problem.
@yuin
we use words without character codes in mind, so it's quite difficult...
If you look up the label on Wikipedia, e.g. “CJK Radicals Supplement”, you can see the characters there.
Can you come up with:
- a non-punctuation character before `⻲` in `**⻲田太郎**と申します` (CJK radicals)
- a non-punctuation character instead of `・` in `・**㋐**:選択肢1つ目` (Enclosed CJK Letters and Months)

Then it currently doesn’t form, but will form.
Vertical Forms: cannot be used in CommonMark (I believe). We sometimes write "vertically", but CommonMark only considers "horizontal" forms.
Why not? I read LTR and RTL languages and those are just bytes, why would things that finally end up being displayed vertically not work?
The hard parts may be: mixed with other languages' alphabets and symbols; other languages we haven't noticed (e.g. Thai?)
Yep, I’ll search for Burmese / Myanmar, Dzongkha, Javanese, Khmer, Lao, Tai Lue, Thai, and Tibetan speakers next :)
As I mentioned earlier, It would be happy if this kind of solution become CommonMark spec for Japanese! For hard(and rare) parts, we can use HTML tags.
Yes, working on it! But first I want to get some good examples, so that we know what doesn’t work but should realistically and reasonably work.
@rxliuli Thanks! Some of these are indeed covered by this proposal, and have examples already (3000-303F: CJK Symbols and Punctuation, and FF00-FFEE: Halfwidth and Fullwidth Forms).
Some others, such as U+2018 (‘), U+2019 (’), U+201C (“), and U+201D (”), are not covered, because they are not specific to CJK. As far as I understand they shouldn’t (much) be used in CJK languages, as there are better alternatives?
Some of these examples look theoretical to my eye. `**真“**她`, for example, looks like someone opened a quote but didn’t close it. Why would someone put emphasis around a broken quote? I am really looking for well-formed sentences where the emphasis makes sense: where the emphasis is around a “whole unit”.
Some others, such as — U+2018 (‘), U+2019 (’), U+201c (“), U+201d (”) are not covered. Because they are not specific to CJK. As far as I understand they shouldn’t (much) be used in CJK languages, as there are better alternatives?
From what I understand, these symbols are often used in Chinese, and there really aren't alternative symbols. As mentioned above, they are used to express spoken content or quote others, which is very common.
Some of these examples look theoretical to my eye. For 真“她 for example, that looks like someone opened a quote but didn’t close it. Why would someone put emphasis around a broken quote? I am really looking for well-formed sentences where the emphasis makes sense: where the emphasis is around a “whole unit”.
Yes, "真“她" is not a good example; it would generally be "真”她". It was just mentioned in passing; after all, they are symbols adjacent to each other in Unicode, right?
If you're looking for real-world examples, a Chinese novel I previously maintained has 2711 occurrences of this, which is very common. The solution I'm currently using is to add extra spaces. But as mentioned above, this is not intuitive for maintainers.
Regex search: `\*\*.*?[,。、;:?!“”‘’()【】《》—~…·〃-々]\*\*`
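For illustration, the same pattern can be run programmatically. This is only a sketch of the search described above, not code from any parser; the function name is made up.

```typescript
// Emphasis spans whose last character before the closing ** is one of the
// Chinese punctuation marks listed earlier in this thread.
const riskyEmphasis = /\*\*.*?[,。、;:?!“”‘’()【】《》—~…·〃-々]\*\*/g;

function countRiskySpans(text: string): number {
  return [...text.matchAll(riskyEmphasis)].length;
}

// Only the first span ends in CJK punctuation before the closing **:
console.log(countRiskySpans('**真,**她 and **真**她')); // 1
```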
@wooorm
Can you come up with: a non-punctuation character before ⻲ in ⻲田太郎と申します (CJK radicals); a non-punctuation character instead of ・ in ・㋐**:選択肢1つ目 (Enclosed CJK Letters and Months). Then it currently doesn’t form, but will form.
- `私は**⻲田太郎**と申します` (CJK radicals)
- `選択肢**㋐**: 1つ目の選択肢` (Enclosed CJK Letters and Months)

Why not? I read LTR and RTL languages and those are just bytes, why would things that finally end up being displayed vertically not work?
That makes sense. I tried it and it works! But it cannot be rendered correctly on GitHub, so I've used the dingus: Link
`**さようなら︙**と太郎はいった。`: `︙` is a Vertical Form (of `…`).
Runs can open if the character before them is CJK and close if the character after is CJK.
Originally posted by @wooorm in https://github.com/commonmark/commonmark-spec/issues/650#issuecomment-1937082248
I approve this proposal. This will resolve a lot of problems for Chinese users.
I don’t understand this sentence. I do not know what you mean. Can you please rephrase it?
However, parsers will become more complex, and the explainability of the algorithm for anyone other than library authors will decrease.
Also, this still cannot fix the following example:
.NET**(.NET Frameworkは不可)**では、
n*2=4
This made me give up on numbers. I thought one should escape `*`, though.
No, escapes or not do not affect the emphasis/strong algo currently.
I confirmed in dingus.
Theoretically we could extend HTML escapes with something like `&noop;` and replace it with an empty string, which looks as bad as ZWSP though.
Would people use `"` in this context?
`“` & `”` (U+201C–U+201D) are much more common than `"`. If we type the `"` key (Shift + 2 on a Japanese keyboard) while a Japanese IME is on, we get `”` (U+201D).
Kana Supplement
Here is a Kana Supplement example (requires IPAmj Mincho to display):
あ𛀙**(か)**よろし
↑Image source: https://www.tengudo.jp/blog/karuta-news/323.html (Japanese)
As far as I understand they shouldn’t (much) be used in CJK languages,
These have caused much trouble on macOS. I have created a Rust library to handle them in HFS+. Most Windows and Linux users do not have the opportunity to handle them.
CJK Radicals Supplement
I produced the following example, which was likely to contain them, using Google Japanese Input:
「禰󠄀」の偏は示ではなく**礻**です。
However, `示` & `礻` here are normal kanji (CJK Unified Ideographs). I wonder how ordinary Japanese users would input them from romaji.
https://www.benricho.org/moji_conv/14-kanji_koukibusyucheck/ (Japanese)
As far as I understand they shouldn’t (much) be used in CJK languages, as there are better alternatives?
They sometimes appear in books unrelated to computers. No alternatives. FYI, I prefer `「」` to `“”` and `『』` to `‘’`, but some people prefer `“”` over `「」`.
I approve this proposal. This will resolve a lot of problems for Chinese users.
Some text does not work as intended:
Git**(注:不是GitHub)**
It seems impossible to fix this issue completely in a simple way. It depends on how much we can compromise.
**⻲田太郎**と申します
亀 (U+4E80) is much more common here. I doubt that anyone uses U+2EF2 (⻲).
`**さようなら︙**と太郎はいった。`: `︙` is `...`. We can convert from `…` to that character thanks to CSS in vertical writing mode:
https://codepen.io/tats-u/pen/oNVmNvY
I think we do not have to consider such characters dedicated to vertical writing while we use CSS to render text.
I have tracked this issue all the way, and many people have raised it. I encountered it when using outlines as well. I understand that the fundamental reason for this unexpected rendering is that the logic for determining whether `**` can be closed is incorrect.
For example, `**aa+**` is recognized as a strong tag, but `**aa+**a` directly outputs `**`. The difference lies in whether there is whitespace or a character at the end; if there's whitespace, then it can be closed.
The fundamental problem is that the parsing process does not consider the previous state. I think a more reasonable approach would be: if there's an open `**` before, then encountering another `**` later should always allow closing. Based on this principle, I also implemented a markdown-it plugin at https://github.com/shepherdwind/match-pairs . With this plugin, most similar issues can be resolved and all old markdown-it test cases pass.
I understand that the fundamental reason for this unexpected rendering is that the logic for determining whether `**` can be closed is incorrect.
No, the fundamental reason is that the spec is designed to support nested emphasis, and then introduces an adaptive method to detect whether `**` is used as an opening tag or a closing tag.
Take a look at `**aa+**a****`; it should render as `<strong>aa+<strong>a</strong></strong>`. Yes, a `<strong>` tag inside another `<strong>` tag.
I don't think this is an important feature, but the specification does.
if there's an open `**` before, then encountering another `**` later should always allow closing.
Then nested emphasis won't be supported any more.
Thank you for the reminder. I checked the spec document and indeed it supports nesting. I think we can adjust this part of the logic as follows: if there's an open `**` before, then when encountering another `**` later, we need to check whether there is a closable tag after this `**`. If so, keep it unchanged; otherwise, this `**` should be closable.
Just add another condition to ensure compatibility with the old logic. I think this should work. I'll try it out and add some commonmark test cases.
No, the fundamental reason is that the spec is designed to support nested emphasis, and then introduces an adaptive method to detect whether ** is used as an opening tag or a closing tag.
Let me add that the "adaptive" method is designed as a heuristic (currently) based only on some space-separated Western languages. This method consists of 17 rules. It is very hard for writers to grasp these rules when their markup is not rendered as expected. To take the matter further, original Markdown converts:
`**aa+**a` -> `<strong>aa+</strong>a`
`太郎は**「こんにちわ」**といった。` -> `太郎は<strong>「こんにちわ」</strong>といった。`
`**test*aaa***` -> `<strong>test<em>aaa</em></strong>` (original Markdown supports nested emphasis in some cases)
This means emphasis in CommonMark is incompatible with original Markdown, and this issue was introduced by CommonMark.
It is very hard to realize these rules for writers when their markup is not rendered as expected.
I agree. It's worse that CJK does not use spaces or any other character to split words.
If only some special uses needed to be handled, I would advise introducing escape sequences. But if most CJK users must use escape sequences almost every time for emphasis, it's unacceptable. For this issue, changes other than introducing escape sequences are required.
For me, dropping support for nested emphasis is more acceptable than having to use escape sequences or HTML tags (`<strong>aaa</strong>`) every time.
the fundamental reason is that the spec is designed to supports nested emphasis
This is why this problem can't be completely solved. We have to compromise to some extent.
I decided to prioritize the candidate solutions; the easiest and safest one is to check only whether either of the nearest characters to `**` (or `*`) is CJK.
↑ We don't have to check any characters other than these two.
I'll put other plans/ideas on ice because they're less efficient.
We can treat all characters in U+20000–U+3FFFF as CJK because both planes are reserved for minor Han characters. Here is an example for Ideographs Extension G: `𰻞𰻞**(ビャンビャン)**麺`. However, Extension H is also available now, so I'm concerned that such work may be endless.
Block | Example | Remarks
---|---|---
Katakana | ハイパーテキストコーヒーポット制御プロトコル**(HTCPCP)** |
CJK Compatibility Ideographs | 﨑**(崎)** | "﨑" is an unofficial form but sometimes used in last names instead of "崎"
CJK Unified Ideographs | 国際規格**[ECMA-262](https://tc39.es/ecma262/)** |
CJK Unified Ideographs Extension A | 㐧**(第の俗字)** | Unofficial form
CJK Unified Ideographs Extension B | 𠮟**(こちらが正式表記)** |
CJK Unified Ideographs Extension C | 𪜈**(トモの合略仮名)** | Mixed-in uncommon joined katakana
CJK Unified Ideographs Extension D | 𫠉**(馬の俗字)** | Unofficial form
CJK Unified Ideographs Extension E | 谺𬤲**(こだま)**石神社 | Shrine name
CJK Unified Ideographs Extension F | 石𮧟**(いしただら)** | Address
This is a very complex discussion. I think something along the lines of what @tats-u suggests is a practical and minimalist solution. As I understand it, the idea would be that a delimiter run, e.g. `**`, would be treated as both left- and right-flanking if either of the surrounding characters is CJK. We'd need to specify which code points count as CJK.
Is that correct @tats-u ? If so, let me know what a sensible CJK code point range is and I could try implementing this in cmark to see how it works.
Posting https://github.com/commonmark/commonmark-spec/issues/650#issuecomment-1939190815 again. I was still waiting for more examples from speakers of various languages.
@jgm
both left- and right-flanking if either of the surrounding characters is CJK. We'd need to specify which code points count as CJK.
Your recognition is correct. We may be able to enlarge the range to `*`, too.
We may have to add more complex rules in the future to deal with cases that can't be handled by this rule, but we don't know their severity yet.
Some Unicode blocks are too minor to be used on a daily basis, so we can use tentative examples based on characters' code points, like `𠮟**(U+20B9F)**`. (Parentheses would be better in ASCII.)
Yi characters aren't used for Mandarin. It'll be hard to find people on GitHub who use them on a daily basis.
Can you give me the code point ranges you think we should be treating as CJK? (the actual numbers, not just names)
Can you give me the code point ranges you think we should be treating as CJK? (the actual numbers, not just names)
I generally use `/[\u4E00-\u9FFF]/` to determine whether something is a CJK character, but let's hear what @tats-u has to say about it.
BTW, I wonder which is better to maintain, an allowlist or a denylist?
Enable the adaptive rule only if it's around Western languages, or disable it only if it's around CJK chars?
IMO, I prefer "enable the adaptive rule only if it's around Western languages", because:
Implemented this in cmark to test the idea:
```c
left_flanking = numdelims > 0 && !cmark_utf8proc_is_space(after_char) &&
    ((!cmark_utf8proc_is_punctuation_or_symbol(after_char) ||
      cmark_utf8proc_is_space(before_char) ||
      cmark_utf8proc_is_punctuation_or_symbol(before_char)) ||
     (cmark_utf8proc_is_CJK(before_char) ||
      cmark_utf8proc_is_CJK(after_char)));
right_flanking = numdelims > 0 && !cmark_utf8proc_is_space(before_char) &&
    ((!cmark_utf8proc_is_punctuation_or_symbol(before_char) ||
      cmark_utf8proc_is_space(after_char) ||
      cmark_utf8proc_is_punctuation_or_symbol(after_char)) ||
     (cmark_utf8proc_is_CJK(before_char) ||
      cmark_utf8proc_is_CJK(after_char)));
```
I used the ranges specified by @rxliuli above for is_CJK.
All current tests pass. I used the following new tests, based on @tats-u's comment above:
1. ハイパーテキストコーヒーポット制御プロトコル**(HTCPCP)**
2. 﨑**(崎)**
3. 国際規格**[ECMA-262](https://tc39.es/ecma262/)**
4. 㐧**(第の俗字)** (Unofficial form)
5. 𠮟**(こちらが正式表記)**
6. 𪜈**(トモの合略仮名)** (Mixed-in uncommon joined katakana)
7. 𫠉**(馬の俗字)** (Unofficial form)
8. 谺𬤲**(こだま)**石神社 (shrine)
9. 石𮧟**(いしただら)** (address)
And I got these results:
```html
<ol>
<li>ハイパーテキストコーヒーポット制御プロトコル**(HTCPCP)**</li>
<li>﨑<strong>(崎)</strong></li>
<li>国際規格<strong><a href="https://tc39.es/ecma262/">ECMA-262</a></strong></li>
<li>㐧**(第の俗字)** (Unofficial form)</li>
<li>𠮟<strong>(こちらが正式表記)</strong></li>
<li>𪜈<strong>(トモの合略仮名)</strong> (Mixed-in uncommon joined katakana)</li>
<li>𫠉<strong>(馬の俗字)</strong> (Unofficial form)</li>
<li>谺𬤲**(こだま)**石神社 (shrine)</li>
<li>石𮧟**(いしただら)** (address)</li>
</ol>
```
1, 4, 8, and 9 are not emphasized because the wide left parenthesis (U+FF08) is not in the ranges @rxliuli specified above. Maybe those ranges are not sufficient?
@jgm Could you adopt only the strings in the inline code? And you can change the parentheses (U+FF08 & U+FF09) to ASCII ones to take only the target characters into account.
https://github.com/commonmark/commonmark-spec/issues/650#issuecomment-1939237484
All Unicode blocks in the table in the above comment at least must be treated as CJK.
I fixed the table.
Is there no sufficient Unicode character property (combination) that could be used instead of listing blocks or ranges?
1, 4, 8, and 9 are not emphasized because the wide left parenthesis (U+FF08) is not in the ranges @rxliuli specified above.
Why? Take a look at `ハイパーテキストコーヒーポット制御プロトコル**(HTCPCP)**`. The first `**` is next to the CJK char `ル`, which should make it left-flanking. The second `**` is followed by the end of the line; is this not valid right-flanking?
BTW, wide parentheses should be handled even if they are not used with CJK chars.
Maybe we should distinguish Open Punctuation and Close Punctuation (fortunately, Unicode splits them into different categories):
https://www.compart.com/en/unicode/category/Ps https://www.compart.com/en/unicode/category/Pe
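A quick sketch of that idea follows. The helper names are mine, not from any parser; JavaScript regexes expose the Unicode general categories directly via `\p{Ps}` and `\p{Pe}`.

```typescript
// Unicode already separates opening (Ps) and closing (Pe) punctuation, so a
// parser could plausibly let a run open before an opening bracket and close
// after a closing bracket, regardless of script.
const isOpenPunct = (ch: string): boolean => /^\p{Ps}$/u.test(ch);
const isClosePunct = (ch: string): boolean => /^\p{Pe}$/u.test(ch);

console.log(isOpenPunct('('), isClosePunct(')'));   // true true  (U+FF08/U+FF09)
console.log(isOpenPunct('「'), isClosePunct('」'));   // true true
console.log(isOpenPunct(')'));                       // false
```

This would cover the wide parentheses that the block-range lists miss, though it would also change behavior for Western brackets, so it could not simply replace the CJK check.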
1, 4, 8, and 9 are not emphasized because the wide left parenthesis (U+FF08) is not in the ranges @rxliuli specified above. Maybe those ranges are not sufficient?
@jgm I did work separately on some other symbols commonly used in Chinese.
```typescript
export function isChineseOrSymbol(s: string): boolean {
  return /[\u4E00-\u9FFF]/.test(s) || ',。、;:?!“”‘’()【】《》—~…·〃-々'.split('').includes(s)
}
```
Unit test
```typescript
export const STRONG_CASE = [
  ['**真,**她', '<p><strong>真,</strong>她</p>'],
  ['**真。**她', '<p><strong>真。</strong>她</p>'],
  ['**真、**她', '<p><strong>真、</strong>她</p>'],
  ['**真;**她', '<p><strong>真;</strong>她</p>'],
  ['**真:**她', '<p><strong>真:</strong>她</p>'],
  ['**真?**她', '<p><strong>真?</strong>她</p>'],
  ['**真!**她', '<p><strong>真!</strong>她</p>'],
  ['**真“**她', '<p><strong>真“</strong>她</p>'],
  ['**真”**她', '<p><strong>真”</strong>她</p>'],
  ['**真‘**她', '<p><strong>真‘</strong>她</p>'],
  ['**真’**她', '<p><strong>真’</strong>她</p>'],
  ['**真(**她', '<p><strong>真(</strong>她</p>'],
  ['**真)**她', '<p><strong>真)</strong>她</p>'],
  ['**真【**她', '<p><strong>真【</strong>她</p>'],
  ['**真】**她', '<p><strong>真】</strong>她</p>'],
  ['**真《**她', '<p><strong>真《</strong>她</p>'],
  ['**真》**她', '<p><strong>真》</strong>她</p>'],
  ['**真—**她', '<p><strong>真—</strong>她</p>'],
  ['**真~**她', '<p><strong>真~</strong>她</p>'],
  ['**真…**她', '<p><strong>真…</strong>她</p>'],
  ['**真·**她', '<p><strong>真·</strong>她</p>'],
  ['**真〃**她', '<p><strong>真〃</strong>她</p>'],
  ['**真-**她', '<p><strong>真-</strong>她</p>'],
  ['**真々**她', '<p><strong>真々</strong>她</p>'],
  ['**真**她', '<p><strong>真</strong>她</p>'],
  ['她**真**', '<p>她<strong>真</strong></p>'],
]
```
[`**真,** 她`, '<p><strong>真,</strong>她</p>']
I do not think these things are right.
~~Why do you insert a space after the comma (`,`, U+FF0C)? As a native Chinese speaker, I do not think it's idiomatic. And why is this space not expected to render? I think the expected output would be `<p><strong>真,</strong> 她</p>` (the space before `她` is kept).~~
Note: @rxliuli corrected the test cases on 2024-05-08T03:12:20Z.
Hi, I encountered some strange behavior when using CJK full-width punctuation and trying to add emphasis.
Original issue here
Example punctuation that causes this issue:
。!?、
To my mind, all of these should work as emphasis, but some do and some don't:
I'm not sure if this is the spec as intended, but in Japanese, as a general rule there are no spaces in sentences, which leads to the following kind of problem when parsing emphasis.
In English, this is emphasized as expected:
This is **what I wanted to do.** So I am going to do it.
But the same sentence emphasized in the same way in Japanese fails:
これは**私のやりたかったこと。**だからするの。