flipacholas / Architecture-of-consoles

Technical articles about console architecture
https://www.copetti.org/writings/consoles/
Creative Commons Attribution 4.0 International

Improving CJK Characters Support #186

Open Cerallin opened 1 year ago

Cerallin commented 1 year ago

I am working on translating the Nintendo DS article into Chinese.

There are 2 minor issues about CJK characters (Chinese, Japanese, and Korean) that I want to ask about.

Space after sentences

The commas and periods in Chinese are as wide as two Latin characters, the same as all the other Chinese characters. Therefore, we do not add spaces after periods and commas. Avoiding unnecessary spaces is easy when writing with markdown: do not append a new line or add a space after each sentence.

I would like to know if it is convenient to remove these spaces after sentences in the Chinese translated markdown file. I'm not sure if this breaks the translation workflow of Crowdin. (It does not matter that much, so don't worry if it's not possible.)

Spaces between CJK characters and Latin characters

It is suggested to add "padding" between CJK characters and Latin characters. The simplest and best way to do this is to import a js file. I would like to know if it is convenient to import it into your project. If not, as a compromise, I will manually add spaces when translating.

flipacholas commented 1 year ago

Hi, thanks for the translations. Just in case, there's been a previous translation that could possibly be used as a reference.

Regarding spaces after sentences, in general, I encourage translators to apply their best judgement. In this case, Crowdin separates translations by sentences, so I think it's automatically adding the spaces after periods and/or commas. I don't think it will allow you to fix that issue by yourself, so I'll try to see if there's an option somewhere that fixes it.

Regarding spaces between CJK characters and Latin characters, I've noticed that the previous NES translation uses characters that include padding (i.e. （ and ）). Would that solve the issue?

Cerallin commented 1 year ago

Thanks for your reply. I've seen the previous translation. I am not going to change anything on that page because I'm not familiar with NES at all. But I will try to be consistent with its conventions.

Regarding spaces between CJK characters and Latin characters, I've noticed that the previous NES translation uses characters that include padding (i.e. （ and ）). Would that solve the issue?

Unfortunately, that's another issue. The usage of different brackets (（） or ()) may seem complex in Chinese-English mixed text, but I will take care of it.

Adding spaces is what the NES translation actually did. For example, "This is an article about GBA" is translated into "这是关于 GBA 的文章", with spaces surrounding GBA. That's a compromise, because a space character is a little too wide. What I usually do is add custom HTML tags and set their width to 0.8em.

The best solution I've thought of is to import js files to Chinese-translated markdown files only, and there are alternatives too: a script that processes text (if it's possible), or a Crowdin app/plugin (I don't know much about it yet).

flipacholas commented 1 year ago

Hmm, now that I think of it, when the website is generated, the Markdown is converted into HTML, and in between I can add regex calls. So, as a more reliable alternative, I can try to come up with a regular expression that detects Chinese characters next to Latin characters and, with that, adds a space in there (or an HTML tag). This wouldn't conflict with Crowdin or Pandoc, and it would even work without JS.

So, while I experiment with this in the build scripts, could you try to make the translation without using the extra spaces? Hopefully this will work and I'll be able to port it to other CJK scripts. Thanks!

Cerallin commented 1 year ago

I can try to come up with a regular expression that detects Chinese characters next to Latin characters and, with that, adds a space in there (or an HTML tag). This wouldn't conflict with Crowdin or Pandoc, and it would even work without JS.

I prefer workarounds without JS running in browsers too. I realized that your articles will be published not only on the website but also as EPUB using pandoc. In fact, there's no need to insert spaces in EPUB, because most e-book readers can take care of the padding. On the other hand, spaces after sentences still need to be dropped.

Trimming off spaces after sentences seems simple to me: just remove all the spaces (and line feeds) after the ， and 。 characters, which stand for comma and period, respectively.
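The trimming rule described above could be sketched as follows. This is only an illustration under stated assumptions: the function name `trimAfterCjkPunctuation` is hypothetical, the default punctuation set contains just the full-width comma and the ideographic period, and blank lines are deliberately kept because they separate paragraphs in Markdown.

```javascript
// Hypothetical helper, not taken from the site's build scripts.
// `punctuation` is a string of characters placed inside a regex character class,
// so it must not contain regex-special characters such as ] or \.
function trimAfterCjkPunctuation(text, punctuation = '\uFF0C\u3002') {
  const pattern = new RegExp(
    // punctuation, then spaces/tabs, then at most one line break that is
    // NOT followed by a blank line (so paragraph breaks survive)
    '([' + punctuation + '])[ \\t]*(?:\\r?\\n(?!\\s*\\n))?[ \\t]*',
    'g'
  );
  return text.replace(pattern, '$1');
}

console.log(trimAfterCjkPunctuation('你好， 世界。 再见'));
// → 你好，世界。再见
```

A single line break directly before a list marker or heading would also be removed by this sketch, so a real build step would likely need extra guards.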

Now please let me introduce the regex rules for adding HTML tags with JS. The codes below are part of my hexo plugin. Please feel free to use or modify them, and I hope I can explain them clearly.

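// Pattern rules taken from text-autospace.js
// CJK ideographs plus radicals, strokes, and compatibility blocks
const hanzi = '[\u2E80-\u2FFF\u31C0-\u31EF\u3300-\u4DBF\u4E00-\u9FFF\uF900-\uFAFF\uFE30-\uFE4F]',
  // ASCII punctuation, grouped by how it attaches to neighbouring text
  punc = {
    base: "[@&=_\\$%\\^\\*-\\+/]",
    open: "[\\(\\[\\{<‘“]",
    close: "[,\\.\\?!:\\)\\]\\}>’”]"
  },
  // Latin letters (including accented and extended ranges), digits, and base punctuation
  latin = '[A-Za-z0-9\u00C0-\u00FF\u0100-\u017F\u0180-\u024F\u1E00-\u1EFF]' + '|' + punc.base,
  patterns = [
    // A Chinese character followed by Latin text or opening punctuation
    RegExp('(' + hanzi + ')(' + latin + '|' + punc.open + ')', 'gi'),
    // Latin text or closing punctuation followed by a Chinese character
    RegExp('(' + latin + '|' + punc.close + ')(' + hanzi + ')', 'gi')
  ];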

Here are the explanations of each variable:

  1. hanzi: matches Chinese characters (but not all CJK characters).
  2. punc: punctuation characters.
  3. latin: matches Latin characters and basic punctuation characters.
  4. patterns: determine where to insert a space.
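As a rough illustration of how the two patterns might be applied, here is a minimal sketch. The helper name `addSpacers` and the choice of an empty `<hl></hl>` element as the inserted spacer are assumptions made for this example; the thread does not show how the hexo plugin performs the replacement.

```javascript
// Patterns copied from the snippet above; the curly quotes are written as
// \u escapes only to keep this block encoding-safe.
const hanzi = '[\u2E80-\u2FFF\u31C0-\u31EF\u3300-\u4DBF\u4E00-\u9FFF\uF900-\uFAFF\uFE30-\uFE4F]';
const punc = {
  base: '[@&=_\\$%\\^\\*-\\+/]',
  open: '[\\(\\[\\{<\u2018\u201C]',
  close: '[,\\.\\?!:\\)\\]\\}>\u2019\u201D]',
};
const latin = '[A-Za-z0-9\u00C0-\u00FF\u0100-\u017F\u0180-\u024F\u1E00-\u1EFF]' + '|' + punc.base;
const patterns = [
  new RegExp('(' + hanzi + ')(' + latin + '|' + punc.open + ')', 'gi'),
  new RegExp('(' + latin + '|' + punc.close + ')(' + hanzi + ')', 'gi'),
];

// Hypothetical helper: runs each pattern in turn and puts the spacer
// between the two captured characters.
function addSpacers(text, spacer = '<hl></hl>') {
  return patterns.reduce(
    (out, re) => out.replace(re, '$1' + spacer + '$2'),
    text
  );
}

console.log(addSpacers('这是关于GBA的文章'));
// → 这是关于<hl></hl>GBA<hl></hl>的文章
```

Text made up only of Chinese characters passes through unchanged, since neither pattern finds a Chinese/Latin boundary.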

Assuming that tags named <hl> are added between Chinese characters and Latin characters, here's the corresponding stylesheet:

html hl:after {
    content: ' ';
    display: inline;
    font-family: inherit;
    font-size: 0.8em;
}

html code hl,
html pre hl,
html kbd hl,
html samp hl,
html ruby hl,
html .tag-list-item hl {
    display: none;
}
html ol > hl,
html ul > hl {
    display: none;
}

Don't worry if a custom tag ends up in the wrong place: we can still decide whether to show it or not with CSS.

So, while I experiment with this in the build scripts, could you try to make the translation without using the extra spaces? Hopefully this will work and I'll be able to port it to other CJK scripts. Thanks!

I'm glad to do so and see if this helps other CJK translations, though there are Chinese translations only at present :-).

flipacholas commented 1 year ago

That's a great breakdown of the script and it will help me to port the regular expressions. Let me know when you get the Chinese translation ready and I'll test the regex. Many thanks!

Cerallin commented 1 year ago

I've just finished translating the Nintendo DS article (no extra spaces). Please handle it at a time that you deem appropriate.

flipacholas commented 1 year ago

Great, I've deployed it here for testing (it doesn't have the <hl> spaces, for now): https://www.copetti.org/zh-hans/writings/consoles/nintendo-ds/

I'm checking the regex effects on the Markdown article, and there seem to be the following bugs:

  1. Some sentences have extra spaces. Just did a quick review on Crowdin and deleted some of them, so it's just a matter of correcting the translation.

  2. There are some false positives (I think?) with the regexes. For instance, the following text at the start of the article:

和任天堂的[上一代掌机](game-boy-advance)一样,NDS的系统围绕一个名为**CPU NTR**的大芯片展开。

is replaced like this:

和任天堂的<hl>[</hl>上一代掌机](game-boy-advance)</hl>一样,NDS</hl>的系统围绕一个名为<hl>**CPU NTR**</hl>的大芯片展开。

The regex is applied on Markdown, so I think that's creating some confusion on the rules (I'm assuming it was originally made for HTML?). I guess I just need to tweak the regex.

But overall, this is very good progress and I really appreciate there's a new article available in Chinese. I'll try to find the causes of the regex problems meanwhile. Thanks.

Cerallin commented 1 year ago

Great, I've deployed it here for testing (it doesn't have the <hl> spaces, for now): https://www.copetti.org/zh-hans/writings/consoles/nintendo-ds/

Good news! Good to see my translation deployed. I may translate the GBA article later.

  1. Some sentences have extra spaces. Just did a quick review on Crowdin and deleted some of them, so it's just a matter of correcting the translation.

Okay, I will go through and check the spaces on Crowdin.

  1. There are some false positives (I think?) with the regexes. For instance, the following text at the start of the article:
和任天堂的[上一代掌机](game-boy-advance)一样,NDS的系统围绕一个名为**CPU NTR**的大芯片展开。

is replaced like this:

和任天堂的<hl>[</hl>上一代掌机](game-boy-advance)</hl>一样,NDS</hl>的系统围绕一个名为<hl>**CPU NTR**</hl>的大芯片展开。

The regex is applied on Markdown, so I think that's creating some confusion on the rules (I'm assuming it was originally made for HTML?). I guess I just need to tweak the regex.

You are right, it was originally made for HTML. If I were you, I would give up on writing Markdown rules: Markdown is very flexible, so the regex may become too complex and lose readability. I suggest applying the replacements to the HTML files instead.

Please tell me if you need it and I will modify my hexo plugin to handle HTML files as an executable. NodeJS executables run slowly, but a few seconds per file sounds tolerable to me.
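One way the HTML-based approach might look is sketched below. It is not the hexo plugin mentioned above: the function name `addSpacersToHtml`, the reduced character ranges, and the list of skipped elements are all assumptions made for illustration. The idea is to split the document into tags and text, leave tags untouched, and only rewrite text that is not inside elements such as `code` or `pre`.

```javascript
// Simplified stand-ins for the patterns quoted earlier in the thread
// (assumption: reduced ranges, no punctuation rules, for brevity).
const HANZI = '[\u4E00-\u9FFF]';
const LATIN = '[A-Za-z0-9]';
const RULES = [
  new RegExp('(' + HANZI + ')(' + LATIN + ')', 'g'),
  new RegExp('(' + LATIN + ')(' + HANZI + ')', 'g'),
];
// Elements whose text should be left alone (hypothetical list).
const SKIPPED = new Set(['code', 'pre', 'kbd', 'samp']);

function addSpacersToHtml(html, spacer = '<hl></hl>') {
  let depth = 0; // greater than 0 while inside a skipped element
  // Splitting on a capturing group keeps the tags in the resulting array.
  return html.split(/(<[^>]+>)/).map((chunk) => {
    if (chunk.startsWith('<') && chunk.endsWith('>')) {
      const tag = /^<(\/?)([a-zA-Z][a-zA-Z0-9]*)/.exec(chunk);
      if (tag && SKIPPED.has(tag[2].toLowerCase())) {
        depth = Math.max(0, depth + (tag[1] ? -1 : 1));
      }
      return chunk; // tags and comments are never modified
    }
    if (depth > 0) return chunk; // text inside code, pre, etc.
    return RULES.reduce(
      (out, re) => out.replace(re, '$1' + spacer + '$2'),
      chunk
    );
  }).join('');
}

console.log(addSpacersToHtml('<p>关于GBA的<code>文章A字</code>文章</p>'));
// → <p>关于<hl></hl>GBA<hl></hl>的<code>文章A字</code>文章</p>
```

This sketch does not handle boundaries that straddle a tag (for example, a Chinese character followed by `<strong>` and then Latin text), and a regex split is not a real HTML parser, so a production version would probably want a proper DOM library.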

But overall, this is very good progress and I really appreciate there's a new article available in Chinese. I'll try to find the causes of the regex problems meanwhile. Thanks.

You are welcome. Please let me know if there's anything I can do to help.

P.S. I found some more characters with spaces after them to be trimmed off. The whole list is: ，。！？：

flipacholas commented 1 year ago

Sounds good! By the way, don't forget to sign your name or username here so I can credit you for the translation.

Cerallin commented 1 year ago

P.S. I found some more characters with spaces after them to be trimmed off. The whole list is: ，。！？：

Oops, ； and … are also in the list.

And, I've finished my spaces checking on Crowdin. ✌️

flipacholas commented 1 year ago

Thanks! In my case, I've been trying to learn more about how to improve the styling and layout for Chinese-speaking audiences (using the Simplified Chinese script, in this case). I've recently changed the following (only visible in the Chinese articles):

From your perspective, do you think they improve the reading experience for Chinese readers?

Cerallin commented 1 year ago

Wow! They do help a lot! The font families cover the default fonts of most devices. It looks pretty good with text justified and indented.

flipacholas commented 1 year ago

Glad it helped! I think it will take me some time to get the regex rules to properly parse Latin text. However, I'm glad that I can improve the reading experience through CSS as well.