MaruTama / Mengshen-pinyin-font

An open-source Pinyin font and the tools used to create it.
SIL Open Font License 1.1

Implement the homograph (heteronyms) #1

Open MaruTama opened 5 years ago

MaruTama commented 5 years ago

"行" is normally read xíng, but in "银行" it is read háng (yín háng). Since ligatures are not registered in this font, "银行" is displayed as yín xíng.

NightFurySL2001 commented 4 years ago

Is it possible to first generate the different pinyin glyphs under different glyph names? E.g. making zhang and chang as separate files for 长 (長). This can speed up the build, as the glyphs will then be available for swapping with minimal further changes. There is also contextual swapping, i.e. swapping one glyph for another depending on the context, but it will be as hard to implement as ligatures.

Also, ligatures require building a double-word glyph for each pair of words, e.g. 行啊, 行了, 银行, 行长 (which itself has 2 pronunciations: hang zhang, head of a bank; hang chang, line length) etc., which can dramatically increase file size.

MaruTama commented 4 years ago

The implementation we are considering is as follows: Predefine a standard pinyin for each character. Store the polysyllabic pattern in lookup table. If the pattern matches, replace it with another glyph.
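A minimal Python sketch of that plan, simulating outside the font what the lookup is meant to do. The names `STANDARD`, `PATTERNS` and `annotate` are hypothetical, and the table entries are only the examples from this thread, not the project's real data.

```python
# Default (standard) reading for each character; illustrative entries only.
STANDARD = {"银": "yín", "行": "xíng", "长": "cháng", "城": "chéng", "道": "dào"}

# Multi-character patterns whose readings override the defaults (assumed table).
PATTERNS = {
    "银行": ["yín", "háng"],
    "道行": ["dào", "héng"],
}


def annotate(text):
    """Return one reading per character, preferring the longest matching pattern."""
    longest = max(map(len, PATTERNS))
    readings = []
    i = 0
    while i < len(text):
        for size in range(longest, 1, -1):
            chunk = text[i:i + size]
            if chunk in PATTERNS:
                readings.extend(PATTERNS[chunk])
                i += size
                break
        else:
            # No pattern starts here: fall back to the standard reading.
            readings.append(STANDARD.get(text[i], "?"))
            i += 1
    return readings
```

Calling `annotate("银行")` picks the pattern reading for 行, while `annotate("长城")` falls back to the standard readings. In the font itself the same decision would be made by the shaping engine rather than by code like this.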

We think "calt" is the appropriate feature tag. We believe it could also be done with ccmp, salt, or aalt, but they are not suitable.

The reasons are as follows:

  1. It can use "Chaining contextual substitution (GSUB LookupType 6)".

  2. The purpose is not a ligature, but a context-dependent character substitution. Refer to Syntax for OpenType features in CSS

    This feature, in specified situations, replaces default glyphs with alternate forms which provide better joining behavior. Like ligatures (though not strictly a ligature feature), contextual alternates are commonly used to harmonize the shapes of glyphs with the surrounding context.

  3. Many environments are expected to support it. (salt is not supported; aalt requires the user to select the replacement character.) Refer to calt:

    UI suggestion: This feature should be active by default.

  4. Does not affect other GSUB feature

    Feature interaction: This feature may be used in combination with other substitution (GSUB) features, whose results it may override.

(5. Chinese isn't an ideographic script, so I don't have to worry about the following.)

Script/language sensitivity: Not applicable to ideographic scripts.

Implementation example

Statement of Expectations

行啊 => xíng a
★银行 => yín háng
★道行 => dào héng
长城 => cháng chéng
★行长 => xíng zhǎng (If you want the user to choose between "xíng zhǎng" or "háng cháng", I assume you can use aalt.)
☆了得 => liǎo de

Standard Pinyin

行 => xíng
长 => cháng
了 => le
得 => dé


# Glyph names are written as the characters themselves for readability;
# a compilable feature file needs real glyph names (e.g. uni884C for 行).

# ★ Substitution patterns for the alternate pinyin of "行".
lookup CNTXT_884C {
    substitute 银 行' by 行.ha2ng;
    substitute 道 行' by 行.he2ng;
} CNTXT_884C;
lookup CNTXT_957F {
    substitute 行 长' by 长.zha3ng;
} CNTXT_957F;

# ☆ Substitution pattern for an idiom: both characters change.
lookup CNTXT_4E86_5F97 {
    substitute 了' 得 by 了.lia3o;
    substitute 了.lia3o 得' by 得.de;
} CNTXT_4E86_5F97;

# Each lookup above already carries its own context,
# so the feature only needs to reference them.
feature calt {
    lookup CNTXT_884C;
    lookup CNTXT_957F;
    lookup CNTXT_4E86_5F97;
} calt;
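Rules like these would probably be generated from a table rather than written by hand. The following Python sketch shows one way that might look; `RULES`, `glyph_name` and `build_feature` are hypothetical names, the `uniXXXX` naming is an assumption about how the glyphs are named, and the variant suffixes are copied from the example above.

```python
# Hypothetical pattern table: (character before, target character, variant suffix).
RULES = [
    ("银", "行", "ha2ng"),
    ("道", "行", "he2ng"),
    ("行", "长", "zha3ng"),
]


def glyph_name(char):
    """Name a character in the common 'uniXXXX' style (BMP characters only)."""
    return "uni%04X" % ord(char)


def build_feature(rules, tag="calt"):
    """Emit one contextual substitution rule per table entry."""
    lines = ["feature %s {" % tag]
    for before, target, suffix in rules:
        name = glyph_name(target)
        lines.append(
            "    sub %s %s' by %s.%s;" % (glyph_name(before), name, name, suffix)
        )
    lines.append("} %s;" % tag)
    return "\n".join(lines)
```

`print(build_feature(RULES))` writes a feature block with one `sub` line per table row; passing `tag="rclt"` would emit the same rules under a different feature tag.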

Appendix

I found an example on the web that uses OpenType Ruby tags. I think it's an interesting example, like the IVS approach that @NightFurySL2001 mentioned:

Recently there has been a repo that makes bopomofo with newer OpenType technology, BPMF IVS. It utilizes the IVS (Ideographic Variation Selector) in Unicode to switch between different pinyin (e.g. providing 4 glyphs: zháo zhāo zhe zhuó for 着) and also puts the variant glyphs in stylistic sets (SS01-04). You can visit that repo and check how it works. BPMF IVS has a pinyin standard in bopomofo, so that all fonts generated with that program can be used interchangeably without losing the tonal marks (if using IVD). Is it possible to recreate it in this program? (and maybe use the same pinyin standard, which would make conversion between bopomofo and hanyu pinyin a ton easier by just changing fonts)

NightFurySL2001 commented 4 years ago

(5. Chinese isn't an ideographic script, so I don't have to worry about the following.)

Chinese is an ideographic script. Some programs may not support calt substitution of CJK characters on purpose.

Sadly, making a pinyin font that swaps out heteronyms using OpenType would be a bit far-fetched, as there are cases where even the same two-character pair produces different pinyin:

[Images: three example word pairs, not preserved]

This may require exhaustively listing all possible combinations of two, three, or even four words, which may lengthen software processing time.

This should really be done in external software, with the result then copied and pasted where it is needed. Some basic processing could still be done using OpenType, but you will have to limit how much the font handles before external intervention is required. An example range could be all the heteronyms in HSK, while heteronyms outside HSK would not be replaced and would need manual substitution.

注音符號數位化顯示計畫 (Bopomofo Digital Display Project)

This is actually not related to this project, as it promotes the use of ruby annotation instead of bopomofo in the font file.


Side notes:

The best bet for heteronyms in OpenType is to refer to BPMF IVS, as it puts the bopomofo in the font file itself and can "remember" which pinyin was chosen by using the newer technology of the Ideographic Variation Selector (Japanese: 異体字セレクタ). It also offers Stylistic Sets, but that selection is lost when copying and pasting into other software.

This requires the input text to contain the correct IVS for the pinyin to display correctly, which is impossible for text found online.
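To make the "remembering" concrete, here is a small Python sketch of how a variation selector travels with the text. It only uses the Variation Selectors Supplement range (U+E0100 to U+E01EF); which selector means which reading is decided by the font, so the index used here is purely an assumption, and `with_selector` and `strip_selectors` are hypothetical helper names.

```python
VS_FIRST = 0xE0100  # first code point of the Variation Selectors Supplement block
VS_LAST = 0xE01EF   # last code point of that block


def with_selector(char, index):
    """Append the index-th supplementary variation selector (0-based) to char."""
    code = VS_FIRST + index
    if not VS_FIRST <= code <= VS_LAST:
        raise ValueError("selector index out of range")
    return char + chr(code)


def strip_selectors(text):
    """Drop supplementary variation selectors, leaving the plain characters."""
    return "".join(c for c in text if not VS_FIRST <= ord(c) <= VS_LAST)


# Assumption for illustration: selector index 1 selects the "háng" glyph of 行.
tagged = "银" + with_selector("行", 1)
```

The selector is an ordinary code point in the string, so it survives copy and paste, but text that was never tagged (such as `strip_selectors(tagged)`) carries no reading information at all.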

MaruTama commented 4 years ago

Oh... really. It seems Chinese can't use calt...

Chinese is an ideographic script. Some programs may not support calt substitution of CJK characters on purpose.

I see... I should think about this.

Sadly, making a pinyin font that swaps out heteronyms using OpenType would be a bit far-fetched, as there are cases where even the same two-character pair produces different pinyin:

Thank you so much for your help. I should implement using IVS.

NightFurySL2001 commented 4 years ago

It does look like rclt could be used in place of calt, as it may be used on all scripts, but I think we could try calt anyway to determine whether CJK ideographs are really incompatible in software. Alternatively, ccmp can be used, but it would require a small modification to the OpenType specification. I will send an email to Dr. Ken Lunde, former head engineer of the Source Han series, to ask about using calt with CJK ideographs.

The first priority is to make the glyphs. There is also a limit of 65535 glyphs in an OpenType font, which may be an issue. A subset of SHS may be required to free up more glyph slots for pinyin characters.
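A rough way to check that budget, as a Python sketch. It assumes one glyph per (character, reading) pair plus a fixed number of other glyphs; `glyphs_needed`, `fits` and the sample counts are hypothetical and not taken from the actual font.

```python
MAX_GLYPHS = 65535  # OpenType glyph limit mentioned above


def glyphs_needed(readings_per_char):
    """One glyph per (character, reading) pair."""
    return sum(readings_per_char.values())


def fits(other_glyphs, readings_per_char):
    """True if the other glyphs plus every annotated glyph stay within the limit."""
    return other_glyphs + glyphs_needed(readings_per_char) <= MAX_GLYPHS


# Illustrative counts only: number of readings to draw for each character.
sample = {"行": 3, "长": 2, "了": 2}
```

With the sample table, `glyphs_needed(sample)` is 7, so every extra reading costs exactly one slot. Running the same sum over the full character set would show how large a subset is needed.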

MaruTama commented 4 years ago

Here is an example of homograph support using the calt tag and IVS. I will continue implementing it. However, there is a problem: calt stops working when there are many feature tags, so I plan to replace the calt tag with rclt. (Screenshot: 2020-10-25-19 17 04)

NightFurySL2001 commented 3 years ago

👍

NightFurySL2001 commented 3 years ago

Is it possible to explain the text in English? I don't really understand how you did it XD https://github.com/MaruTama/Mengshen-pinyin-font/blob/master/NOTE.md

What is the source of the homograph dictionary that the lookup uses? It doesn't seem to support homographs for Traditional Chinese (e.g. 乾(gān)淨/乾(qián)坤). Also there's this:

U+7D8F: suī,suí,shuāi,ruí,tuǒ #綏
U+7EE5: suí #绥

They are simplified/traditional, but shouldn't have that much difference... right? The sources I can access give suī,suí only.

MaruTama commented 3 years ago

Is it possible to explain the text in English? I don't really understand how you did it XD https://github.com/MaruTama/Mengshen-pinyin-font/blob/master/NOTE.md

It's okay. I will organize and translate.

They are simplified/traditional, but shouldn't have that much difference... right?

I referred to the following dictionary. Traditional Chinese is not yet supported. There was no Traditional Chinese in the dictionary here.

Sorry.... I'm not very familiar so I have a question. Is Traditional Chinese the same homographs as Simplified Chinese?

The sources I can access give suī,suí only.

I referred to here.

NightFurySL2001 commented 3 years ago

~Is~(Do) Traditional Chinese (have) the same homographs as Simplified Chinese?

Not exactly. Some homographs in Traditional Chinese were separated in Simplified Chinese (e.g. 乾 gān/qián -> 干gān净、乾qián坤), and some Traditional Chinese characters were merged into one homograph in Simplified Chinese (e.g. 乾gān淨、幹gàn部、支干gàn -> 干gān净、干gàn部、支干gàn). 干 is a very suitable example of how Simplified Chinese messed with the pronunciation....
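The merging can be shown with a tiny Python sketch. The table holds only the 乾/幹/干 examples from this comment, and `TRAD_TO_SIMP` and `readings_by_simplified` are hypothetical names rather than anything from the project's dictionary.

```python
# (traditional character, reading) -> simplified character, per the examples above.
TRAD_TO_SIMP = {
    ("乾", "gān"): "干",
    ("乾", "qián"): "乾",
    ("幹", "gàn"): "干",
}


def readings_by_simplified(table):
    """Collect every reading that ends up on each simplified character."""
    result = {}
    for (_trad, reading), simp in table.items():
        result.setdefault(simp, set()).add(reading)
    return result


merged = readings_by_simplified(TRAD_TO_SIMP)
```

Here `merged["干"]` contains both gān and gàn while `merged["乾"]` keeps only qián, which is why a homograph table built for Simplified Chinese cannot simply be reused for Traditional Chinese.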

I referred to here.

Well guess that'll work...

MaruTama commented 3 years ago

干 is a very suitable example of how Simplified Chinese messed with the pronunciation....

I see... 干 is messy because different characters have been merged....