NSoiffer / MathCAT

MathCAT: Math Capable Assistive Technology for generating speech, braille, and navigation.
MIT License
Chinese Translation #192

Closed hjy1210 closed 8 months ago

hjy1210 commented 10 months ago

@NSoiffer, I want to do a language translation for MathCAT. How do I start?

hjyanghj@gmail.com

NSoiffer commented 10 months ago

What language do you want to translate?

There are instructions for translators here. That gives you a sense of what needs to be done.

If that looks doable, let me know what language you want to translate and I will create a branch with an initial translation done via Google and potentially past translations in other systems. That will be maybe 30% good, so it is a start.

Keeping a translation in sync with the English version takes some work, so only ask me to generate the starting files when you are ready to do the work. That minimizes the updates. I will usually generate a translation in 1-2 days. I have to run some Python scripts and do some sanity checks -- Google Translate sometimes does some weird things that cause the scripts to do bad things. Eventually, the translation start and sanity checks will get automated, but I'm not there yet.

hjy1210 commented 10 months ago

I want to try language zh-TW.

I have skimmed the Translator and Rule Developer Guide.

To get started, I need your branch with an initial translation done via Google and potentially past translations in other systems.

Thanks

NSoiffer commented 10 months ago

I would very much appreciate the translation. Google has two options for Chinese: traditional and simplified. Which one is more appropriate for zh-TW?

MathCAT is capable of having language variants. If Taiwanese Chinese is similar to some of the other variants of Chinese and you know some variants, it is possible to create a "zh" version and then subdirectories "zh-xx" where "xx" is the country/region code. In those subdirectories, only the differences need to be written.

My apologies that I don't know more about language variations and also apologies if my reply is culturally offensive -- I'm not well-versed in international politics.

hjy1210 commented 10 months ago

Traditional Chinese is more appropriate for zh-TW.

I am not very familiar with Simplified Chinese or the terminology used in China (although it is very similar).

Thanks

NSoiffer commented 10 months ago

I've started working on the translation. I need to do some more work on the scripts: the phrases I use for context to improve the Google translation are causing problems for Chinese because (I think) the languages are so different that Google isn't finding the word in the phrase.

I also realized I have no idea whether the MathPlayer "zh" translation is zh-TW or zh-CN. I suspect it is zh-CN because the high school intern who did the translation for MathPlayer had help from her parents, and they were from mainland China. Also, the Google translations look different. Can you tell from this sample, which uses the MathPlayer translations, which version of Chinese MathPlayer uses?

 - "∏": [t: "积"]                                    #  0x220f   (en: 'product', google: '產品。')
 - "∐": [t: "公同积"]                                  #  0x2210   (en: 'co-product', google: '副產品。')
 - "∑": [t: "和"]                                    #  0x2211   (en: 'sum', google: '和。')
 - "−": [t: "减"]                                    #  0x2212   (en: 'minus', google: '減。')
 - "∓": [t: "加减"]                                   #  0x2213   (en: 'minus or plus', google: '負或加號。')
 - "∗": [t: "星号运算"]                                 #  0x2217   (en: 'times', google: '時代。')
 - "∘": [t: "结合"]                                   #  0x2218   (en: 'composed with', google: '由。')

FYI: anything after "#" is a comment.

hjy1210 commented 10 months ago

About the sample you gave: the left side (- "−": [t: "减"]) is in zh-CN, and the right side (en: 'minus', google: '減。') is in zh-TW.

In general, the left side uses mathematical terms in zh-CN; the Google part on the right side uses zh-TW, and some of its terms are not mathematically correct.

I can read zh-CN text, so it is OK if you give me characters or terms in zh-CN. I can easily translate them to zh-TW, especially since I can use the en part on the right side as a reference.

Thanks

NSoiffer commented 10 months ago

I've committed the initial files to the "zh" branch. You'll find them in Rules/Languages/zh.

From the little testing I did, they generate Chinese. I think there is a problem with the definitions.yaml values as an NVDA error happens when I look at a simple numerical fraction.

From my memory of the MathPlayer translation, Chinese speaks fractions something like "b under a" (i.e., the denominator is spoken first). If that is true, then you want to find all the places that have tag: fraction and tag: mfrac in the zh directory and change the part that looks like:

  - x: "*[1]"
  - t: "超過"      # phrase(the fraction 3 'over' 4)
  - x: "*[2]"

to say the second child first. In this example, you want it to be

  - x: "*[2]"
  - t: "超過"      # phrase(the fraction 3 'over' 4)
  - x: "*[1]"

Let me know if you run into problems.

NSoiffer commented 10 months ago

@hjy1210: Regarding fractions and other notations, English has a number of special ways of speaking particular forms of fractions. If those are not relevant in Chinese, comment those rules out. If Chinese has some specialized ways of speaking a notation, you should add rules for them. If you need help in writing a rule, let me know.

hjy1210 commented 10 months ago

@NSoiffer: I have made a small step, and I encountered an issue.

I translated some of the fraction rules in zh\ClearSpeak_Rules.yaml; see my fork of MathCAT.

I then used Firefox to browse the web page, which has three fractions in it.

If the user starts NVDA+MathCAT with language en, the page is read correctly in English. After switching MathCAT to zh, the page is read in Chinese as expected (I translated only part of the fraction rules). Everything seems OK.

But if the user starts NVDA+MathCAT with language zh, the page CANNOT be read: it has three fractions, and only the first two are read. After that, even after switching MathCAT to en, still only the first two fractions are read, in English.

I think it IS an issue.

Note:

  1. start NVDA+MathCat(en), got some log message:

    WARNING - mathPres.initialize (17:31:20.354) - MainThread (15596):
    MathPlayer 4 not available

    WARNING - synthDrivers.oneCore._OcSsmlConverter.convertLangChangeCommand (17:31:26.849) - MainThread (15596):
    Language en_US not supported ({'zh_tw'})

  2. start NVDA+MathCat(zh), browse to that page, got some log message:

    WARNING - mathPres.initialize (17:40:35.728) - MainThread (3836):
    MathPlayer 4 not available

    ERROR - scriptHandler.executeScript (17:40:59.708) - MainThread (3836):
    error executing script: <bound method CursorManager.script_moveByLine_forward of <virtualBuffers.gecko_ia2.Gecko_ia2 object at 0x0F827410>> with gesture 'down arrow'
    Traceback (most recent call last):
      File "scriptHandler.pyc", line 295, in executeScript
      File "cursorManager.pyc", line 274, in script_moveByLine_forward
      File "cursorManager.pyc", line 171, in _caretMovementScriptHelper
      File "speech\speech.pyc", line 1231, in speakTextInfo
      File "speech\types.pyc", line 42, in __iter__
      File "speech\speech.pyc", line 1538, in getTextInfoSpeech
      File "speech\speech.pyc", line 1203, in _extendSpeechSequence_addMathForTextInfo
      File "C:\Users\hjy\AppData\Roaming\nvda\addons\MathCAT\globalPlugins\MathCAT\MathCAT.py", line 367, in getSpeechForMathMl
        return ConvertSSMLTextForNVDA(libmathcat.GetSpokenText(), self._language)
    pyo3_runtime.PanicException: index out of bounds: the len is 1 but the index is 5

    ERROR - scriptHandler.executeScript (17:41:05.292) - MainThread (3836):
    error executing script: <bound method CursorManager.script_moveByLine_forward of <virtualBuffers.gecko_ia2.Gecko_ia2 object at 0x0F827410>> with gesture 'down arrow'
    Traceback (most recent call last):
      File "scriptHandler.pyc", line 295, in executeScript
      File "cursorManager.pyc", line 274, in script_moveByLine_forward
      File "cursorManager.pyc", line 171, in _caretMovementScriptHelper
      File "speech\speech.pyc", line 1231, in speakTextInfo
      File "speech\types.pyc", line 42, in __iter__
      File "speech\speech.pyc", line 1538, in getTextInfoSpeech
      File "speech\speech.pyc", line 1203, in _extendSpeechSequence_addMathForTextInfo
      File "C:\Users\hjy\AppData\Roaming\nvda\addons\MathCAT\globalPlugins\MathCAT\MathCAT.py", line 367, in getSpeechForMathMl
        return ConvertSSMLTextForNVDA(libmathcat.GetSpokenText(), self._language)
    pyo3_runtime.PanicException: index out of bounds: the len is 1 but the index is 5

    ....

    WARNING - gui.MainFrame._popupSettingsDialog (17:47:14.247) - MainThread (3836):
    _popupSettingsDialog is deprecated, use popupSettingsDialog instead.
    Stack trace:
      File "nvda.pyw", line 399, in <module>
      File "core.pyc", line 814, in main
      File "wx\core.pyc", line 2237, in MainLoop
      File "gui\__init__.pyc", line 621, in onActivate
      File "C:\Users\hjy\AppData\Roaming\nvda\addons\MathCAT\globalPlugins\MathCAT\__init__.py", line 39, in on_settings
        mainFrame._popupSettingsDialog(UserInterface)
      File "gui\__init__.pyc", line 229, in _popupSettingsDialog

    ERROR - scriptHandler.executeScript (17:47:46.748) - MainThread (3836):
    error executing script: <bound method CursorManager.script_moveByLine_forward of <virtualBuffers.gecko_ia2.Gecko_ia2 object at 0x0F827410>> with gesture 'down arrow'
    Traceback (most recent call last):
      File "scriptHandler.pyc", line 295, in executeScript
      File "cursorManager.pyc", line 274, in script_moveByLine_forward
      File "cursorManager.pyc", line 171, in _caretMovementScriptHelper
      File "speech\speech.pyc", line 1231, in speakTextInfo
      File "speech\types.pyc", line 42, in __iter__
      File "speech\speech.pyc", line 1538, in getTextInfoSpeech
      File "speech\speech.pyc", line 1203, in _extendSpeechSequence_addMathForTextInfo
      File "C:\Users\hjy\AppData\Roaming\nvda\addons\MathCAT\globalPlugins\MathCAT\MathCAT.py", line 367, in getSpeechForMathMl
        return ConvertSSMLTextForNVDA(libmathcat.GetSpokenText(), self._language)
    pyo3_runtime.PanicException: index out of bounds: the len is 1 but the index is 5

NSoiffer commented 10 months ago

When I run things through Google, I do so in a batch and use quotes, commas, etc., to separate the items to translate. The Chinese translation translated the punctuation. I had noticed it earlier and fixed it, but apparently I overwrote definitions.yaml with the bad version at some point. I have committed the fixed file. Grab that and give it another try. Let me know if you still have the problem.

Make sure you check the translations in definitions.yaml -- that is where the ordinal numbers used in the fractions come from. In a test I just did with Google Translate, the fractions come out wrong because the rule uses a built-in function, ToCommonFraction, which speaks the numerator followed by the denominator. Look for that in ClearSpeak_Rules.yaml and SimpleSpeak_Rules.yaml and change those to something like

  replace:
  - x: "*[2]"
  - t: "分之"
  - x: "*[1]"

...at least that's what I think it should be, based on a web page I just read about how numeric fractions are spoken in Chinese.
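To illustrate the word order that this replacement produces, here is a small Python sketch that is separate from MathCAT itself. The function name `speak_numeric_fraction` is hypothetical. It only shows that the denominator (the second child, `*[2]`) comes before the numerator (`*[1]`), joined by "分之"; it does not attempt to convert digits into Chinese number words.

```python
def speak_numeric_fraction(numerator: str, denominator: str) -> str:
    """Return the spoken form with the denominator first.

    Hypothetical helper for illustration only; it is not part of MathCAT.
    """
    # Same order as the rule: child 2 (denominator), the text "分之",
    # then child 1 (numerator).
    return f"{denominator}分之{numerator}"


# For the fraction 3/4, the denominator "4" is spoken first.
print(speak_numeric_fraction("3", "4"))
```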

Also, from your logs, it seems like the MathPlayer addon is active. You should deactivate it because only one math translator should be active. I believe that is why you are getting:

    WARNING - mathPres.initialize (17:40:35.728) - MainThread (3836):
    MathPlayer 4 not available

hjy1210 commented 10 months ago

@NSoiffer

  1. As for the warning message

    WARNING - mathPres.initialize (13:52:06.134) - MainThread (11816):
    MathPlayer 4 not available

    I ignored it, because it is still there even after I reinstall NVDA without any add-ons.

  2. After changing all '、' characters to ',' in definitions.yaml, the issue I mentioned before

    If the user starts NVDA+MathCAT with language en, the page is read correctly in English. After switching MathCAT to zh, the page is read in Chinese as expected (I translated only part of the fraction rules).

    But if the user starts NVDA+MathCAT with language zh, the page CANNOT be read: it has three fractions, and only the first two are read. After that, even after switching MathCAT to en, still only the first two fractions are read, in English.

    is gone. Everything is OK now.
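
The substitution described above could be scripted roughly as follows. This is only a sketch: the function name `restore_ascii_commas` is hypothetical, the sample string is made up, and the function works on a string, so reading and writing definitions.yaml is left to the caller.

```python
def restore_ascii_commas(text: str) -> str:
    """Replace the ideographic comma '、' (U+3001) with an ASCII comma.

    Hypothetical helper mirroring the manual fix; it makes no assumption
    about the rest of the file's contents.
    """
    return text.replace("\u3001", ",")


# Made-up sample; real input would be the text of the translated file.
sample = "a\u3001b\u3001c"
print(restore_ascii_commas(sample))
```

A blanket replacement like this would also change any '、' that is meant to be spoken, so the result should be reviewed by hand.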

hjy1210 commented 10 months ago

@NSoiffer

I understand that math content should be read out in different ways for different kinds of disabilities (blindness, learning disabilities, low vision) and for different contexts.

I appreciate that MathCAT wants to serve all of these disabilities.

I know user can choose from MathCat setting about

But in Rules,

In summary, in order to translate as appropriately as possible, I need to fully understand the context in which each rule is applied.

I can spend time to study yaml and rust, but I DO need some guide, some short tutorials about yaml and rust.

NSoiffer commented 10 months ago

The variable that is set by "Speech for:..." is $Impairment. I haven't figured out what I should do differently for low vision vs learning disability, so all of the tests in the code are $Impairment = 'Blindness' (or ...!=...). I haven't done a good job with testing that in the code. I suspect I haven't had complaints yet because so far the users are blind. But I know that one AT company that caters to users with learning disabilities is working on incorporating MathCAT into their product, so I really should fix it.

Typically, if someone can see the math expression, they don't want to hear start end words such as "fraction ... end fraction". I have created #194 so I remember to do this.

The variables that are $ClearSpeak_XXX$ are ClearSpeak options that users can set. I suggest using SimpleSpeak as your starting point for how to speak things in Chinese -- I think ClearSpeak is more English-oriented. As I said, you shouldn't necessarily follow what is done for English -- make the Chinese version be what is good for Chinese users. You can name the file anything you want in Chinese -- the MathCAT dialog will pick up anything that ends in _Rules.yaml. If you delete ClearSpeak_Rules.yaml or rename it to something else (e.g., ClearSpeak_Rules.yaml.ignore), it won't show up in the dialog. Or if you decide you want to implement three different ways of speaking in Chinese, as long as the file names end in _Rules.yaml, users will be able to select them. Using - include: file name allows you to share common rules between different styles.
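
The naming convention can be pictured with a few lines of Python. This is not how the MathCAT dialog is implemented; `selectable_style_files` is a hypothetical name, and the sketch only encodes the rule stated above: a file is offered as a speech style when its name ends in _Rules.yaml.

```python
def selectable_style_files(file_names: list[str]) -> list[str]:
    """Keep only the names that end in "_Rules.yaml".

    Hypothetical illustration of the naming convention, not MathCAT code.
    """
    return [name for name in file_names if name.endswith("_Rules.yaml")]


names = [
    "ClearSpeak_Rules.yaml.ignore",  # renamed, so it is skipped
    "SimpleSpeak_Rules.yaml",
    "definitions.yaml",
]
print(selectable_style_files(names))
```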

I can not locate definition of ToCommonFraction appeared at common-fraction rule in ClearSpeak_Rules.yaml

That is a function I defined in Rust code (src/xpath_functions.rs). It was a convenient way to get numerical fractions to speak, but it doesn't work for Chinese, so that is why I suggested the replacement mentioned above.

I can not understand the meaning of *[1][self::m:mn][not(contains(., '.')

This is an xpath (1.0) expression. There are many tutorials on xpath, so I didn't spend too much time in my documentation explaining it. Here's an explanation for that expression:

Putting all the pieces together, the expression says "match if the first child is a mn that doesn't have a decimal point" (i.e., it is an integer). If Chinese uses , as a decimal separator, then you would change the . to ,.

xpath is pretty error tolerant, so if there are no children, the expression just fails to match rather than giving an error that there is no first child.
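
For readers more comfortable with code than with xpath, the same test can be spelled out in plain Python. This is a sketch under stated assumptions: it uses the standard library's xml.etree.ElementTree rather than an xpath engine, it assumes the m: prefix refers to the MathML namespace, and the function name `first_child_is_integer_mn` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: the "m:" prefix in the rules stands for the MathML namespace.
MATHML_NS = "http://www.w3.org/1998/Math/MathML"


def first_child_is_integer_mn(node: ET.Element) -> bool:
    """True if the first child element is an mn whose text has no '.'."""
    children = list(node)
    if not children:
        # Like the xpath, a missing first child is "no match", not an error.
        return False
    first = children[0]
    is_mn = first.tag == f"{{{MATHML_NS}}}mn"
    # "." in xpath is the node's string value: all of its text joined together.
    text = "".join(first.itertext())
    return is_mn and "." not in text


frac = ET.fromstring(f'<mfrac xmlns="{MATHML_NS}"><mn>3</mn><mn>4</mn></mfrac>')
print(first_child_is_integer_mn(frac))
```

For a locale that writes decimals with a comma, the `"."` in the last line of the function would become `","`, matching the change described above.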

I can spend time to study yaml and rust, but I DO need some guide, some short tutorials about yaml and rust.

Unless you want to help with actual coding (not translation), I don't think you need to learn Rust. Learning about YAML and xpath though is helpful. I think YAML is a pretty straightforward format, but xpath (what is used for tests in the rules and x:) can be confusing to learn. Hopefully existing examples of use help you figure things out. It is a very powerful pattern matching language.

Hopefully the above helps some.

hjy1210 commented 10 months ago

@NSoiffer Thanks a lot, your response is very clear and helpful. I think I can make some progress now.

The Translator and Rule Developer Guide never seems to mention that, in order to run the test cargo test fr, we need to put the entry "mod fr;" in languages.rs at the same level as "mod en;".

The value of $ClearSpeak_Fractions is defined in prefs.yaml, but I don't know how users can change it if they wish.

One consideration: if users can change settings easily, they can switch between different styles.

NSoiffer commented 10 months ago

Thanks for the feedback on the documentation. I have fixed it.

Currently, the NVDA dialog lists only the more common preferences someone might change. Users can edit (with a standard text editor) %AppData%\MathCAT\prefs.yaml. That contains the full list of options. The MathCAT dialog reads and writes that file. That's how they can change ClearSpeak_Fractions.

The reason to have multiple styles is so that users can choose the one they like the most. My plans are to add a "MathSpeak" style. MathSpeak is based on Nemeth code and is a 1-1 way of speaking that matches the braille. It is not very natural though (e.g., "x superscript 2 baseline plus 1"). I also plan to add a "LiteralSpeak" style that doesn't do any semantic inference. It would say "vertical bar x vertical bar" instead of "absolute value of x". I probably would add a few options so that you could hear "x squared" rather than "x superscript 2" but still hear "vertical bar x vertical bar".

For your translation, I recommend getting one style to work well and then add more styles later on if you have the energy to do that.

zh-yx commented 10 months ago

Hi all!

I am a translator for Simplified Chinese language.

I think the translation of the Chinese Mainland region should be placed in the "zh/cn" subdirectory, and the corresponding Taiwan region can be placed in the "zh/tw" subdirectory.

To my knowledge, generally speaking, the two are similar in structure and have some differences in specific wording. Therefore, we can separate them to provide users with easier to understand translations.

NSoiffer commented 10 months ago

Ideally, there should be one shared set of rules for what is common and then the differences placed in the cn and tw subdirs of zh. But from the little that I read, none of the Unicode files can be shared because the code points for Traditional and Simplified are different. I suspect the rule files are very similar, but they too will contain Unicode characters (for "fraction", etc.), so I suspect that two separate sub-directories will be needed. One possibility though is to check the language for the cases where literal text is used. For example (using the initial translation I sent), the shared rule for simple/fraction would be

- name: simple
  # don't include nested fractions. E.g, "fraction a plus b over c + 1 end fraction" is ambiguous
  # by simplistic SimpleSpeak's rules "b over c" is a fraction, but if we say nested fractions
  # are never simple, then any 'over' applies only to enclosing "fraction...end fraction" pair.
  tag: fraction
  match:
  - "(IsNode(*[1],'leaf') and IsNode(*[2],'leaf')) and"
  - "not(ancestor::*[name() != 'mrow'][1]/self::m:fraction)" # FIX: can't test for mrow -- what should be used???
  replace:
  - x: "*[1]"
  - test:
        if: "$Language='zh-tw'"
        then: [t: "超過"]     # phrase(the fraction 3 'over' 4)
        else: [t: "超过"]      # phrase(the fraction 3 'over' 4)
  - x: "*[2]"
  - pause: short

It is up to the two of you if you would like to share the file via this approach or keep the files separate. Even if you take the second approach, I'm sure both of you would benefit from comparing what each of you has done to modify the rules to be better for Chinese. I'm 100% certain that the differences in how math should be spoken between Traditional and Simplified Chinese are much, much smaller than the differences between those languages and English.

I'm pretty sure MathCAT handles the sub-directories correctly. However, I do need to modify the NVDA MathCAT dialog code to handle it.

Once a translation PR is made, I'll update the directories.

hjy1210 commented 9 months ago

@NSoiffer My opinion: keeping the files separate is simpler.

hjy1210 commented 9 months ago

@NSoiffer Have you inspected my pull request #216? Is it acceptable?

NSoiffer commented 9 months ago

My apologies. I was away for a week 1.5 weeks ago and have been playing catch up ever since because people pointed out some MathCAT issues at the conference I was at.

Also, I'm working with someone on fixing the dialog so country codes are supported. That way zh-cn and zh-tw are both handled appropriately. However, I should still review your code in case there are things that need to change. I apologize for taking so long and very much appreciate your efforts. I'll try to look at it on Friday. Today is the Thanksgiving holiday and I don't have much time to spend on code today.