Gustaf-C / anki-chinese-support-3

Anki add-on providing support for Chinese study
https://ankiweb.net/shared/info/1752008591
GNU General Public License v3.0
25 stars 7 forks source link

Clicking the "Color" field in the card editor causes the "Pinyin" field's coloring to incorrectly change #55

Open Jacobinski opened 5 months ago

Jacobinski commented 5 months ago

Describe the bug

For certain words, such as 宾馆 and 可能, the extension will generate a correctly colored set of pinyin initially, but if the user clicks onto the "Color" field in the card (maybe other fields reproduce this too?), the pinyin coloring will incorrectly change. Please find two demonstration videos attached below.

https://github.com/Gustaf-C/anki-chinese-support-3/assets/14207293/7c20bde3-d591-4abd-ab51-7cd42cc4d04d

https://github.com/Gustaf-C/anki-chinese-support-3/assets/14207293/81cf701b-99c7-4306-9446-e25a28a609ce

Reproduction

  1. Install the Chinese Support 3 add-on and disable all other add-ons. Perform the standard setup procedures from the README document in this GitHub repo to add a Chinese (Basic) card-type to your Anki.
  2. Create a new Chinese (Basic) card, enable Chinese Support V3, then copy-and-paste 可能 into the Hanzi field. It should auto-populate the other card fields.
  3. Expand the Pinyin HTML and verify that it is <span class="tone3">kě</span><span class="tone3">yǐ</span> <!-- ke yi -->
  4. While having the Pinyin HTML expanded, click on the Color field and observe that the Pinyin HTML incorrectly changes to <span class="tone3">kěn</span><span class="tone2">éng</span> <!-- ken eng -->

Expected behavior

The Pinyin does not change into an incorrect version of itself.

Specs

Jacobinski commented 5 months ago

There is an interesting unit test, called test_issue_78, in the code which seems to be checking for something similar to my issue. Here is the bug-report associated with the test: https://github.com/luoliyan/chinese-support-redux/issues/78.

https://github.com/Gustaf-C/anki-chinese-support-3/blob/42af60260d313e7c3c5db2ebabff7495e3e9ed67/tests/test_behavior.py#L37-L48

It's not clear to me that this ever was "correctly" fixed. Running a Git Blame on the test shows that it was added in https://github.com/luoliyan/chinese-support-redux/commit/7a88d51a1ff663a01e05244bead2b7eabd0b670d which seems to be a hack to workaround the issue.

Jacobinski commented 5 months ago

This issue can be reproduced by adding the following new unit test to the class FormatPinyin(Base) in test_behavior.py.

    def test_issue_55(self):
        note = {'Hanzi': '可能', 'Pinyin': 'kěnéng'}
        expected = (
            '<span class="tone3">kě</span>'
            '<span class="tone2">néng</span> '
            '<!-- ke neng -->'
        )
        reformat_transcript(note, 'pinyin', 'pinyin')
        self.assertEqual(note['Pinyin'], expected)
        reformat_transcript(note, 'pinyin', 'pinyin')
        self.assertEqual(note['Pinyin'], expected)

When tested using $ make test, the test fails with error:

======================================================== FAILURES =========================================================
_______________________________________________ FormatPinyin.test_issue_55 ________________________________________________

self = <tests.test_behavior.FormatPinyin testMethod=test_issue_55>

    def test_issue_55(self):
        note = {'Hanzi': '可能', 'Pinyin': 'kěnéng'}
        expected = (
            '<span class="tone3">kě</span>'
            '<span class="tone2">néng</span> '
            '<!-- ke neng -->'
        )
        reformat_transcript(note, 'pinyin', 'pinyin')
>       self.assertEqual(note['Pinyin'], expected)
E       AssertionError: '<span class="tone3">kěn</span><span class="tone2">éng</span> <!-- ken eng -->' != '<span class="tone3">kě</span><span class="tone2">néng</span> <!-- ke neng -->'
E       - <span class="tone3">kěn</span><span class="tone2">éng</span> <!-- ken eng -->
E       ?                       -                                              -
E       + <span class="tone3">kě</span><span class="tone2">néng</span> <!-- ke neng -->
E       ?                                                  +                  +

tests/test_behavior.py:60: AssertionError
Jacobinski commented 5 months ago

Another discovery:

If you go into test_transcribe.py and add a new test for 可能 in class Transcribe(Base) the test suite passes.

image

However, if you add a test for the same word into class SplitTranscript(Base) then the test suite fails

image
Jacobinski commented 5 months ago

Based on the above comment, if we look at the fact that the transcribe(['可能'], 'pinyin', 'simp') == ['kě néng'] test passes, and the split_transcript(' kěnéng', 'pinyin') == ['kě néng'] test fails, it seems to suggest that the issue lies in split_transcript.

https://github.com/Gustaf-C/anki-chinese-support-3/blob/42af60260d313e7c3c5db2ebabff7495e3e9ed67/chinese/transcribe.py#L200-L242

It seems like what this function does it attempt to use regular expressions to split the pinyin word kěnéng into multiple pieces based on initials and finals. Unfortunately, kěn éng and kě néng are both valid splits of the word, so it's up to luck if the correct split is chosen for any given word.

Perhaps we can avoid attempting to split the pinyin based on a regular expression and instead always rely on the Hanzi field as a source of truth?

Jacobinski commented 5 months ago

This is actually kind of tricky to work around since we cannot always rely on the hanzi field as a source of truth. In many cases, there are multiple valid pinyins for a given hanzi (ex. 大都 has two translations: dàdōu means "for the most part"; dàdū: "metropolis") and if the extension chooses the wrong pinyin, the user should be able to overwrite it without the hanzi acting a source-of-truth. And if the user overwrites it, we can no longer rely on the hanzi field as a source-of-truth.

I think that the simplest way to work-around the issue for now it to treat the hanzi as a source-of-truth unless it is changed by the user. In this way, we will avoid having reformat_transcript() overwrite the translation, avoiding my reported bug. If the user changes the pinyin, the new value can act as a source-of-truth and we can run reformat_transcript() against it. This will still have issues for pinyin with multiple valid regular expression splits, but hopefully it will avoid the number of occurrences of this bug. Additionally, users should be able to manually put spaces in their pinyins to avoid this issue.

Please find my implementation of this fix in https://github.com/Gustaf-C/anki-chinese-support-3/pull/56