Avoid running `reformat_transcript()` on original pinyin

Jacobinski commented 5 months ago

Before this change, there are two different ways that the code interacts with pinyin in a note:

If the pinyin field is empty, it uses the hanzi field as a source-of-truth to generate the pinyin field.
If the pinyin field is non-empty, it takes the contents of pinyin and runs reformat_transcript() on it. The idea of this function is that it will update (split, colorize, etc) the pinyin field with new information (that the user provides). This function does the splitting using a regular-expression.

We have observed a bug in this logic that occurs for some words, such as 可能 which initially see the correct pinyin populated ("kě néng") but have this pinyin incorrectly change ("kěn éng") as a result of running the reformat_transcript() function on them. This bug can occur for any pinyin in which there are multiple acceptable regular expression splits.

Bug report: https://github.com/Gustaf-C/anki-chinese-support-3/issues/55

This commit attempts to put a bandaid over the issue by avoiding repopulating the pinyin field for words if the user did not change the original hanzi transcription. Unit tests and documentation are also included in the commit.

Gustaf-C commented 5 months ago

Thanks! I appreciate the detailed bug report as well :)

I won't have access to my computer for a few weeks, but I'll take a look at this as soon as I can.

Jacobinski commented 5 months ago

No rush. Thanks :-)

kieranlblack commented 4 months ago

Thanks for the PR @Jacobinski! Some thoughts below:

Ideally reformat_transcript would be idempotent but as you pointed out it currently isn't.

You mentioned in your bug report here https://github.com/Gustaf-C/anki-chinese-support-3/issues/55#issuecomment-1892969852 how split_transcript is fundamentally broken since it still splits greedily. You also pointed out that by separating pinyin on a space users can actually work around the issue with modified pinyin still being split incorrectly.

From this it feels to me like maybe a better solution is to find the few primitive methods which are mashing the pinyin together and have them leave it space separated instead. Nothing should break too horribly from that change (@Gustaf-C maybe you have a better idea of the implications though)? But this way we don't have to worry about not calling reformat_transcript since it will properly handle the space-separated pinyin in any case, feels a little more robust.

This would also take care of the issue you noted here: https://github.com/Gustaf-C/anki-chinese-support-3/blob/1bb94ded55dfcd112649ec6bc6cffcdb8fa96720/tests/test_behavior.py#L51-L56

What are your thoughts?

Gustaf-C commented 4 months ago

So I have finally had some time to take a proper look at this, and it seems fine. I appreciate the included tests!

I agree with Kieran that there are probably better ways to solve this issue. There are still many parts of the code that I am quite unfamiliar with, so I can't really speak to the implications, but I also feel that nothing should break horribly from the changes suggested above. However I have also thought that about a few small changes that turned out to be anything but small once I actually tried to implement them.

That said this is an improvement, and at the moment I am all for making things a little better a bit faster at the expense of maybe having to redo them later. I'll leave this open for a little while, but if nothing changes I'll merge it in as-is before I push the next release.

Jacobinski commented 4 months ago

Thanks for the feedback everyone and sorry that I haven't checked up on this issue in a while.

From this it feels to me like maybe a better solution is to find the few primitive methods which are mashing the pinyin together and have them leave it space separated instead. Nothing should break too horribly from that change (@Gustaf-C maybe you have a better idea of the implications though)? But this way we don't have to worry about not calling reformat_transcript since it will properly handle the space-separated pinyin in any case, feels a little more robust.

This would also take care of the issue you noted here:

https://github.com/Gustaf-C/anki-chinese-support-3/blob/1bb94ded55dfcd112649ec6bc6cffcdb8fa96720/tests/test_behavior.py#L51-L56

What are your thoughts?

@kieranlblack, I think that you're right. It would probably be better to leave the pinyin space separated while passing it around in the code (e.g. ke neng vs keneng) so that we don't have to rely on regular expressions to re-split the string. Give me a couple of days to try to implement that alternative approach and I'll post an update here once I'm done.

Gustaf-C / anki-chinese-support-3

Avoid running `reformat_transcript()` on original pinyin #56