Open Jacobinski opened 5 months ago
Thanks! I appreciate the detailed bug report as well :)
I won't have access to my computer for a few weeks, but I'll take a look at this as soon as I can.
No rush. Thanks :-)
Thanks for the PR @Jacobinski! Some thoughts below:
Ideally reformat_transcript
would be idempotent but as you pointed out it currently isn't.
You mentioned in your bug report here https://github.com/Gustaf-C/anki-chinese-support-3/issues/55#issuecomment-1892969852 how split_transcript
is fundamentally broken since it still splits greedily. You also pointed out that by separating pinyin on a space users can actually work around the issue with modified pinyin still being split incorrectly.
From this it feels to me like maybe a better solution is to find the few primitive methods which are mashing the pinyin together and have them leave it space separated instead. Nothing should break too horribly from that change (@Gustaf-C maybe you have a better idea of the implications though)? But this way we don't have to worry about not calling reformat_transcript
since it will properly handle the space-separated pinyin in any case, feels a little more robust.
This would also take care of the issue you noted here: https://github.com/Gustaf-C/anki-chinese-support-3/blob/1bb94ded55dfcd112649ec6bc6cffcdb8fa96720/tests/test_behavior.py#L51-L56
What are your thoughts?
So I have finally had some time to take a proper look at this, and it seems fine. I appreciate the included tests!
I agree with Kieran that there are probably better ways to solve this issue. There are still many parts of the code that I am quite unfamiliar with, so I can't really speak to the implications, but I also feel that nothing should break horribly from the changes suggested above. However I have also thought that about a few small changes that turned out to be anything but small once I actually tried to implement them.
That said this is an improvement, and at the moment I am all for making things a little better a bit faster at the expense of maybe having to redo them later. I'll leave this open for a little while, but if nothing changes I'll merge it in as-is before I push the next release.
Thanks for the feedback everyone and sorry that I haven't checked up on this issue in a while.
From this it feels to me like maybe a better solution is to find the few primitive methods which are mashing the pinyin together and have them leave it space separated instead. Nothing should break too horribly from that change (@Gustaf-C maybe you have a better idea of the implications though)? But this way we don't have to worry about not calling
reformat_transcript
since it will properly handle the space-separated pinyin in any case, feels a little more robust.This would also take care of the issue you noted here:
What are your thoughts?
@kieranlblack, I think that you're right. It would probably be better to leave the pinyin space separated while passing it around in the code (e.g. ke neng
vs keneng
) so that we don't have to rely on regular expressions to re-split the string. Give me a couple of days to try to implement that alternative approach and I'll post an update here once I'm done.
Before this change, there are two different ways that the code interacts with
pinyin
in a note:If the
pinyin
field is empty, it uses thehanzi
field as a source-of-truth to generate thepinyin
field.If the
pinyin
field is non-empty, it takes the contents ofpinyin
and runsreformat_transcript()
on it. The idea of this function is that it will update (split, colorize, etc) thepinyin
field with new information (that the user provides). This function does the splitting using a regular-expression.We have observed a bug in this logic that occurs for some words, such as 可能 which initially see the correct pinyin populated ("kě néng") but have this pinyin incorrectly change ("kěn éng") as a result of running the
reformat_transcript()
function on them. This bug can occur for any pinyin in which there are multiple acceptable regular expression splits.Bug report: https://github.com/Gustaf-C/anki-chinese-support-3/issues/55
This commit attempts to put a bandaid over the issue by avoiding repopulating the pinyin field for words if the user did not change the original hanzi transcription. Unit tests and documentation are also included in the commit.