bbepis / XUnity.AutoTranslator

MIT License

Manual translated Chinese being translated again as Japanese #128

Closed flywhc closed 3 years ago

flywhc commented 4 years ago

AutoTranslator is using GoogleTranslate to translate my manually translated Chinese text into Chinese again. This double translation produces incorrect results, and I don't know how to troubleshoot it.

I have a large translation folder in zh-CN for a Japanese game. The original translation was copied from the HS2 English translation. It includes RedirectedResources, r:, sr: etc., e.g.: sr:"^(?[\u0020-\u024F「【」】]+)(?[^\u0000-\u024F]+)(?[\u0020-\u024F「【」】]*)$" It has more than 6,000 text translation entries and more than 60,000 Japanese name translations.

I translated the English into Chinese, changing [Japanese A]=[English A] to [Japanese A]=[Chinese A]. For example, from プレイヤー衣装選択=Player costume selection to: プレイヤー衣装選択=玩家服装

The game then shows the correct Chinese text ([Chinese A]) for a very short time, after which it becomes weird Chinese ([Chinese B]). I checked _AutoGeneratedTranslations.txt and it contains new entries like: 玩家服装=玩具屋衣服 The format is [Chinese A]=[Chinese B]

That means AutoTranslator thought the [Chinese A] texts in my txt file were Japanese and sent them to Google to be translated into Chinese again. I have to disable online auto translation by setting Endpoint= (empty); then it works fine.

I suspected it was related to 'sr:', so I tried replacing "sr:" with "" in the file, but that did not solve the problem. I guess it is a bug, because text from a txt file should never be auto translated again. The version is 4.12.

gravydevsupreme commented 4 years ago

The source of the problem is redirected resources for sure.

The problem is that these texts are translated before they are loaded by the UI. Then when they are loaded onto the UI, the auto translation functionality kicks in and re-translates it because it thinks it is not translated.
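A minimal Python sketch of that sequence, assuming a heavily simplified model of the plugin. The names `manual`, `auto_generated`, `fake_endpoint`, `redirect_resource` and `on_ui_text` are hypothetical and do not correspond to XUA's actual code:

```python
# Simplified, hypothetical model of the double translation; not XUA's real logic.
manual = {"プレイヤー衣装選択": "玩家服装"}   # [Japanese A]=[Chinese A]
auto_generated = {}                          # stands in for _AutoGeneratedTranslations.txt


def fake_endpoint(text):
    """Stand-in for the online translator; it cannot tell zh input from ja input."""
    return f"<machine translation of {text}>"


def redirect_resource(text):
    """Resource redirection: the asset is translated before the UI ever sees it."""
    return manual.get(text, text)


def on_ui_text(text):
    """UI hook: text that is not a known source text is treated as untranslated."""
    if text in manual:
        return manual[text]
    if text not in auto_generated:
        auto_generated[text] = fake_endpoint(text)
    return auto_generated[text]


# The resource is translated first, then the UI hook sees [Chinese A] as new text.
shown = on_ui_text(redirect_resource("プレイヤー衣装選択"))
```

In this model `auto_generated` ends up with 玩家服装 as a key, which is the [Chinese A]=[Chinese B] pattern reported above.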

I'm not sure what a viable solution to this is when it is very difficult to distinguish between the two languages.

There is a similar problem when a JP name is included in an otherwise English text that was loaded through redirected resources. In that case the string would be re-translated, although the input and output would likely be very similar.

Any solution would probably require fiddling a bit with the text resource redirector allowing it to block standard translation functionality in specific situations or for certain texts.

flywhc commented 4 years ago

There is no way to distinguish Japanese from Chinese, because they share thousands of characters. Traditionally Korean and Vietnamese also share many of these characters, hence the name CJKV characters.
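To illustrate the overlap, here is a small Python sketch. The `classify` helper is hypothetical and not part of the plugin. It only shows that kana prove a text is Japanese, while a text made purely of Han characters stays ambiguous:

```python
import re

# Hiragana (U+3040-U+309F) and Katakana (U+30A0-U+30FF) only occur in Japanese.
KANA = re.compile(r"[\u3040-\u30ff]")
# CJK Unified Ideographs (U+4E00-U+9FFF) are shared by Chinese and Japanese.
HAN = re.compile(r"[\u4e00-\u9fff]")


def classify(text):
    """Very rough script check; it cannot separate Chinese from kanji-only Japanese."""
    if KANA.search(text):
        return "japanese"
    if HAN.search(text):
        return "ambiguous"
    return "other"
```

With this check, プレイヤー衣装選択 is classified as Japanese, but 玩家服装 and 主人公 both come out as ambiguous.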

Thank you for your detailed reply, although I don't understand the design of this plug-in. Would it be possible to automatically add a special prefix when the plug-in loads a redirected resource, and remove this prefix when the UI calls the plug-in to display the text?

If that is hard to do automatically, would it be possible to add a special prefix that disables auto translation for a line or a part of a phrase? It could even be a bracket, to address the mixture of English and Japanese that you mentioned. Such as:

[Japanese A]=[Japanese B] nt:{Do not Translate this}
The plug-in should then display:
[Translated B] Do not translate this

Then I can use a simple script to patch all files under the RedirectedResources folder with this prefix or bracket.

GeBo1 commented 4 years ago

@flywhc Can you try putting an extra file under Text with [Chinese A]=[Chinese A]?

For example:

玩家服装=玩家服装

If that works a change could be made to TextResourceRedirector that might solve it (or at least help).

gravydevsupreme commented 4 years ago

Hmm, @flywhc, now that you mention it yourself, there may be a workaround of the sort that you are describing, although I would like to solve this properly.

The default configuration of XUA comes with an option that indicates that it should ignore any texts that start with '\u180e', a Mongolian vowel separator.

If the translation (not the untranslated text!) in the redirected resources starts with this character, the plugin should ignore the text. So you could make all translated entries start with this character.

With that being said, that option, as it is currently implemented in XUA, would not work as intended for this. But that is a bug more than anything else, something that I'd be willing to change.

Also note that Unicode escaping is not currently supported in the translation file format, so you'd have to write the actual character, which is invisible, and not \u180e.
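A rough Python sketch of how the translated side could be patched with the literal character. It assumes UTF-8 files, that the first '=' on a line is the source/translation separator, and that lines starting with `//` or `#` are comments or directives. `mark_line` and `mark_tree` are made-up helper names:

```python
from pathlib import Path

MVS = "\u180e"  # Mongolian vowel separator; written to the file as the literal character


def mark_line(line):
    """Insert MVS right after the first '=' of a 'source=translation' line."""
    stripped = line.strip()
    if not stripped or stripped.startswith(("//", "#")) or "=" not in line:
        return line                      # blank line, comment, directive, or no pair
    source, translation = line.split("=", 1)
    if translation.startswith(MVS):
        return line                      # already marked, keep the function idempotent
    return f"{source}={MVS}{translation}"


def mark_tree(root):
    """Rewrite every .txt file under root in place (assumes UTF-8 encoded files)."""
    for path in Path(root).rglob("*.txt"):
        lines = path.read_text(encoding="utf-8").splitlines(keepends=True)
        path.write_text("".join(mark_line(l) for l in lines), encoding="utf-8")
```

Entries whose source side contains an escaped or regex-internal '=' would need extra handling, so a backup of the folder before running something like `mark_tree` is advisable.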

Maybe an approach like this could be integrated directly into the text resource redirector if we simply make a convention within XUA.

gravydevsupreme commented 4 years ago

Maybe this could be handled completely transparently to the text resource redirector as well. The plugin could simply add the Mongolian vowel separator character to all translations added to it (or whatever would be used internally; it could be any combination of characters really), and then the plugin could identify that combination of characters and fix the text before display.

In that way it would Just Work™, without needing any modifications to the text resource redirector and without needing any modifications to the translations. And it would also fix occasional double translations in English, which does sometimes happen to this day.
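As a sketch of the idea only (the function names are hypothetical and this is not how the plugin is actually implemented), in Python it could look roughly like this:

```python
MVS = "\u180e"  # marker; any character sequence unlikely to occur in game text would do


def register_translation(translations, source, translated):
    """Store the translation with the marker appended."""
    translations[source] = translated + MVS


def before_display(text):
    """Return (text_to_show, already_translated)."""
    if MVS in text:
        return text.replace(MVS, ""), True   # recognized: strip marker, skip translation
    return text, False                       # unmarked: goes down the normal translation path
```

A marked string is recognized even if the game passes it around before showing it, as long as the game leaves the marker character in place.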

I'll sleep a bit on this. :)

gravydevsupreme commented 4 years ago

This could cause problems if there is some pre-processing of text before display though. Like OPTION, or whatever it is that is used in various games. That may be an issue, right? @GeBo1

GeBo1 commented 4 years ago

@gravydevsupreme For entries with prefixes (all the HS2 ones should be documented) the text on the translation side is handed off (mostly) untouched, so as long as the game engine leaves it alone things like this should work:

PREFIX1:こんにちは= Hi!
// \u180e just after =

In some cases the string is embedded in something else (usually separated by `,`, which is why TRR replaces `,` with a Unicode comma that looks close enough in those cases), then the game engine splits on `,` and sends the text along, so I don't see why it wouldn't work.

gravydevsupreme commented 4 years ago

I see. All good then. I was primarily worried about messing up some "encoded text" by including a character inside a prefix rendering it unrecognizable.

gravydevsupreme commented 4 years ago

Just as a follow up: I hope to try to implement a prototype for this during the weekend.

flywhc commented 4 years ago

Thank you @GeBo1 and @gravydevsupreme.

I did two tests:

  1. Copy \RedirectedResources*.* to \Text (both folders have the same assets folder).
     It works well for a while, but a double translated sentence is created in _AutoGeneratedTranslations. That text is stored in \Text, not \RedirectedResources\.

  2. Add 0x180E after every '=' in the txt files under \RedirectedResources.
     I saw a few double translated sentences in _AutoGeneratedTranslations too, and all of them are also stored in \Text, not \RedirectedResources.

It seems \RedirectedResources is one of the causes of the problem. \Text also has the problem, but it occurs much less often there.

gravydevsupreme commented 4 years ago

Hi. Please try this version.

XUnity.AutoTranslator-BepIn-5x-4.13.0-beta1.zip

And report back if it works. It should prevent double translation of redirected resources entirely. (This can be controlled through a new configuration option.)

RedirectedResourceDetectionStrategy=AppendMongolianVowelSeparatorAndRemoveAll

Available options:

These options are included because different games obviously handle their text resources differently, combining texts in different ways.

flywhc commented 4 years ago

Thank you gravydevsupreme. I overwrote all files in my game folder, kept the option unchanged and played the game. I didn't see any strings from RedirectedResources being double translated.

One string has a translation under RedirectedResources, but a Google auto-translated version was still created in the AutoTranslator file:
\RedirectedResources\assets\abdata\studio\info\00\animecategory_00_00\translation.txt: 主人公=主角
\Text_AutoGeneratedTranslations.txt: 主人公=英雄

The following double translated strings in _AutoGeneratedTranslations were originally stored in other txt files under the Text folder:

初始颜色=从最开始
月=月
Z轴旋转=Z车削
呵呵呵~我们真是相思相爱啊~来吧~来吧~呵呵~呵呵呵……♪\n[72A4]=呵呵呵〜对,对不起,对不起,我要来了,我要来了,我要来了

The good news is that during half an hour of testing only the strings above had the problem. The majority of strings are correct. The affected strings contain no special characters that set them apart: other strings in the same txt files all work fine, and the affected ones are in different scenes and different txt files.

gravydevsupreme commented 4 years ago

About the first issue of:

主人公 being translated differently.

I am not quite sure this is the same issue. If this were a "double translation" issue as initially described, the problem would be that 主人公 would first be translated into Chinese (主角) and then 主角 would be retranslated into Chinese under the assumption that it was actually Japanese, which means you should have the following entries instead:

主人公=主角
主角=???

With the two translations you describe here, it simply means that 主人公 will be translated differently depending on where it is loaded from, which is a valid scenario and one of the exact reasons we even have redirected resources in the first place.
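To check which auto-generated entries really are double translations in this sense, a small Python sketch could compare the two files. It assumes plain `source=translation` lines and treats the first '=' as the separator. `parse_pairs` and `find_double_translations` are hypothetical helpers:

```python
def parse_pairs(text):
    """Parse 'source=translation' lines, skipping blanks, comments and directives."""
    pairs = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith(("//", "#")) or "=" not in line:
            continue
        source, translation = line.split("=", 1)
        pairs[source] = translation
    return pairs


def find_double_translations(manual_text, auto_text):
    """Return auto-generated entries whose source is itself a manual translation."""
    manual_outputs = set(parse_pairs(manual_text).values())
    return {src: out for src, out in parse_pairs(auto_text).items()
            if src in manual_outputs}
```

For example, with the manual entries プレイヤー衣装選択=玩家服装 and 主人公=主角, an auto-generated 玩家服装=玩具屋衣服 is flagged, while 主人公=英雄 is not, because 主人公 is a source text and not a translation.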

I am not quite sure I am following the second issue you are mentioning. You will have to be more specific.

flywhc commented 4 years ago

@gravydevsupreme Sorry, I did not explain clearly. There are two issues:

  1. 主人公 is not double translated. The AutoTranslator didn't use my translation in \RedirectedResources...translation.txt; it called GoogleTranslate and created a new line in _AutoGeneratedTranslations.txt for it. This word is the last line of the file \RedirectedResources...translation.txt:
....
床=地板
飲食=饮食
寝=睡觉
主人公=主角

I used a hex editor to compare them: both "主人公" are identical and no Mongolian character was added. It is a default name in the input field that allows users to change the name of the character. It is the only entry that has the problem, so it is not a big deal. I solved it by adding an extra "主人公=主角" to a txt file under the \Text folder.

  2. 初始颜色 to 呵呵呵~我们真是相思相爱啊 are another set of texts under the \Text folder, and these are double translated:
     初始颜色 is in \Text\Main\Character_Maker.txt
     月 is in \Text\Main\Character_Maker.txt
     Z轴旋转 is in Text\Main\Character_Maker.txt
     呵呵呵~我们真是相思相爱啊 is in \Text\zz_MachineTranslations\z_MachineTranslation.txt, and I found more in z_MachineTranslation.txt
     During repeated tests I found more text being double translated in the Character maker and in the dialog after the H scene. The strange thing is that in repeated tests only these strings have the problem, and they are no more special than other strings in the same txt file. For example, X轴旋转 and Y轴旋转 are very similar to Z轴旋转 (rotate by Z axis), but they don't have the problem.

gravydevsupreme commented 4 years ago

  1. I think this may be because that translation simply doesn't belong in that redirected resource translation file then. I assume it would not be used for the English translation either then.

  2. So, to clarify further, these last four translations have nothing to do with (and are therefore not present in) redirected resource translations? But nonetheless they are being translated twice like so (example with Z轴旋转):

[original JP] => Z轴旋转 (zh)
Z轴旋转 (zh, assumed JP) => Z车削

flywhc commented 4 years ago

  1. That should be the reason! Adding an additional pair in \Text should be the correct solution.

  2. yes

In the file HS2\BepInEx\Translation\zh-CN\Text\Main\Character_Maker.txt:
回転Z=Z轴旋转
[jp'rotate by Z axis'] = [zh'rotate by Z axis']

In _AutoGeneratedTranslations.txt it created a new line after running the game:
Z轴旋转=Z车削
[zh'rotate by Z axis'] = [zh'lathe by Z axis']

gravydevsupreme commented 4 years ago

If it double-translates like that it is certainly a bug. I tried reproducing but I was unable to. Can you please tell me where exactly this occurs for '回転Z'?

I tried reproducing it with hair position adjustments, where this exact text is present, but it just worked.

gravydevsupreme commented 4 years ago

Perhaps you can provide a translation folder with the minimal translations required to reproduce the problem and tell me where the problem occurs?

flywhc commented 4 years ago

I reproduced the issue with English HS2. I am using [ScrewThisNoise][Illusion] HoneySelect 2 (ハニーセレクト2) BetterRepack R4

https://files.catbox.moe/rekm1q.zip
Extract the above zip file to BepInEx\Translation\en\Text\Main; it overwrites Character_Maker.txt.

Then run the game and enter the Male Character maker. Then it generates the following lines in the _AutoGeneratedTranslations.txt:

初始颜色=From the beginning
月=Month
Z轴旋转=Z rotation

They are in the form [Chinese]=[English]

In Character_Maker.txt, they are:

初期カラー=初始颜色
月=月
回転Z=Z轴旋转

[Japanese]=[Chinese]

I don't know where they are displayed on the screen. The hair settings display the correct text.

Another case is the dialog after the H scene, where I saw incorrect text displayed on the screen. There are too many dialog texts and the result is very random (it only occurred once in half an hour), so I don't know how to reproduce it.

gravydevsupreme commented 4 years ago

I was able to reproduce it now.

I don't think those texts are ever displayed anywhere.

The problem originates from the translation scoping:

#set exe HoneySelect2
// Scene: CharaCustom (3)
#set level 3

at the top of the file. If you remove those, the double translations will no longer be added.
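A rough Python model of why scoping can lead to this. It assumes that `#set level N` restricts the entries that follow it to scene level N, and it uses a made-up level of -1 for a text component that lives outside the scoped scene. `parse_scoped` and `is_known` are hypothetical names, not plugin functions:

```python
def parse_scoped(text):
    """Return {source: (translation, levels)}; levels is None for unscoped entries."""
    levels = None
    entries = {}
    for line in text.splitlines():
        if line.startswith("#set level "):
            levels = {int(n) for n in line[len("#set level "):].split(",")}
        elif line.startswith(("#", "//")) or "=" not in line:
            continue                     # other directives, comments, blank lines
        else:
            source, translation = line.split("=", 1)
            entries[source] = (translation, levels)
    return entries


def is_known(entries, text, level):
    """True if text is a source or a translation that is visible at this scene level."""
    for source, (translation, levels) in entries.items():
        if (levels is None or level in levels) and text in (source, translation):
            return True
    return False
```

In this model, Z轴旋转 is recognized as an existing translation at level 3, but not for a component outside that scope, so it would be handed to the translation endpoint as if it were new text.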

I am not sure why, but the game seems to have text components that exist outside of the scenes in which those texts are displayed. Maybe they are copied after character editor creation and kept out of the scene.

I would not be surprised if it is related to @ManlyMarco's Illusion Modding API, which duplicates UI components already existing in the scene in order to allow other mods to extend the UI. I am not sure why they would not be present in any scene, though, since as far as I am aware you can't even instantiate Unity objects outside scenes.

In fact if I remove that mod the double translation no longer occurs even with translation scoping enabled. (obviously that breaks the entire game since half the plugins that exist depend on it, so don't do that).

But I think the main takeaway from this is: I don't think any of these texts are ever displayed anywhere, and they may simply be an artifact of how the game engine works, how the game is implemented and how game object duplication works.

flywhc commented 4 years ago

Thanks @gravydevsupreme. That solved another cause.

There is one last case: the game randomly displays a double translated sentence after the H scene. This is the JP=ZH entry:
えーっと……そのー……今のわたし……なにも抵抗できないですよ……ほら、まともに立てないですもん……=嗯……那个……现在的我……什么也抵抗不了啊……看~根本站不起来啊……[751B]

I added a line number like "[751B]" at the end of every Chinese line.

After the H scene, when the NPC sums up, the game only displays "嗯..." without "[751B]". In _AutoGeneratedTranslations.txt it added a ZH=ZH line:
嗯……那个……现在的我……什么也抵抗不了啊……看~根本站不起来啊…\n…[751B]=嗯...

I noticed that all of the double translated texts have '\n' added. I am not sure whether that is the reason.

gravydevsupreme commented 4 years ago

Please indicate which translations files are involved in this so I can test and reproduce it.

Actually, never mind. If the game adds newlines and dots after the translation, it is obvious why it happens. Not sure what a solution would be, though.

flywhc commented 4 years ago

> ~Please indicate which translations files are involved in this so I can test and reproduce it.~
>
> Actually, never mind. If the game adds newlines and dots after the translation, it is obvious why it happens. Not sure what a solution would be, though.

It is in \Text\zz_MachineTranslations\z_MachineTranslation.txt

Will automatically adding the Mongolian vowel separator help? I guess not, because I tested that before. Does IgnoreWhitespaceInDialogue add newlines?

Can we add newlines ourselves if a translated sentence is longer than a preset length, so the game won't add newlines?

gravydevsupreme commented 4 years ago

You can try the following:

Change RedirectedResourceDetectionStrategy=AppendMongolianVowelSeparatorAndRemoveAll to RedirectedResourceDetectionStrategy=AppendMongolianVowelSeparator.

This will not attempt to remove the vowel separator which is used to identify the string as a redirected resource and as such when the game adds "\n..." to the string it will still think it has been redirected and ignore it.
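A rough Python model of the difference, not the plugin's actual logic. `game_wrap` and `needs_translation` are hypothetical, and the wrap width of 10 is arbitrary:

```python
MVS = "\u180e"


def game_wrap(text, width):
    """Model of the game modifying a text by inserting a newline after `width` characters."""
    return text if len(text) <= width else text[:width] + "\n" + text[width:]


def needs_translation(text, known_translations):
    """Skip text that carries the marker or exactly matches a known translation."""
    return MVS not in text and text not in known_translations


translated = "嗯……那个……现在的我……什么也抵抗不了啊……"
known = {translated}

kept = game_wrap(translated + MVS, 10)   # marker left in place
stripped = game_wrap(translated, 10)     # marker removed before the game modified the text
```

Here `kept` is still recognized because the marker survives the game's modification, while `stripped` no longer matches anything and would be translated again.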

I am not sure how the game is going to react to that, though. Maybe it will flood sentences with squares indicating missing characters in the font or maybe it will just work.

Either way the real solution is tougher because it is about finding a solution to the situation where the game adds untranslated text to an already translated text, which is also an issue in other games (I have seen it in several).

flywhc commented 4 years ago

Does the plug-in compare the string to be displayed with existing translated text strings, then query googleTranslate if no entry matches?

So is this how the double translation happens?

  1. the game adds '\n' (no dots) if the text is too long to fit the screen
  2. but the plug-in cannot match it with any translated text
  3. so the plug-in queries GoogleTranslate

If this is the case, can it apply logic similar to IgnoreWhitespaceInDialogue, removing all '\n' from the text and then comparing it with the translated strings?

gravydevsupreme commented 4 years ago

I doubt the ellipsis is added because the text cannot fit on the screen. This is probably just the behaviour of the game.

Dots are not whitespace, and all whitespace is already removed from texts of this size. But this is only during translation. It is completely unrelated to the text lookup itself. The plugin does a lot of things related to whitespace to make smart lookups. It doesn't do anything about punctuation, though.

Any proper solution really involves identifying the fact that text is being appended onto a translated text. Implementing this in a performant manner is just a relatively big task. And the fact that it somehow also involves redirected resources makes it quite a bit more complex.

Also: Did my suggestion not work?

flywhc commented 4 years ago

No ellipsis is added. Both Japanese and Chinese use two characters to represent one ellipsis (6 dots in total).

The game only adds newline characters, and in a silly way: if the text is longer than n characters, it inserts a newline after the n-th character, even in the middle of a jp/zh two-character ellipsis. In the real world, we never split the two 3-dot halves of an ellipsis across two lines. So I think simply removing all newlines before any text comparison should solve the problem.
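A small Python sketch of that idea. `wrap_every` is only my guess at the game's wrapping (a break after every n characters), and `lookup_ignoring_newlines` is a hypothetical comparison helper, not an existing plugin feature:

```python
def wrap_every(text, n):
    """Model of the game: break after every n characters, even inside an ellipsis pair."""
    return "\n".join(text[i:i + n] for i in range(0, len(text), n))


def lookup_ignoring_newlines(text, known_translations):
    """True if text equals a known translation once newlines are removed on both sides."""
    normalized = text.replace("\n", "")
    return any(normalized == t.replace("\n", "") for t in known_translations)
```

For instance, wrapping 看~根本站不起来啊……[751B] at 10 characters splits the ellipsis into "…\n…", so an exact comparison fails, but the comparison without newlines still finds the translation.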

In most cases the game displays the text correctly. I have to find a way to make this problem always reproducible; then I can validate your suggestion.

Talking about smart lookup, I got a few texts in _AutoGeneratedTranslations.txt that contain the whitespace \u20, while the English HS2 project uses "　" (\u3000) for the same Japanese original text. I haven't seen this text on the screen yet, so I hope the smart lookup will treat \u3000 as \u20. It is hard to validate these randomly appearing texts.

flywhc commented 4 years ago

@gravydevsupreme your suggestion works.

I used a script to replace all dialog texts with the string I mentioned, which always causes double translation. Then I could always reproduce the issue.

Then I added the Mongolian vowel separator before every dialog text and changed the setting. After that, no auto-generated text was added. Thank you!

I am not a Unity plug-in developer, but if I developed a similar app, I would use a SQLite DB to store 4 indexed fields: original text; original text without newlines and whitespace (\u0020, \u00a0, \u3000); translated text; translated text without newlines and whitespace. Then I would match against the translated text without newlines and whitespace first, and skip translation if found. After that I would try to match the original, then the original without whitespace.
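A minimal sketch of that idea in Python with the standard `sqlite3` module. The table layout, function names and in-memory database are just for illustration:

```python
import sqlite3

# Characters removed for the normalized columns: newlines plus the three space variants.
WHITESPACE = str.maketrans("", "", "\n\r\u0020\u00a0\u3000")


def normalize(text):
    return text.translate(WHITESPACE)


def open_store():
    db = sqlite3.connect(":memory:")
    db.execute("""CREATE TABLE translations (
        original TEXT, original_norm TEXT, translated TEXT, translated_norm TEXT)""")
    for col in ("original", "original_norm", "translated", "translated_norm"):
        db.execute(f"CREATE INDEX idx_{col} ON translations ({col})")
    return db


def add(db, original, translated):
    db.execute("INSERT INTO translations VALUES (?, ?, ?, ?)",
               (original, normalize(original), translated, normalize(translated)))


def resolve(db, text):
    """Return ('skip', text), ('hit', translation) or ('miss', None)."""
    norm = normalize(text)
    if db.execute("SELECT 1 FROM translations WHERE translated_norm = ?",
                  (norm,)).fetchone():
        return "skip", text                  # already a translation, do not translate again
    for column, key in (("original", text), ("original_norm", norm)):
        row = db.execute(f"SELECT translated FROM translations WHERE {column} = ?",
                         (key,)).fetchone()
        if row:
            return "hit", row[0]
    return "miss", None                      # only now would an online translator be asked
```

After `add(db, "回転Z", "Z轴旋转")`, a text like "Z轴\n旋转" resolves to "skip", while "回転 Z" still hits the stored translation through the normalized column.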

gravydevsupreme commented 4 years ago

I have now released v. 4.13.0. I will probably look into a generic solution that allows the plugin to detect text-appends in a generic fashion in the future.

flywhc commented 4 years ago

> I have now released v. 4.13.0. I will probably look into a generic solution that allows the plugin to detect text-appends in a generic fashion in the future.

Does v4.13.0 include the change? I extracted XUnity.AutoTranslator-BepIn-5x-4.13.0.zip to my game folder and the double translation is back. I have to replace the plugins with the beta version you posted in this thread to solve the double translation in redirected resources.

gravydevsupreme commented 4 years ago

It should include all changes that I have posted here, so I certainly don't understand this.

Can you try removing the RedirectedResourceDetectionStrategy line from your config (so it resets to default) and try again?

flywhc commented 4 years ago

> It should include all changes that I have posted here, so I certainly don't understand this.
>
> Can you try removing the RedirectedResourceDetectionStrategy line from your config (so it resets to default) and try again?

Tested again with a clean game. It works. Thank you!