dmMaze / BallonsTranslator

Deep-learning-assisted comic translation tool, supporting one-click machine translation and simple image/text editing | Yet another computer-aided comic/manga translation tool powered by deep learning
GNU General Public License v3.0

[Feature request] Replace all newline characters with one space each #558

notimp closed this issue 2 weeks ago

notimp commented 2 weeks ago

Google Lens OCR is one of the best free OCR options out there, and Google Translate is one of the most popular free translators used by many people here. But if you use them together (even with the "handling newline" setting in the Google Lens module set to remove), there will still be line breaks in every recognized text bubble. Those are fine if you don't plan on doing a manual quality assurance pass (editing all text in the comic), but if you plan on reflowing the text afterwards, removing them by hand becomes the most time-intensive step of that pass. (Using a different font size or changing the size of the rectangle both count as reflowing the text. :) )

Please add an option for a text parser that takes all the text in one text bubble (recognized and handed forward as "one text box" by the OCR, and translated as "one text box" by the translator module) and replaces every run of whitespace (\s), including every newline (\n), with exactly one space. That way, all the text we get back in the translation box for one text bubble will by default be one continuous line, limited only by the text box's boundaries.
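
To make the request concrete, here is a minimal Python sketch of the kind of parser meant here. The function name `collapse_whitespace` is hypothetical and is not part of BallonsTranslator; it only illustrates collapsing each run of whitespace, newlines included, into a single space.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace (newlines included) with one space.

    Hypothetical helper for illustration only; not BallonsTranslator code.
    For str patterns, Python's \\s also matches non-breaking spaces.
    """
    # \s+ matches one or more consecutive whitespace characters,
    # so "word\n  word" becomes "word word".
    collapsed = re.sub(r"\s+", " ", text)
    # Drop a leading or trailing space left over from the collapse.
    return collapsed.strip()


bubble = "Ich\nwerde sie sofort anrufen!"
print(collapse_whitespace(bubble))
```

Because the whole run is collapsed, a newline followed by indentation also ends up as a single space, which is the "one continuous line" behavior asked for above.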

Here are some examples to illustrate the issue.

All images show the bubbles exactly as Google Lens OCR, Google Translate and BallonsTranslator filled them out in auto mode after hitting run. None of the results were modified manually.

Screenshot 2024-08-26 071354

In all the examples in this image, you see the same behavior: after the first word, there is a line break in the text.

Ich
werde sie sofort anrufen!
Eine
Woche später...
Eva
die Zeitungen sprechen von
Hören, 
die Brauttoilette ist aus sehr feinem Goldfasermaterial

If you plan on reflowing those bubbles (changing their size or their font size), you always have to remove that line break manually, as it will still be honored after reflowing the text.

Here is a more outrageous example:

Screenshot 2024-08-26 071442

Warten sie eine Minute! Sie hätten die Drogen ins Meer
werfen
können. Bevor wir sie gehen lassen, werfen wir noch einen
Blick
unter Wasser. Meine beiden Agenten sind ausgezeichnete
Taucher...

So what I'm proposing is this: give us an optional feature that parses the text in every recognized speech bubble, removes all line breaks (soft and hard), and replaces each of them with one space.

So this:

Bildschirmfoto 2024-08-26 um 07 54 05

which will give the following result:

Bildschirmfoto 2024-08-26 um 07 54 43

==

Warten sie eine Minute! Sie hätten die Drogen ins Meer werfen können. Bevor wir sie gehen lassen, werfen wir noch einen Blick unter Wasser. Meine beiden Agenten sind ausgezeichnete Taucher...

-- so that the text within a text bubble will be limited only by the boundaries of the text box, and not by line breaks that were already in the text.

Make it available as an optional feature (many people will prefer the current behavior, as it gives you better results if you don't plan on doing a quality control pass afterwards).

Having such an optional feature would make reflowing text much less time-consuming in an optional manual quality control pass afterwards, where you'll be touching close to every text box and resizing it anyhow.

I'm currently unsure whether Google Lens (unlikely, see the text boxes) or Google Translate adds those newline characters, or whether it's actually the way BallonsTranslator handles reflowing the text. Either way, please make it an optional feature to not have those line breaks occur.

If you know whether Google Lens, Google Translate, or BallonsTranslator itself adds those newline characters, please tell us (/me).

Any help would be appreciated. :)

Thank you,

notimp

edit: I'll also post the uncleaned test images, give me a sec.

notimp commented 2 weeks ago

Here are the unedited images for you to test on. The source language is Dutch (Netherlands); the target language I used was German.

s120 0023

s120 0014

dmMaze commented 2 weeks ago

In the config panel, under Typesetting, uncheck Autolayout and no extra line breaks will be inserted into the translated text. Besides, you can write regular expressions in Titlebar->Edit->Keyword substitution for OCR & translated text.
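
As a rough illustration of such a substitution, the sketch below uses Python's `re` module to turn each line break into one space while leaving spacing inside a line untouched. It is an assumption that the Keyword substitution dialog accepts the same pattern syntax; the pattern, the replacement string, and the helper name `join_lines` are illustrative only and not taken from BallonsTranslator.

```python
import re

# Pattern / replacement pair, written the way one might enter it in a
# regex-based substitution rule (assumed syntax: Python's `re`).
PATTERN = r"[ \t]*\r?\n[ \t]*"  # one line break plus spaces/tabs around it
REPLACEMENT = " "               # exactly one space per line break


def join_lines(text: str) -> str:
    """Replace each line break with a single space (illustrative helper)."""
    return re.sub(PATTERN, REPLACEMENT, text)


sample = "Sie hätten die Drogen ins Meer\nwerfen\nkönnen."
print(join_lines(sample))
```

Unlike a blanket `\s+` rule, this only touches line breaks, so two consecutive line breaks yield two spaces ("one space each").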

notimp commented 2 weeks ago

> In the config panel, under Typesetting, uncheck Autolayout and no extra line breaks will be inserted into the translated text. Besides, you can write regular expressions in Titlebar->Edit->Keyword substitution for OCR & translated text.

Thank you very much. I'll close the ticket once I've had the chance to confirm it. :)

Thank you!

notimp commented 2 weeks ago

Unchecking that checkbox solved all my problems. Thank you. I'm closing this ticket now.

dmMaze commented 1 week ago

Commented in the wrong issue, so I'll post it here:

Screenshot 2024-09-04 21 32 58

The automatic mode should perform better now; the original algorithm was designed mostly for manga. You may want to try it with that checkbox checked and the font size set to use the global setting. Tuning the line spacing may also affect the results.