Animenosekai / translate

A module grouping multiple translation APIs
GNU Affero General Public License v3.0

Missing spaces between tags when using translate_html #71

Closed thanhtoan1196 closed 1 year ago

thanhtoan1196 commented 2 years ago

translate_html drops the space between text and an adjacent tag.

Code

from translatepy import Translator
print(Translator().translate_html("<p>I am a student and <strong>you are a teacher</strong></p>", "de"))

Current:

<p>Ich bin Student und<strong>du bist ein Lehrer</strong></p>

Expected:

<p>Ich bin Student und <strong>du bist ein Lehrer</strong></p>

Animenosekai commented 2 years ago

Thanks for reaching out!

I could indeed reproduce your problem.

I just checked the code, and we do not seem to remove the spaces intentionally. This might be done by one of the translators outside translatepy.

Because we don't expect all the translators to support HTML translation, we need to separate the components of the HTML, translate each of them on its own, and reassemble everything at the end.

A side effect is that each component is treated independently, so any cleaning (stripping spaces, for example) is applied to every component.

<p>I am a student and <strong>you are a teacher</strong>, incredible</p>
~~~^^^^^^^^^^^^^^^^^^^~~~~~~~~^^^^^^^^^^^^^^^^^~~~~~~~~~^^^^^^^^^^^^~~~~
           1                          2                       3

These are 3 separate components, each of which will be translated separately.
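
To make the effect concrete, here is a minimal sketch of the split, translate, and reassemble flow. It is not translatepy's actual implementation: the regex-based splitting, the `fake_translate` stub and its tiny dictionary are all assumptions made for illustration, with the stub standing in for a backend that strips whitespace.

```python
import re

# Capturing group so that re.split keeps the tags in the result.
TAG_PATTERN = re.compile(r"(<[^>]+>)")

# Hypothetical stand-in for a real translation backend.
FAKE_GERMAN = {
    "I am a student and": "Ich bin Student und",
    "you are a teacher": "du bist ein Lehrer",
}


def fake_translate(text: str) -> str:
    """Pretend backend: strips surrounding whitespace, then looks up the text."""
    cleaned = text.strip()
    return FAKE_GERMAN.get(cleaned, cleaned)


def translate_html_naive(html: str) -> str:
    """Translate every text component on its own and glue the pieces back."""
    pieces = []
    for part in TAG_PATTERN.split(html):
        if not part or TAG_PATTERN.fullmatch(part):
            pieces.append(part)  # empty string or a tag: keep as-is
        else:
            pieces.append(fake_translate(part))  # text component
    return "".join(pieces)


print(translate_html_naive(
    "<p>I am a student and <strong>you are a teacher</strong></p>"
))
# → <p>Ich bin Student und<strong>du bist ein Lehrer</strong></p>
```

The trailing space of the first component is gone because the stub cleaned it away before the pieces were joined.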

Now, the problem is that we don't know what kind of cleaning the translators perform, and different translators might even handle different components.

For some differently structured languages, the translator might add or remove specific symbols that carry meaning in the resulting language.

The order of symbols in a single phrase might also need to be different.

Now, suppose we introduce a basic check before translating to see whether we need to re-add spaces after the translation:

...
if tail_space_before_translation and not result.endswith(" "):
    result += " "
...
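
A slightly fuller version of that check could record the whitespace on both ends of a component and put it back afterwards. This is only a sketch: `translate_keeping_edges` is a hypothetical helper, and `translate` is any callable taking and returning a string, not a specific translatepy API.

```python
from typing import Callable


def translate_keeping_edges(text: str, translate: Callable[[str], str]) -> str:
    """Translate `text`, then restore its original leading/trailing whitespace."""
    stripped = text.strip()
    if not stripped:
        return text  # whitespace-only component: nothing to translate

    # Whitespace that was present before translation.
    lead = text[: len(text) - len(text.lstrip())]
    trail = text[len(text.rstrip()):]

    # Strip the result so we never end up with doubled spaces.
    return lead + translate(stripped).strip() + trail


# Stand-in translator for demonstration purposes only.
shout = lambda s: s.strip().upper()
print(repr(translate_keeping_edges("  hello world ", shout)))
# → '  HELLO WORLD '
```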

This might work for translations into Latin-script languages, but the translator might have deleted the spaces for a reason:

(I will use my native languages for simplicity)

<p>Je suis un étudiant <strong>et vous êtes un professeur</strong></p>

should be translated into Japanese as

<p>僕は生徒で<strong>あなたは先生です</strong></p>

Notice that we removed the space, because Japanese usually doesn't put spaces between words.
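
One way to account for this would be a per-language guard on the boundary space. The sketch below assumes a hand-picked, certainly incomplete set of target language codes that do not separate words with spaces; `NO_SPACE_LANGS` and `boundary_space` are hypothetical names and not part of translatepy.

```python
# Assumption: languages whose writing systems normally don't use
# spaces between words. Illustrative only, not an exhaustive list.
NO_SPACE_LANGS = frozenset({"ja", "zh", "th"})


def boundary_space(original: str, target_lang: str) -> str:
    """Return the whitespace to append after a translated component."""
    had_trailing_space = original != original.rstrip()
    if not had_trailing_space or target_lang in NO_SPACE_LANGS:
        return ""
    return " "


print(repr(boundary_space("I am a student and ", "de")))   # → ' '
print(repr(boundary_space("Je suis un étudiant ", "ja")))  # → ''
```

A fixed list like this is fragile, which is part of why a simple check is not a complete answer.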

translatepy itself also drops the space when translating into Japanese:

>>> from translatepy import Translate
>>> t = Translate()
>>> r = t.translate_html("<p>Je suis un étudiant <strong>et vous êtes un professeur</strong></p>", "Japanese")
>>> r
'<p>私は学生です<strong>そして、あなたは先生です</strong></p>'

(which is a weird translation because of the component separation, but that's another topic)

I would need to come up with a better algorithm to translate HTML content without losing the context (both language-wise and HTML-wise), but I guess that would require complex NLP.

If you have any ideas, I would welcome them.

If you have any questions or issues, feel free to raise them!

Oh, and sorry for being a bit inactive lately, but school work is way busier than what I previously had...

Animenosekai commented 1 year ago

Closing this for now, since it's been a while since this got any activity.

I partly continued this discussion in #93 if you are interested.

Feel free to reply if you want to reopen it!