Animenosekai / translate

A module grouping multiple translation APIs
GNU Affero General Public License v3.0
509 stars 60 forks

Links and hashtags seem to change after translation #93

Open reddere opened 1 year ago

reddere commented 1 year ago

When using GoogleTranslate(), the upper- and lower-case letters in links get changed at random. How can I fix this?

Animenosekai commented 1 year ago

Do you have an example to reproduce ?

reddere commented 1 year ago

Absolutely @Animenosekai! Here is a text I got from a tweet. Notice how the letters in both hashtags and in the tweet link are altered. In the second hashtag, a letter even gets added out of nowhere.

from translatepy.translators.google import GoogleTranslate 

text = 'Kado Thorne es un Vampiro y viajó en el tiempo desde el año 2020 cuando se presentó a la skin Oro.\n\n#Fortnite #FortniteLastResort https://t.co/m1cE9sSrNb'

translator = GoogleTranslate()

italian_text = translator.translate(text, 'Italian')

print(italian_text)

Result: Kado Thorne è un vampiro e ha viaggiato nel tempo dal 2020 quando apparve nell'oro della pelle.\n\n#FORTNITE #FORTNITLelasTResort https://t.co/M1ce9SSRNB

Although the normal text was translated fine, the hashtags and the link were altered.
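To check which parts were altered, something like the following could be used. It is only a rough sketch: the `TOKEN` pattern and the `altered_tokens` helper are made up for illustration and are not part of translatepy, and the substring check can miss cases where one hashtag is a prefix of another.

```python
import re

# Hypothetical pattern: a rough matcher for links and hashtags, not a full grammar.
TOKEN = re.compile(r"https?://[!-~]+|#\w+")


def altered_tokens(source_text: str, translated_text: str) -> list[str]:
    """Return the links and hashtags from the source that no longer appear verbatim."""
    return [token for token in TOKEN.findall(source_text) if token not in translated_text]


# Only the tail of the tweet and of the result shown above.
source = "#Fortnite #FortniteLastResort https://t.co/m1cE9sSrNb"
result = "#FORTNITE #FORTNITLelasTResort https://t.co/M1ce9SSRNB"
print(altered_tokens(source, result))
# ['#Fortnite', '#FortniteLastResort', 'https://t.co/m1cE9sSrNb']
```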

Any ideas on how to fix this?

Animenosekai commented 1 year ago

Maybe parse the text with a regex?

reddere commented 1 year ago

What do you mean? Are there parameters I can pass to the GoogleTranslate() instance that let me hide parts of the passed text using a regex?

Animenosekai commented 1 year ago

Nope, not for now, but should I add one?

Here is the major problem with this, and with HTML translation in general, though:

https://github.com/Animenosekai/translate/issues/71#issuecomment-1312795629

TLDR: It might work for Latin-based languages, but different languages have different structures, and the order of words might need to change from one language to another. (This is also one of the reasons why we don't translate each word individually and then put the pieces back together.)

reddere commented 1 year ago

Yeah, I mean implementing what I described would actually make it much better. The issue you mentioned is related, and it is easily fixable by adding a space after periods or commas in the final result when one is missing. But implementing a regex, or any other way to hide certain parts of the text, would be awesome, since those parts frequently get altered.

Animenosekai commented 1 year ago

Yes, this issue might be easier to handle than normal translations, as links don't exactly mean anything and don't need to be translated.

But here is the problem:

First, it is not possible to translate the parts separately, because that might not give the best translation (words take on different meanings in context than they have individually). Also, as said before, there is no telling whether the position of the link will change, so we can't just pin the position of the link and put it back after the translation:

(French) Je voudrais changer le lien https://google.com parce qu'il me semble y avoir trouvé une erreur
(Japanese) https://google.comのリンクに問題があると思うから変えたいです

Notice how the position of the link changes.

Now, if we let the translator translate everything and it ends up having issues with the links, we might want to find the link in the translated text and replace it with the previous one.

Something like this would be imaginable:

def link_correction(translated_text: str, links: list[str]) -> str:
    """A simple link correction function to keep the same links as before translation"""
    processing_text = translated_text.lower()
    for link in links:
        index = processing_text.find(link.lower())  # try to find the link in the translated text
        if index == -1:
            continue  # link not found, even ignoring case: leave the text untouched
        # replace the altered link with the one from before translation
        translated_text = translated_text[:index] + link + translated_text[index + len(link):]
    return translated_text

Note
This is an oversimplification of what could be done

Now, as you mentioned previously:

The link went from https://t.co/m1cE9sSrNb to https://t.co/M1ce9SSRNB. This alteration breaks the link entirely.

So if two links are identical once lowercased, a case-insensitive search cannot tell them apart, and they might both be replaced by the same link.
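One way to soften the collision problem, as a rough sketch only: collect the links from the source text with a regex, group them by their lowercased form, and hand them back in source order. `URL_PATTERN` and `restore_links` are hypothetical names, not translatepy APIs, and the sketch assumes that colliding links keep their relative order after translation, which is not guaranteed.

```python
import re
from collections import defaultdict

# Hypothetical pattern: matches printable ASCII after the scheme, so it stops at
# spaces and non-ASCII text, but it may still swallow trailing punctuation.
URL_PATTERN = re.compile(r"https?://[!-~]+", re.IGNORECASE)


def restore_links(source_text: str, translated_text: str) -> str:
    """Put the original links back, matching them case-insensitively."""
    # Group the original links by lowercased form, keeping source order.
    originals = defaultdict(list)
    for link in URL_PATTERN.findall(source_text):
        originals[link.lower()].append(link)

    def replace(match: re.Match) -> str:
        candidates = originals.get(match.group(0).lower())
        if not candidates:
            return match.group(0)  # unknown link: leave it untouched
        # Hand out colliding links in source order; reuse the last one if we run out.
        return candidates.pop(0) if len(candidates) > 1 else candidates[0]

    return URL_PATTERN.sub(replace, translated_text)
```

This still cannot recover a link whose characters were added or dropped, such as the second hashtag in the example above.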


Now what should I do ?

Note
Even though I'm only talking about links here, the same applies to hashtags, except that hashtags are even harder to correct after translation, since they may carry meaning and may need to be translated.

ZhymabekRoman commented 12 months ago

@reddere, use GoogleTranslateV2 and wrap each of your "static" links/hashtags in a span tag with the notranslate class:

<span class="notranslate">TAGS OR LINKS HERE</span>

For more information visit: https://cloud.google.com/translate/troubleshooting

In [5]: from translatepy.translators.google import GoogleTranslateV2

In [6]: dl = GoogleTranslateV2()

In [9]: dl.translate('Kado Thorne es un Vampiro y viajó en el tiempo desde el año 2020 cuando se presentó a la skin Oro.\n\n<span class="notranslate">#Fortnite</span> <span class="notranslate">#FortniteLastResort</span> <span class="notranslate">https://t.co/m1cE9sSrNb</span>', 'it')
Out[9]: TranslationResult(service=Translator(Google), source='Kado Thorne es un Vampiro y viajó en el tiempo desde el año 2020 cuando se presentó a la skin Oro.\n\n<span class="notranslate">#Fortnite</span> <span class="notranslate">#FortniteLastResort</span> <span class="notranslate">https://t.co/m1cE9sSrNb</span>', source_lang=Language(Spanish), dest_lang=Language(Italian), translation='Kado Thorne è un vampiro e ha viaggiato indietro nel tempo a partire dall\'anno 2020 quando gli è stata presentata la skin Oro.\n\n<span class="notranslate">#Fortnite</span> <span class="notranslate">#FortniteLastResort</span> <span class="notranslate">https://t.co/m1cE9sSrNb</span>')
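
Wrapping by hand gets tedious, so the spans could be added before translation and stripped afterwards with a regex. This is a minimal sketch: `PROTECTED`, `SPAN`, `wrap_notranslate` and `strip_notranslate` are hypothetical helpers, not part of translatepy, and the patterns are rough.

```python
import re

# Hypothetical patterns: rough matchers for links and hashtags, and for the span itself.
PROTECTED = re.compile(r"https?://[!-~]+|#\w+")
SPAN = re.compile(r'<span class="notranslate">(.*?)</span>', re.DOTALL)


def wrap_notranslate(text: str) -> str:
    """Wrap every link and hashtag in a notranslate span before translation."""
    return PROTECTED.sub(lambda m: f'<span class="notranslate">{m.group(0)}</span>', text)


def strip_notranslate(text: str) -> str:
    """Remove the notranslate spans from translated text, keeping their content."""
    return SPAN.sub(lambda m: m.group(1), text)
```

The idea is to pass `wrap_notranslate(text)` to the translator and run `strip_notranslate` on the translated string. It only helps with a translator that honours the notranslate class.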
reddere commented 11 months ago

Thank you so much @ZhymabekRoman @Animenosekai. I haven't tested the workaround yet. I kept using GoogleTranslate until just 2 days ago, when I tried the Reverso translator, which seems to me to work even better. In Italian it does decently, both lexically and in word choice.

I did find an issue with that one as well, though: it throws an error when a word like única is in the source text. I think it's better to open a separate issue for that: https://github.com/Animenosekai/translate/issues/96

Animenosekai commented 11 months ago

I was talking with Venom on Discord about possible workarounds and about support for notranslate, or other HTML-parsing ways of leaving certain parts of a given input untranslated. I might consider this soon.