DeepLcom / deepl-python

Official Python library for the DeepL language translation API.
https://www.deepl.com
MIT License

GSM-7 as input #32

Open willhardy opened 2 years ago

willhardy commented 2 years ago

Some languages can't be translated if the input text comes from an SMS (e.g. GSM-7 encoded). The problem is visible in the DeepL web interface (e.g. try translating "KAΛHΣΠEPA!").

The problem is that, to save space, SMS encoding reuses Latin characters where it can (e.g. KAHEP) and mixes in Greek letters only where necessary (ΛΣΠ).

If you know the language of the text, the solution is straightforward: for each language, convert the relevant Latin characters to their native Unicode counterparts. When you do this, the translation works, e.g. translating ΚΑΛΗΣΠΕΡΑ!. For Greek, this might be:

# Map the Latin lookalikes used by GSM-7 back to their Greek Unicode counterparts
original_greek_sms = "KAΛHΣΠEPA!"
latin = "EPTYIOAHKZXBNM"
greek = "ΕΡΤΥΙΟΑΗΚΖΧΒΝΜ"

table = str.maketrans(latin, greek)
original_greek_unicode = original_greek_sms.translate(table)
print(original_greek_unicode)  # ΚΑΛΗΣΠΕΡΑ!

Not all languages need this, but every language whose alphabet contains Latin-lookalike characters with separate Unicode code points would need this conversion. Until then, SMS text in those languages will not translate correctly.
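The per-language idea above could be sketched as a small dispatch table applied before calling the API. This is my own sketch, not part of the deepl library: the function name, the language codes, and the Cyrillic homoglyph set are assumptions for illustration (the Greek set is the one from the snippet above).

```python
# Sketch: per-language homoglyph tables mapping Latin lookalikes (as used by
# GSM-7 SMS encoding) back to the native script. The "ru" entry is an assumed
# example set, not verified against GSM-7; "el" reuses the mapping from above.
HOMOGLYPH_TABLES = {
    "el": str.maketrans("EPTYIOAHKZXBNM", "ΕΡΤΥΙΟΑΗΚΖΧΒΝΜ"),  # Greek
    "ru": str.maketrans("AEOPCXBKMHT", "АЕОРСХВКМНТ"),         # Cyrillic (assumed)
}

def normalise_sms(text: str, source_lang: str) -> str:
    """Restore native-script characters in SMS text before translation.

    Languages without a table are returned unchanged.
    """
    table = HOMOGLYPH_TABLES.get(source_lang)
    return text.translate(table) if table else text

print(normalise_sms("KAΛHΣΠEPA!", "el"))  # ΚΑΛΗΣΠΕΡΑ!
```

The normalised string could then be passed to the translator as usual; languages not in the table pass through untouched.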

daniel-jones-deepl commented 2 years ago

Hi Will, thanks for creating this issue. I wanted to let you know we are looking into it, but I don't have an answer yet.

daniel-jones-deepl commented 2 years ago

Hi again, I have some feedback from our machine translation team. We're always interested in making our models more robust to non-standard inputs, including this case of Latin characters mixed into Greek text.

Regarding the workaround you suggest, it may have unintended side effects in the general case, for example for text containing mixed scripts. However, if you know that this encoding fix works for your application, it makes sense to continue using the workaround on the client side.
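The mixed-script caveat can be illustrated with a hypothetical example: a Greek SMS that legitimately contains a Latin acronym. A blanket character mapping also "fixes" the acronym, corrupting it (the sample text is invented for illustration).

```python
# The same Greek homoglyph table as in the workaround above
table = str.maketrans("EPTYIOAHKZXBNM", "ΕΡΤΥΙΟΑΗΚΖΧΒΝΜ")

# Hypothetical mixed-script SMS: "HTML" is a genuine Latin acronym and
# should stay Latin, but the mapping converts H, T and M to Greek lookalikes
text = "KAΛHΣΠEPA HTML!"
print(text.translate(table))  # ΚΑΛΗΣΠΕΡΑ ΗΤΜL! -- acronym mangled
```

This is why applying the fix blindly on the server side is risky: the client often knows whether its input really is single-script SMS text, while the API cannot.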

I’ll keep this issue open in case other users encounter the same situation and your workaround can assist them. We may address this case in our models in future, however I don’t have a timeline for that.