DeepLcom / deepl-python

Official Python library for the DeepL language translation API.
MIT License

Translation with glossary and target "EN-GB" loses some words #111

Open EnricoPicci opened 1 week ago

EnricoPicci commented 1 week ago

I have a text to translate from Italian to English:

text_to_translate = "| \\_VOEMI | Data emissione operazione | Deve essere maggiore o uguale alla data di emissione della polizza e minore o uguale alla data di sistema. |"

I also have a glossary I want to use:

entries = {"Fattore": "Variable", "Data emissione": "Issuance date"}
my_glossary = translator.create_glossary(
    "My glossary",
    source_lang="IT",
    target_lang="EN",
    entries=entries,
)

If I translate the text with target "EN-GB", I get this result:

| Issuance date | Must be greater than or equal to the policy issue date and less than or equal to the system date. |

The issue here is that the part `| \\_VOEMI` gets lost.

However, if I specify "EN-US" as the target language, I get this correct result:

| | \_VOEMI | Issuance date transaction | Must be greater than or equal to the policy issue date and less than or equal to the system date. |

JanEbbing commented 1 week ago

I'm not 100% sure what your use case is, but you will get the highest possible translation quality by parsing structured data like this before feeding it into the API. For example, in your case:

import deepl

text_to_translate = "| \\_VOEMI            | Data emissione operazione | Deve essere maggiore o uguale alla data di emissione della polizza e minore o uguale alla data di sistema.     |"
special_tokens = ["\\_"]
delimiter = "|"
translator = deepl.Translator(...)

translated_texts = []
for text in text_to_translate.split(delimiter):
    if not text.strip() or any(tok in text for tok in special_tokens):
        # Pass empty cells and cells containing special tokens through unchanged.
        translated_texts.append(text)
    else:
        # You might want to strip the whitespace here as well with text.strip(),
        # and fill the missing whitespace back in when appending to
        # translated_texts, as this looks like a table.
        translated_texts.append(translator.translate_text(text, ...).text)
output = delimiter.join(translated_texts)
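The snippet above calls the live API, so as a runnable illustration of the same split-and-rejoin idea, here is a self-contained sketch in which the DeepL call is replaced by a stub that just upper-cases each cell (an assumption purely for demonstration; in real use you would call `translator.translate_text(...)` there instead):

```python
text_to_translate = ("| \\_VOEMI | Data emissione operazione | Deve essere "
                     "maggiore o uguale alla data di emissione della polizza "
                     "e minore o uguale alla data di sistema. |")
special_tokens = ["\\_"]
delimiter = "|"

def fake_translate(text: str) -> str:
    # Stand-in for translator.translate_text(text, ...).text, so the
    # filtering logic can be exercised without API credentials.
    return text.upper()

translated_texts = []
for text in text_to_translate.split(delimiter):
    # Pass empty cells and cells containing special tokens through unchanged.
    if not text.strip() or any(tok in text for tok in special_tokens):
        translated_texts.append(text)
    else:
        translated_texts.append(fake_translate(text))

output = delimiter.join(translated_texts)
print(output)
```

The `\\_VOEMI` cell and the table's delimiter structure survive unchanged, while the translatable cells are rewritten.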

Due to the nature of ML models, we otherwise cannot guarantee that the output is stable or that these kinds of tokens are preserved. You can also take a look at ignore tags as another option.
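For the ignore-tags route, the pre-processing is plain string work: wrap each token you want preserved in a tag, then tell the API to ignore that tag. A minimal sketch, assuming the tag name `x` is free to use (the `tag_handling`/`ignore_tags` parameters are part of the DeepL XML handling feature, but the helper below is only an illustration):

```python
def protect_tokens(text: str, tokens: list[str], tag: str = "x") -> str:
    # Wrap each protected token in an ignore tag so DeepL leaves it
    # untouched when called with tag_handling="xml" and ignore_tags=[tag].
    for tok in tokens:
        text = text.replace(tok, f"<{tag}>{tok}</{tag}>")
    return text

protected = protect_tokens("| \\_VOEMI | Data emissione |", ["\\_VOEMI"])
print(protected)

# The actual call would then look something like (assuming a configured
# Translator instance):
# result = translator.translate_text(
#     protected, source_lang="IT", target_lang="EN-GB",
#     tag_handling="xml", ignore_tags=["x"],
# )
```

You would strip the tags back out of `result.text` afterwards if you don't want them in the final output.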

EnricoPicci commented 1 week ago

Jan, thanks for your prompt response. I will implement your suggestions. That said, the different behaviour between "EN-GB" and "EN-US" is interesting.