DeepLcom / deepl-python

Official Python library for the DeepL language translation API.
MIT License

translate_text corrupts HTML #72

Open pbtsrc opened 1 year ago

pbtsrc commented 1 year ago

text=

<html>
<body>
  <div>
    <a href="01.html">Chapter I. Margaret Makes Herself at Home</a>
  </div>
  <div>
    <a href="02.html">Chapter II. Stephen's Life Goes On</a>
  </div>
</body>
</html>

Calling translate_text(text, source_lang='EN', target_lang='DE', tag_handling='html') on the text above returns:

<html>
<body>
  <div>
   <a href="01.html">Kapitel I. Margaret macht es sich gemüt</a>lich  </div>
  <div>
   <a href="02.html">Kapitel II. Stephens Leben geht</a>weiter  </div>
</body>
</html>

As you can see, the end of each link text (lich, weiter) has been pushed outside the <a> element. With tag_handling='xml', everything works as expected:

<html>
<body>
  <div>
    <a href="01.html">Kapitel I. Margaret macht es sich gemütlich</a>
  </div>
  <div>
    <a href="02.html">Kapitel II. Stephens Leben geht weiter</a>
  </div>
</body>
</html>

Replacing <div> with <p> also avoids the issue.
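
For anyone who wants to check their own output for this pattern, here is a minimal sketch using only the standard library. It reports non-whitespace text that directly follows a closing tag, which is where "lich" and "weiter" ended up above. The names TailTextFinder and stray_tails are made up for this example and are not part of deepl-python. Text after a link is perfectly valid HTML in general, so the result is only meaningful when compared against the source, where it should already be empty.

```python
from html.parser import HTMLParser


class TailTextFinder(HTMLParser):
    """Collect non-whitespace text that directly follows a closing tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.just_closed = False  # True right after </tag>, until the next event
        self.tails = []

    def handle_starttag(self, tag, attrs):
        self.just_closed = False

    def handle_endtag(self, tag):
        self.just_closed = (tag == self.tag)

    def handle_data(self, data):
        # Only text immediately after </tag> counts as a stray tail.
        if self.just_closed and data.strip():
            self.tails.append(data.strip())
        self.just_closed = False


def stray_tails(html, tag="a"):
    """Return the text fragments found directly after each </tag>."""
    finder = TailTextFinder(tag)
    finder.feed(html)
    finder.close()
    return finder.tails
```

Running stray_tails on the source text and on the translated text, then comparing the two lists, flags a translation whose list is no longer empty.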

pbtsrc commented 1 year ago

Another example. text=

<p>1-<i>London, Paris</i></p>

translate_text returns:

<p>1-London<i>, Paris</i></p>

The result is the same with tag_handling='html' and tag_handling='xml'.

seekuehe commented 1 year ago

@pbtsrc By chance, are you using both tag_handling and preserve_formatting parameters?

pbtsrc commented 1 year ago

No, I did not use preserve_formatting. I tried to add this parameter, but it did not change anything.
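
To make the parameter comparison discussed in this thread reproducible, the following sketch runs one input through every combination of tag_handling and preserve_formatting. The helper run_matrix is hypothetical and written for this example. It takes the translate function as an argument so it can be tried with a stub, and the language pair EN to DE is taken from the first example (the second example does not state its languages).

```python
from itertools import product


def run_matrix(translate, text):
    """Call `translate` once per parameter combination and collect the results.

    `translate` is any callable accepting the keyword arguments used below,
    for example the translate_text method of a deepl.Translator instance.
    """
    results = {}
    for tag_handling, preserve in product(("html", "xml"), (False, True)):
        result = translate(
            text,
            source_lang="EN",
            target_lang="DE",
            tag_handling=tag_handling,
            preserve_formatting=preserve,
        )
        # deepl returns a TextResult with a .text attribute; plain strings
        # (for example from a stub) are stored as they are.
        results[(tag_handling, preserve)] = getattr(result, "text", result)
    return results


# Usage against the real API (requires the deepl package and an auth key):
#
#   import deepl
#   translator = deepl.Translator("YOUR_AUTH_KEY")
#   for params, output in run_matrix(translator.translate_text, text).items():
#       print(params, output, sep="\n")
```

Printing the four outputs side by side should show whether preserve_formatting has any effect on where the tags land.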