fsr-de / myHPI

Django/Wagtail page serving myhpi.de
https://myhpi.de

Formatting within a word is not supported by translations #544

Open lukasrad02 opened 6 months ago

lukasrad02 commented 6 months ago

When a translation contains inline formatting that is not surrounded by spaces, e.g. H<strong>e</strong>llo, spaces are added around the formatted text automatically when the content is converted back to Markdown.

Translation editor: (image)

Rendered page: (image)

dropforge commented 6 months ago

Is this due to newlines being added? What is the HTML output?

lukasrad02 commented 6 months ago

No newlines are added, just spaces.

The HTML passed to html2text (see https://github.com/fsr-de/myHPI/blob/72588358ea069005922a8b3dd08dffca0ac34db5/myhpi/core/markdown/fields.py#L33) is identical to the HTML entered in the translation editor.

I think (but haven't verified this yet) that html2text parses the whole HTML input into some AST-like structure that does not preserve formatting and uses some generic formatting rules when rewriting it as markdown, thus adding the spaces.
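
For contrast, here is a minimal sketch of a converter that copies text verbatim and only swaps inline tags for Markdown markers. It uses Python's standard `html.parser`, not html2text; the `INLINE_MARKS` mapping and the names `VerbatimInlineConverter` and `inline_html_to_markdown` are made up for illustration. It only shows that the extra spaces are not inherent to converting HTML to Markdown, and it says nothing about how html2text works internally.

```python
from html.parser import HTMLParser

# Hypothetical tag-to-marker mapping for this sketch; not html2text's rule set.
INLINE_MARKS = {"strong": "**", "b": "**", "em": "*", "i": "*"}


class VerbatimInlineConverter(HTMLParser):
    """Emit Markdown markers for known inline tags and copy text unchanged."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Unknown tags contribute nothing; their text content is still kept.
        self.parts.append(INLINE_MARKS.get(tag, ""))

    def handle_endtag(self, tag):
        self.parts.append(INLINE_MARKS.get(tag, ""))

    def handle_data(self, data):
        # Text is appended exactly as parsed; no spaces are added or stripped.
        self.parts.append(data)


def inline_html_to_markdown(html):
    converter = VerbatimInlineConverter()
    converter.feed(html)
    converter.close()  # flushes any text buffered after the last tag
    return "".join(converter.parts)


print(inline_html_to_markdown("H<strong>e</strong>llo"))  # H**e**llo
```

This handles only flat inline formatting; links, block elements, and escaping of Markdown special characters are left out.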

dropforge commented 6 months ago

Is it viable to switch from html2text to a library that translates the source directly as Markdown? @jeriox Some considerations for that:

  1. We might have to split the source into segments ourselves then, e.g. paragraphs, list items, etc.
  2. DeepL intelligently translates link descriptions in HTML, e.g. moving the semantically equivalent parts of a sentence into / out of <a> tags.
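
A rough sketch of what the segment splitting from point 1 could look like, assuming blank lines separate paragraphs and each list item becomes its own segment. The function name `split_markdown_segments` and the `LIST_ITEM` pattern are hypothetical, and nested lists, code fences, and headings are not handled.

```python
import re

# Matches the start of a bullet ("-", "*", "+") or numbered ("1.") list item.
LIST_ITEM = re.compile(r"^\s*(?:[-*+]|\d+\.)\s+")


def split_markdown_segments(source):
    """Split Markdown into paragraph and list-item segments for translation."""
    segments = []
    current = []

    def flush():
        # Close the segment collected so far, if there is one.
        if current:
            segments.append("\n".join(current))
            current.clear()

    for line in source.splitlines():
        if not line.strip():
            flush()  # a blank line ends the current segment
        elif LIST_ITEM.match(line):
            flush()  # each list item starts a new segment
            current.append(line)
        else:
            current.append(line)  # continuation of a paragraph or list item
    flush()
    return segments


example = "Intro text\nmore intro\n\n- first item\n- second item\n\nOutro"
for segment in split_markdown_segments(example):
    print(repr(segment))
```

Each segment could then be sent for translation on its own and the results joined back together in the same order.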

jeriox commented 6 months ago

@dropforge I think it would be feasible, and given how many problems the HTML representation has already caused, I think it would be a good way forward. Back when we implemented the prototype/MVP, it worked well enough, so we decided to go with it as it was quicker. If you are willing to do a deep dive on that, I'd highly appreciate it!