WeblateOrg / weblate

Web based localization tool with tight version control integration.
https://weblate.org/
GNU General Public License v3.0
4.6k stars, 1.02k forks

Auto Translation using Google Translate V3 doesn't preserve New Line characters #10145

Closed. ChiragMoradiya closed this issue 11 months ago

ChiragMoradiya commented 1 year ago

Describe the issue

When a string contains a newline character (\n), the "Google Translate V3" auto-translation replaces the newline character with a single space.

AWS translation, however, preserves newline characters properly.

See the screenshots for reference.

I already tried

Steps to reproduce the behavior

Expected behavior

The auto-translation into Hindi using "Google Translate V3" should be हैलो\nवर्ल्ड, with the newline preserved.

Screenshots

Google Translate V3 API: [screenshot]

AWS API: [screenshot]

Exception traceback

No response

How do you run Weblate?

Docker container

Weblate versions

Weblate deploy checks

Output of `docker compose exec --user weblate weblate weblate check --deploy`

System check identified some issues:

INFOS:
?: (weblate.I021) Error collection is not set up, it is highly recommended for production use
    HINT: https://docs.weblate.org/en/weblate-4.17/admin/install.html#collecting-errors
?: (weblate.I028) Backups are not configured, it is highly recommended for production use
    HINT: https://docs.weblate.org/en/weblate-4.17/admin/backup.html
?: (weblate.I031) New Weblate version is available, please upgrade to 5.0.2.
    HINT: https://docs.weblate.org/en/weblate-4.17/admin/upgrade.html

System check identified 3 issues (1 silenced).

Additional context

There seems to be an issue with the REST API invocation in https://github.com/WeblateOrg/weblate/blob/c7915cc0954da39169621ce3bbfa19ba189583fe/weblate/machinery/googlev3.py#L63C16-L63C16

It sends the whole string as a single element of the array, e.g. ["Hello\nWorld"].

If it instead sent the request as multiple array elements, split on newline characters (e.g. ["Hello","World"]), and then joined the response elements back together with newline characters, this issue might be resolved.

NOTE: There is an additional caveat: if the request array contains any blank string, Google treats it as an invalid request. So such elements should be removed from the request array.
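A minimal sketch of the split-and-join idea, not Weblate's actual code. The function name `translate_preserving_newlines` and the `translate_batch` callable are hypothetical; `translate_batch` stands in for whatever sends a list of strings to the translation service and returns the translated strings in the same order. Blank lines are kept out of the request and restored afterwards.

```python
from typing import Callable, List


def translate_preserving_newlines(
    text: str,
    translate_batch: Callable[[List[str]], List[str]],
) -> str:
    """Translate text line by line and restore the original line breaks.

    translate_batch is a hypothetical stand-in for the real API call.
    """
    lines = text.split("\n")
    # Remember which lines have content; blank strings are not sent,
    # because they reportedly make the whole request invalid.
    indexes = [i for i, line in enumerate(lines) if line.strip()]
    if not indexes:
        return text
    translated = translate_batch([lines[i] for i in indexes])
    if len(translated) != len(indexes):
        raise ValueError("backend returned an unexpected number of segments")
    # Put each translated segment back at its original position.
    result = list(lines)
    for i, segment in zip(indexes, translated):
        result[i] = segment
    return "\n".join(result)


def fake_backend(segments: List[str]) -> List[str]:
    # Stand-in backend for illustration only: upper-cases each segment.
    return [segment.upper() for segment in segments]


print(repr(translate_preserving_newlines("Hello\n\nWorld", fake_backend)))
```

Each line is translated as an independent segment here, so context that spans a line break may be lost.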

nijel commented 1 year ago

Splitting the text needs to be evaluated as well, as it might negatively impact translation if Google treats the segments as independent. For example, in Czech this should be translated as "Ahoj\nsvěte", while the independent words would be translated as "Ahoj" and "svět".

Meanwhile, I've added highlighting of whitespace in machine translation results in https://github.com/WeblateOrg/weblate/pull/10147, so that it is clearly visible.

github-actions[bot] commented 1 year ago

This issue has been put aside. It is currently unclear if it will ever be implemented as it seems to cover too narrow of a use case or doesn't seem to fit into Weblate.

Please try to clarify the use case or consider proposing something more generic to make it useful to more users.

ChiragMoradiya commented 1 year ago

So it looks like this should be reported as a bug to the Google Translate API service instead of here.

nijel commented 1 year ago

I don't think it's realistic to expect machine translation to always keep newlines at the right place.

ChiragMoradiya commented 1 year ago

I agree it won't always keep them. But the issue here is that it always removes them; they are never preserved.

We have added many Markdown texts as Weblate strings, and because newline characters are not preserved, their formatting breaks under auto-translation.

pickfire commented 11 months ago

How we did it in another project: https://github.com/arvin-pantas/django-autotranslate/commit/85f0d8d6567070411b8abefa0c32391f2f52fb47

nijel commented 11 months ago

@pickfire Thanks, that is useful!