DeepLcom / deepl-php

Official PHP library for the DeepL language translation API.
MIT License

Fix `JsonException: Malformed UTF-8 characters, possibly incorrectly encoded` #43

Closed VincentLanglet closed 4 months ago

VincentLanglet commented 6 months ago

Hi @JanEbbing @daniel-jones-deepl,

We recently encountered an issue with a text input which, after DeepL translation, cannot be json_decoded by the library. I created a reproducer of the issue. Notice that this only occurs with the option

TranslateTextOptions::TAG_HANDLING => 'xml',

This is how the input is rendered in my IDE:

image

The test added is failing with the error:

JsonException: Malformed UTF-8 characters, possibly incorrectly encoded

This is especially annoying because when translating a payload of 1000 texts, if one of them contains such a character, the whole payload fails and no text is translated (when 999 could have been).

Can something be done about this?

Thanks

JanEbbing commented 6 months ago

Hi, @VincentLanglet - I'm still investigating this but it might be an issue with PHP/fixable in the library. The equivalent curl command returns a valid response (though the tag handling is a bit messed up).

$ curl -X POST 'https://api.deepl.com/v2/translate' \
--header 'Authorization: DeepL-Auth-Key MYKEY' \
--header 'Content-Type: application/json' \
--data '{"text": ["Portal<span></span>"], "target_lang": "FR", "source_lang":"EN", "tag_handling":"xml", "ignore_tags":["notranslate"]}'

{"translations":[{"detected_source_language":"EN","text":"Portail<span>/span&gt;"}]}

I tried logging the request/response that gets sent over the wire via PHP yesterday but the logger I used modified the data, so I'm checking with a new one now.

VincentLanglet commented 6 months ago

> Hi, @VincentLanglet - I'm still investigating this but it might be an issue with PHP/fixable in the library. The equivalent curl command returns a valid response (though the tag handling is a bit messed up).
>
> I tried logging the request/response that gets sent over the wire via PHP yesterday but the logger I used modified the data, so I'm checking with a new one now.

When I logged the \CURLOPT_POSTFIELDS generated in HttpClientWrapper::urlEncodeWithRepeatedParams, I got

target_lang=fr&source_lang=en&text=Portal%3Cspan%3E%EE%A0%83%3C%2Fspan%3E&tag_handling=xml&ignore_tags=notranslate

Notice that without tag_handling, the request with

target_lang=fr&source_lang=en&text=Portal%3Cspan%3E%EE%A0%83%3C%2Fspan%3E&ignore_tags=notranslate

works.
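To make the logged body easier to read, here is a small sketch that decodes the `text` parameter from the request above and checks which character sits between the tags. It only uses standard PHP functions and assumes the mbstring extension and PHP 7.2+ (for `mb_ord`); the variable names are illustrative.

```php
<?php
// The text parameter exactly as it appears in the logged request body.
$encoded = 'Portal%3Cspan%3E%EE%A0%83%3C%2Fspan%3E';
$text = urldecode($encoded);

// The three bytes EE A0 83 should form one complete UTF-8 sequence.
$isValidUtf8 = mb_check_encoding($text, 'UTF-8');

// Extract the single character between the tags and format its code point.
preg_match('~<span>(.*)</span>~u', $text, $matches);
$codePoint = sprintf('U+%04X', mb_ord($matches[1], 'UTF-8'));

var_dump($isValidUtf8); // bool(true)
echo $codePoint, PHP_EOL; // U+E803, a Private Use Area code point
```

So the input sent by the library is valid UTF-8; the character is simply one from the Private Use Area, which is why it renders oddly in an IDE.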

When looking at sendCustomHttpRequest, the content type used is

$headers['Content-Type'] = 'application/x-www-form-urlencoded';

Maybe 'Content-Type: application/json' would be better instead, if it works with JSON? Or maybe the API lib needs better support for form-urlencoded data?

daniel-jones-dev commented 6 months ago

Hi @VincentLanglet, I was sick until yesterday, looking into this now too.

I don't think this is likely a problem of PHP, as I can reproduce it with our Python library too. Nor is it likely a problem with sending a JSON-encoded request or URL-encoded request; I could reproduce the response in both cases.

It seems to be caused because this input (combined with XML tag-handling) triggers some unusual case, and our API response includes an invalid UTF-8 sequence: \xEE\xA0 rather than \xEE\xA0\x83. I've forwarded the issue to our backend teams.
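The effect of the truncated sequence can be reproduced without calling the API. The sketch below uses a hypothetical, hand-written response body (not an actual API response) in which `\xEE\xA0\x83` has been cut to `\xEE\xA0`, and compares it with the complete sequence. It assumes PHP 7.3+ for `JSON_THROW_ON_ERROR`; `tryDecode` is an illustrative helper, not part of the library.

```php
<?php
// Hypothetical response bodies: one with the truncated sequence, one complete.
$truncated = "{\"translations\":[{\"text\":\"Portail\xEE\xA0\"}]}";
$complete  = "{\"translations\":[{\"text\":\"Portail\xEE\xA0\x83\"}]}";

// Illustrative helper: returns 'ok' or the JsonException message.
function tryDecode(string $json): string
{
    try {
        json_decode($json, true, 512, JSON_THROW_ON_ERROR);
        return 'ok';
    } catch (JsonException $e) {
        return $e->getMessage();
    }
}

// EE A0 announces a 3-byte character but is followed by a quote, not a continuation byte.
$truncatedIsValid = mb_check_encoding($truncated, 'UTF-8');
$truncatedResult = tryDecode($truncated);
$completeResult = tryDecode($complete);

var_dump($truncatedIsValid);
echo $truncatedResult, PHP_EOL;
echo $completeResult, PHP_EOL;
```

Because `json_decode` rejects the whole document, a single truncated character is enough to make every translation in the same response unreadable.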

VincentLanglet commented 4 months ago

> It seems to be caused because this input (combined with XML tag-handling) triggers some unusual case, and our API response includes an invalid UTF-8 sequence: \xEE\xA0 rather than \xEE\xA0\x83. I've forwarded the issue to our backend teams.

Thanks @daniel-jones-dev, any idea when it will be solved? We're still getting the error and it causes a lot of our API calls to fail.

daniel-jones-dev commented 4 months ago

Hi @VincentLanglet, the backend team has looked into the cause of this issue; unfortunately it will not be easily fixed.

In the meantime, I wonder if a workaround in this library could help you: we could suppress these invalid UTF-8 sequences by replacing them with the replacement character “�” (U+FFFD), this would at least allow you to use the other requests. Do you think this would help you?

VincentLanglet commented 4 months ago

> In the meantime, I wonder if a workaround in this library could help you: we could suppress these invalid UTF-8 sequences by replacing them with the replacement character “�” (U+FFFD), this would at least allow you to use the other requests. Do you think this would help you?

Sure, I talked with our team and it would help a lot (as a first step until the issue is fixed on the API side).

What implementation did you have in mind? I think that

mb_substitute_character(0xFFFD);
$content = mb_convert_encoding($content, 'UTF-8', 'UTF-8');

could do the job.
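As a rough sketch of how those two lines might be wrapped, the example below sanitizes a hypothetical response body before decoding it. The function name `replaceInvalidUtf8` and the sample body are assumptions made for illustration; this is not necessarily how the library implements the workaround. Since `mb_substitute_character()` changes process-wide state, the sketch restores the previous setting afterwards.

```php
<?php
// Hypothetical helper: replace invalid UTF-8 sequences with U+FFFD.
function replaceInvalidUtf8(string $content): string
{
    // Remember the current substitute character so it can be restored.
    $previous = mb_substitute_character();
    mb_substitute_character(0xFFFD);
    try {
        // Converting UTF-8 to UTF-8 keeps valid sequences and substitutes invalid ones.
        return mb_convert_encoding($content, 'UTF-8', 'UTF-8');
    } finally {
        mb_substitute_character($previous);
    }
}

// Hand-written body containing the truncated sequence \xEE\xA0 (not a real API response).
$body = "{\"translations\":[{\"text\":\"Portail\xEE\xA0\"}]}";

$sanitized = replaceInvalidUtf8($body);
$decoded = json_decode($sanitized, true, 512, JSON_THROW_ON_ERROR);

echo $decoded['translations'][0]['text'], PHP_EOL;
```

The affected text loses the broken character, but the rest of the response, including all other translations in the same payload, can be decoded.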

daniel-jones-dev commented 4 months ago

Thanks @VincentLanglet, your change for the workaround looks good. We need to check some internal tests and then we should be able to merge this tomorrow.

The backend team will still work on fixing the issue in the API.

daniel-jones-dev commented 4 months ago

Workaround is published in v1.7.2