DeepLcom / deepl-python

Official Python library for the DeepL language translation API.
MIT License

Abnormal strings in translation results: ãã£ã£ãç§ã¡ã¡åã #95

Open gouyuwang opened 4 months ago

gouyuwang commented 4 months ago

The yellow text is the erroneous translation result; the white text is the original text. The API used is: https://api.deepl.com/v2/translate

JanEbbing commented 4 months ago

Hi, could you please share the code you are running to get this result, and paste the input as text rather than as a screenshot, so that I can reproduce it? This might be an issue with the encoding used.

gouyuwang commented 4 months ago

Hi, @JanEbbing

func DeepLTranslate(srcLang, targetLang, text string) (string, error) {
    authKey := ""
    urlValues := url.Values{}
    urlValues.Add("auth_key", authKey)
    urlValues.Add("target_lang", targetLang)
    urlValues.Add("text", text)
    if len(srcLang) > 0 {
        urlValues.Add("source_lang", srcLang)
    }
    resp, err := ctxhttp.PostForm(context.Background(), &http.Client{}, "https://api.deepl.com/v2/translate", urlValues)
    if err != nil {
        return "", err
    }
    defer func(Body io.ReadCloser) {
        err := Body.Close()
        if err != nil {
            fmt.Printf("DeepL translate error: %+v\n", err)
        }
    }(resp.Body)

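    if resp.StatusCode != http.StatusOK {
        // err is nil at this point, so report the non-200 status explicitly
        return "", fmt.Errorf("DeepL translate: unexpected status %s", resp.Status)
    }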

    body, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        return "", err
    }

    result := map[string][]map[string]string{}
    err = jsoniter.Unmarshal(body, &result)
    if err != nil {
        return "", err
    }
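    // Guard against a missing or empty "translations" array before indexing
    translations := result["translations"]
    if len(translations) == 0 {
        return "", fmt.Errorf("DeepL translate: no translations in response")
    }
    return translations[0]["text"], nil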
}

original text: 物を言ってるような感覚になってしまう でも海外からはアイコンタクトはとても 大事っていう形でまあそこの中でもいろ んなトラブルが起こってきたんですねな ぜかっていうとあの私たちの先生たちは アイコンタクトをちゃんと持ってあの授 業をするわけなんですけれどもある方が ちょっと誤解してしまってそれは恋愛感 情で自分を見つめられてるっていうよう な形からあのちょっと本当に大きなトラ ブルになってきたことがありましたでそ ういうところを通しながらまたある時に はですねあの謝るとか感謝するとかって いうところが日本人は何回も何回もあの するっていう習慣があると思うんですね でそれを例えば1週間後に例えば感謝の 気持ちを表さなかったって言ったら私た ちの先生がええあの感謝の気持ちが足り ないんじゃないかみたいな形で誤解し

This text was extracted from the screenshot by OCR and may differ from the actual input.

I wonder if it's caused by illegal characters in the content, such as the "?" visible in the image.

JanEbbing commented 4 months ago

From which language to which language are you translating this input? I can translate it fine into British English and Chinese (the output ends with "���� ", which was already present in the source text). I think the most likely culprit is the encoding: our API returns UTF-8 encoded strings, and your system may default to a different encoding when decoding what the API returns, which produces these odd characters. To fix that, I'd need to know where you run the translation (a terminal shell, a web server, etc.), how you display the result, and so on.
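
To illustrate the kind of encoding mismatch described here, the following is a minimal Go sketch, not the code from this issue. It assumes the consumer wrongly treats UTF-8 bytes as a single-byte encoding (ISO-8859-1 is used as an example; the actual encoding involved is unknown). The helper name decodeAsLatin1 is hypothetical.

```go
package main

import "fmt"

// decodeAsLatin1 maps every byte to one character, which is what happens
// when UTF-8 data is read as ISO-8859-1 (an assumed example of a wrong decode).
func decodeAsLatin1(b []byte) string {
	runes := make([]rune, len(b))
	for i, c := range b {
		runes[i] = rune(c)
	}
	return string(runes)
}

func main() {
	utf8Bytes := []byte("私たち") // Go string literals are UTF-8 encoded

	// Correct decode: the bytes are interpreted as UTF-8.
	fmt.Printf("as UTF-8:   %s\n", string(utf8Bytes))

	// Wrong decode: each of the 9 bytes becomes its own character,
	// producing sequences that start with "ç§" and "ã".
	fmt.Printf("as Latin-1: %q\n", decodeAsLatin1(utf8Bytes))
}
```

Each Japanese character here occupies three bytes in UTF-8, so a single-byte decode yields three unrelated characters per original character, similar in shape to the fragments in the issue title.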

gouyuwang commented 4 months ago

The Go service is deployed on a Linux machine in a server room in Hong Kong. The client uploads the original audio as a speech stream, our speech recognition service generates the text, the DeepL translation API is then called over HTTP, and the translation result is passed back to the Chrome client via WebSocket. The problem is infrequent: it has occurred about twice in six months. @JanEbbing

JanEbbing commented 4 months ago

Yes, but how and where is the text rendered? As I've shown, the API returns the characters correctly encoded in UTF-8. If you get artifacts like the ones in the screenshot, it is most likely an encoding issue somewhere in this pipeline.
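
One possible way to narrow down which stage of such a pipeline damages the text is to validate the string at each hand-off. The sketch below is an assumption-laden illustration, not part of the reporter's service: the function checkStage and the stage names are hypothetical, and only the standard library calls utf8.ValidString and strings.Contains are used.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// checkStage returns false and logs the text if it is not valid UTF-8 or if
// it already contains U+FFFD, the replacement character left by a lossy decode.
func checkStage(stage, text string) bool {
	valid := utf8.ValidString(text)
	hasReplacement := strings.Contains(text, "\uFFFD")
	if !valid || hasReplacement {
		fmt.Printf("%s: valid=%t hasReplacement=%t text=%q\n",
			stage, valid, hasReplacement, text)
		return false
	}
	return true
}

func main() {
	full := "誤解し"
	// Dropping one byte cuts the final 3-byte character in half,
	// as could happen if a byte stream is split at an arbitrary offset.
	truncated := full[:len(full)-1]

	checkStage("asr-output", full)                           // passes silently
	checkStage("asr-output", truncated)                      // invalid UTF-8
	checkStage("deepl-response", "misunderstanding \uFFFD") // damaged upstream
}
```

Calling a check like this on the speech recognition output before the HTTP request, and again on the API response before the WebSocket send, would show whether the text is already damaged before it reaches DeepL.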