Stichoza / google-translate-php

🔤 Free Google Translate API PHP Package. Translates totally free of charge.
MIT License

Low quality translation compared to google live translator #163

Closed: fissben closed this issue 2 years ago

fissben commented 3 years ago

I noticed that the current version of this library no longer translates accurately. It looks like this started a few weeks ago.

For example, I'm trying to translate this phrase from "en" to "ru": My apologies about my messages, hope they weren't too inconvenient. Hope everything will get back to normal soon.

Here is what I got from Google in the browser: Приношу свои извинения по поводу моих сообщений, надеюсь, они не были слишком неудобными. Надеюсь, что скоро все вернется на круги своя.

The library, however, translates it like this: Мои извинения о моих сообщениях, надеюсь, они не были слишком неудобны. Надеюсь, что все скоро вернется к нормам.

This is a much more literal translation.

Any thoughts?

Stichoza commented 3 years ago

Well, that's strange. Google sometimes provides multiple translations. I'll debug and see how to get the most relevant one.

fissben commented 3 years ago

They also changed how the client talks to their server. There is a new URL: https://translate.google.com/_/TranslateWebserverUi/data/batchexecute

I found only one user agent that gives the same result as this library: Mozilla/5.0 (compatible; MSIE 10.0; Windows Phone 8.0; Trident/6.0; IEMobile/10.0; ARM; Touch; NOKIA; Lumia 520)

But even this one works through the new endpoint.

Blair2004 commented 3 years ago

Hi, that's also what I've noticed.

The Google Translate website knows some common brand terms, like WordPress, Elementor, etc. But the Google endpoint used by the package doesn't know them and will try to convert them into a random similar term. For example, I tried to translate the sentence "How to create a mega menu with Elementor", and it converted Elementor to "Emoror", "Emerer", "Elementaire", "Element"... which don't mean anything in the destination language (French).

I tried wrapping terms that shouldn't be translated in a tag with class "notranslate", but the translation is even worse.

Stichoza commented 3 years ago

I looked through the response coming from the server when using the current URL. It does come with multiple versions of the translation, but none of them are as good as the ones produced by the Google Translate website, which uses the new URL that @fissben mentioned.

I guess we'll have to reverse engineer the new algorithm (cookies, etc). I'll post more updates here.

fissben commented 3 years ago

Here is a working example of the current endpoint (curl):

curl --location --request POST 'https://translate.google.com/_/TranslateWebserverUi/data/batchexecute?rpcids=MkEWBc&hl=ru&soc-app=1&soc-platform=1&soc-device=1&_reqid=53165&rt=c' \
--header 'authority: translate.google.com' \
--header 'pragma: no-cache' \
--header 'cache-control: no-cache' \
--header 'sec-ch-ua: "Google Chrome";v="89", "Chromium";v="89", ";Not A Brand";v="99"' \
--header 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 11_1_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36' \
--header 'content-type: application/x-www-form-urlencoded;charset=UTF-8' \
--header 'origin: https://translate.google.com' \
--header 'referer: https://translate.google.com/' \
--header 'accept-language: en-US,en;q=0.9,ru-UA;q=0.8,ru;q=0.7,ja-JP;q=0.6,ja;q=0.5,zh-CN;q=0.4,zh-TW;q=0.3,zh;q=0.2,uk;q=0.1' \
--header 'Cookie: NID=213=jMxpp4AcB9CbhtqMEgj78zOxP-71uc_Q_ku6ov-Ffd9FJYrCtiF5xLiWOBZtmQnBnvOXFJMY9qOjEBIA1o5HjiJwWZNisKzNHRO2ekwlsIfQJLsVMdaCBV0X_tNl4QVHbu6sWYniCdkXjDtVMjwID7EAtwTD2WpnD4p_Pr6F48hb_ffQMYXaWYNQDxgmb30jgTi4u0vLfaE1KddtC7E' \
--data-raw 'f.req=%5B%5B%5B%22MkEWBc%22%2C%22%5B%5B%5C%22%D0%9F%D0%B5%D1%80%D1%88%D0%B8%D0%B9%20%D0%BD%D0%B0%D1%86%D1%96%D0%BE%D0%BD%D0%B0%D0%BB%D1%8C%D0%BD%D0%B8%D0%B9%20%D0%BE%D0%BD%D0%BB%D0%B0%D0%B9%D0%BD%20%D0%BF%D0%B5%D1%80%D0%B5%D0%BA%D0%BB%D0%B0%D0%B4%D0%B0%D1%87%5C%22%2C%5C%22uk%5C%22%2C%5C%22en%5C%22%2Ctrue%5D%2C%5Bnull%5D%5D%22%2Cnull%2C%22generic%22%5D%5D%5D&at=AD08yZn3jSHJ2pLXRNZ-gYpVGrLd%3A1618314364485&'

After a short investigation I've found that the last part of the payload, at=AD08yZn3jSHJ2pLXRNZ-gYpVGrLd%3A1618314364485&, is the most important. The at parameter is what we are looking for. Somehow it is generated as a hash of the payload and then checked on the backend. Hope it helps with reverse engineering.

ermeh commented 3 years ago

Yes, the quality of translation is rather low compared with Google Translate. Why is that?

henno commented 3 years ago

We noticed this sudden degradation of translation quality a couple of weeks ago as well. Just found this issue. Has anyone run any tests after the 18th of April, or is there any new information as to why the quality of the translations changed suddenly?

Blair2004 commented 3 years ago

When I run @fissben's curl command above now, it still returns output... so probably the value of "at" doesn't expire?

Blair2004 commented 3 years ago

The work now is to extract the translated string from what looks like incomplete JSON returned by Google.

464
[["wrb.fr","MkEWBc","[[\"Pershyy natsionalʹnyy onlayn perekladach\",null,null,[[[0,[[[null,37]\n]\n,[true]\n]\n]\n]\n,37]\n]\n,[[[null,null,null,null,null,[[\"First National On-line Translator\",[\"First National On-line Translator\",\"The first national online translator\"]\n]\n]\n]\n]\n,\"en\",1,\"uk\",[\"Перший національний онлайн перекладач\",\"uk\",\"en\",true]\n]\n]\n",null,null,null,"generic"]
,["di",156]
,["af.httprm",155,"6538938918244503432",158]
]
26
[["e",4,null,null,536]
]
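
The response above looks like a series of chunks, each preceded by a line holding only a number, with the translation nested inside a JSON string in the "wrb.fr" entry. A minimal sketch of pulling the first translation out of such a body could look like the following. The function name extractTranslation is hypothetical, and the index path is read off the single sample above, so it is an assumption about the response shape rather than a documented format.

```php
<?php
declare(strict_types=1);

/**
 * Extract the first translation from a batchexecute-style response body.
 * Returns null when no "wrb.fr" entry with the expected shape is found.
 */
function extractTranslation(string $raw): ?string
{
    // Chunks are separated by lines that contain only a number (e.g. "464").
    $chunks = preg_split('/^\d+\s*$/m', $raw) ?: [];

    foreach ($chunks as $chunk) {
        $envelopes = json_decode(trim($chunk), true);
        if (!is_array($envelopes)) {
            continue; // not a JSON array, skip this chunk
        }
        foreach ($envelopes as $envelope) {
            if (!is_array($envelope)
                || ($envelope[0] ?? null) !== 'wrb.fr'
                || !is_string($envelope[2] ?? null)) {
                continue;
            }
            // The payload is itself a JSON document stored as a string.
            $payload = json_decode($envelope[2], true);
            // Index path taken from the sample response in this thread (assumption).
            $text = $payload[1][0][0][5][0][0] ?? null;
            if (is_string($text)) {
                return $text;
            }
        }
    }

    return null;
}
```

Decoding twice is the key point: the outer chunk is valid JSON once the number lines are stripped, and the third element of the "wrb.fr" entry has to be decoded again to reach the translated text.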

Stichoza commented 3 years ago

so probably the value of "at" doesn't expire?

It's possible that it doesn't expire. However, the value of that parameter differs for each string. It's some kind of hash, but I cannot find out how to generate it.

Blair2004 commented 3 years ago

What about using this?

curl 'https://www.google.com/async/translate?vet=12ahUKEwjT7Maf1O_wAhUDBGMBHdhEBykQqDgwAHoECAIQJg..i&ei=XZqyYJPKMIOIjLsP2ImdyAI&yv=3' \
  -H 'authority: www.google.com' \
  -H 'sec-ch-ua: " Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36' \
  -H 'content-type: application/x-www-form-urlencoded;charset=UTF-8' \
  -H 'accept: */*' \
  -H 'origin: https://www.google.com' \
  -H 'x-client-data: CIe2yQEIpLbJAQipncoBCOH2ygEIqJ3LAQigoMsBCKygywEI8fDLAQiB8ssBCNzyywEIqPPLARiOnssBGJH1ywE=' \
  -H 'sec-fetch-site: same-origin' \
  -H 'sec-fetch-mode: cors' \
  -H 'sec-fetch-dest: empty' \
  -H 'referer: https://www.google.com/' \
  -H 'accept-language: en-US,en;q=0.9,fr-FR;q=0.8,fr;q=0.7,sw-TZ;q=0.6,sw;q=0.5,es;q=0.4,de;q=0.3' \
  -H 'cookie: SEARCH_SAMESITE=CgQIyZIB; SID=-AdTEiYBHQxwyi6tM21CKmx0c4Y1a4q433Cacx-mQACfzwIH0I1fT0wH7pVmUnd_kK_QOA.; __Secure-3PSID=-AdTEiYBHQxwyi6tM21CKmx0c4Y1a4q433Cacx-mQACfzwIHKrDd2zvJ_c5gnPl4a2MMeA.; HSID=Ayll6tzipONjmj1m4; SSID=AqEGOJ5K3d9zhN9JB; APISID=iY7F7EQuWx9eYks1/AAK2jCg-YLchJ2xIx; SAPISID=2ZivtOqq29XgIoHy/ArdWxr5KbTd4RtM9J; __Secure-3PAPISID=2ZivtOqq29XgIoHy/ArdWxr5KbTd4RtM9J; OTZ=5999998_52_52__52_; NID=216=Ni17mzF6uLOBNG4iasK6JP9GjDmN9BbP-VFSNdu6KgFipkAdhdzCVYo9IWOCbkvmHa6HYd7VAaWO40EnGURxQYczydEHbQFatNbk5wDnZwBw0I8aJN8xlpNDynCxs5vHahDdOSFuEt2ppr-BK90W816xk3QOlzDgU1pyHWv0dJqMEVbpSNDIxUCZAJz8GO1oJq5fv1JfJQDYYZ1BJO6EUXww8kdlmGIrNzhmAAvKHUnhu7PKv98OY6EHT39EMC187f1ewAVZV7zlSgcAKNEzgxcFh6PhtMHH6srqOkxkm0E-6oK1l5KBZZSXkvDvDXu_bD-2t8hj0m8-R7hASU5u9AiScP7zjcxumRtpEt1vRA9WHeLCY-EZQ5R8T1A7vpigqpsh9x8O9zOqRkgXcq4R7zL-ww3ohf3chjkQwLX5J9xLnMreSKQ; 1P_JAR=2021-05-29-19; DV=w-LWwaovX_lHMO5HzDUyvBMlILCam9c7hOlmyewo2gAAACAcuD9TMkTrYAAAAGDtw7_cZzBvRwAAAA; UULE=a+cm9sZTogMQpwcm9kdWNlcjogMTIKdGltZXN0YW1wOiAxNjIyMzE3NjY0MTU1MDAwCmxhdGxuZyB7CiAgbGF0aXR1ZGVfZTc6IDM4NDY5NjMyCiAgbG9uZ2l0dWRlX2U3OiAxMTUwMTU2ODAKfQpyYWRpdXM6IDQ3NTA0NDAKcHJvdmVuYW5jZTogNgo=; SIDCC=AJi4QfHkAEuJQkjqHKaVrOSGMBerdz9iiZVsPsE2rw2KWEfGkcMczh3Oo7pwg-Mjmz1EqsE-YrF3; __Secure-3PSIDCC=AJi4QfHUrZjD571gCm5-jqaOQULhDdqmb5ql92leEnpszMcN1eHpL0R-xACOJwmdgoQyIoE1zBBv' \
  --data-raw 'async=translate,sl:fr,tl:en,st:Hello%20There,id:1622317680604,qc:true,ac:true,_id:tw-async-translate,_pms:s,_fmt:pc' \
  --compressed

This is the endpoint used on Google's search results page (SERP).

image

It doesn't seem to have a signature.

Blair2004 commented 3 years ago

Also, I'm using a Google Chrome extension for translating selected text on a web page. By looking at its source code, we can probably see how it works.

image

Blair2004 commented 3 years ago

Here is the request used by the extension. I think it can also be used:

curl 'https://translate.googleapis.com/translate_a/t?anno=3&client=tee&format=html&v=1.0&key&logld=vTE_20210503_00&sl=auto&tl=it&tc=2&sr=1&tk=67691.518207&mode=1' \
  -H 'authority: translate.googleapis.com' \
  -H 'sec-ch-ua: " Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36' \
  -H 'content-type: application/x-www-form-urlencoded' \
  -H 'accept: */*' \
  -H 'origin: https://wptavern.com' \
  -H 'x-client-data: CIe2yQEIpLbJAQipncoBCKidywEIoKDLAQisoMsBCNzyywEIqPPLARiOnssB' \
  -H 'sec-fetch-site: cross-site' \
  -H 'sec-fetch-mode: cors' \
  -H 'sec-fetch-dest: empty' \
  -H 'referer: https://wptavern.com/' \
  -H 'accept-language: en-US,en;q=0.9,fr-FR;q=0.8,fr;q=0.7,sw-TZ;q=0.6,sw;q=0.5,es;q=0.4,de;q=0.3' \
  --data-raw 'q=Skip%20to%20content&q=WordPress%20Tavern&q=%C2%B7&q=WordPress%20News%20%E2%80%94%20Free%20as%20in%20Beer.&q=Search%20for%3A&q=%0A%09%09%09Navigation%09%09&q=About&q=Contact&q=Podcast&q=News&q=Opinion&q=Plugins&q=Themes&q=Events&q=The%20Automattic%20Theme%20Team%20Announces%20Blockbase%2C%20Its%20New%20Block%20Parent%20Theme&q=%3Ca%20i%3D0%3EJustin%20Tadlock%3C%2Fa%3E%3Ca%20i%3D1%3E%C2%B7%3C%2Fa%3E&q=May%2028%2C%202021&q=%3Ca%20i%3D0%3E%C2%B7%3C%2Fa%3E%3Ca%20i%3D1%3ENo%20Comments%3C%2Fa%3E&q=Any%20WordPress%20company%20that%20builds%20and%20maintains%20themes%20worth%20its%20salt%20is%20already%20doing%20at%20least%20some%20preliminary%20work%20as%E2%80%89%E2%80%A6%E2%80%89&q=Continue%20reading%C2%A0The%20Automattic%20Theme%20Team%20Announces%20Blockbase%2C%20Its%20New%20Block%20Parent%20Theme%C2%A0%E2%86%92&q=Happy%2018th%20Birthday%2C%20WordPress&q=%3Ca%20i%3D0%3ESarah%20Gooding%3C%2Fa%3E%3Ca%20i%3D1%3E%C2%B7%3C%2Fa%3E&q=May%2027%2C%202021&q=WordPress%20is%20celebrating%2018%20years%20today%20since%20the%20first%20release%20of%20the%20software%20to%20the%20general%20public.%20That%20release%20post%2C%E2%80%89%E2%80%A6%E2%80%89&q=Continue%20reading%C2%A0Happy%2018th%20Birthday%2C%20WordPress%C2%A0%E2%86%92&q=Gutenberg%2010.7%20Integrates%20With%20the%20Pattern%20Directory%2C%20Introduces%20New%20Block%20Design%20Controls' \
  --compressed

henno commented 3 years ago

Here is the request used by the extension. I think it can also be used :


If you change a single character in --data-raw, you'll get

Your client does not have permission to get URL /translate_a/t?anno=3&client=tee&format=html&v=1.0&key&logld=vTE_20210503_00&sl=auto&tl=it&tc=2&sr=1&tk=67691.518207&mode=1 from this server.

taiviemthoi commented 3 years ago

Hi @Blair2004 @henno, I know tk=67691.518207 is generated from the content being translated. I tried generating a token for new content using Stichoza\GoogleTranslate\Tokens\GoogleTokenGenerator, but it is not working. Can you tell me how the token is generated?

Blair2004 commented 3 years ago

I ended up using the www.google.com/async/translate endpoint. I created a custom Guzzle request and used DomQuery to extract the translated text. Here is how the code looks:

$client     =   new Client;
        $request    =   $client->request( 'POST', 'https://www.google.com/async/translate?vet=12ahUKEwjT7Maf1O_wAhUDBGMBHdhEBykQqDgwAHoECAIQJg..i&ei=XZqyYJPKMIOIjLsP2ImdyAI&yv=3', [
            'headers'   =>  [
                'authority'         =>  'www.google.com',
                'sec-ch-ua'         =>  '" Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
                'sec-ch-ua-mobile'  =>  '?0',
                'user-agent'        =>  collect( $this->randomUserAgent )->shuffle()->first(),
                'content-type'      =>  'application/x-www-form-urlencoded;charset=UTF-8',
                'accept'            =>  '*/*',
                'origin'            =>  'https://www.google.com',
                'x-client-data'     =>  'CIe2yQEIpLbJAQipncoBCOH2ygEIqJ3LAQigoMsBCKygywEI8fDLAQiB8ssBCNzyywEIqPPLARiOnssBGJH1ywE=',
                'sec-fetch-site'    =>  'same-origin',
                'sec-fetch-mode'    =>  'cors',
                'sec-fetch-dest'    =>  'empty',
                'referer'           =>  'https://www.google.com/',
                'accept-language'   =>  'en-US,en;q=0.9,fr-FR;q=0.8,fr;q=0.7,sw-TZ;q=0.6,sw;q=0.5,es;q=0.4,de;q=0.3',
                'cookie'            =>  'SEARCH_SAMESITE=CgQIyZIB; SID=-AdTEiYBHQxwyi6tM21CKmx0c4Y1a4q433Cacx-mQACfzwIH0I1fT0wH7pVmUnd_kK_QOA.; __Secure-3PSID=-AdTEiYBHQxwyi6tM21CKmx0c4Y1a4q433Cacx-mQACfzwIHKrDd2zvJ_c5gnPl4a2MMeA.; HSID=Ayll6tzipONjmj1m4; SSID=AqEGOJ5K3d9zhN9JB; APISID=iY7F7EQuWx9eYks1/AAK2jCg-YLchJ2xIx; SAPISID=2ZivtOqq29XgIoHy/ArdWxr5KbTd4RtM9J; __Secure-3PAPISID=2ZivtOqq29XgIoHy/ArdWxr5KbTd4RtM9J; OTZ=5999998_52_52__52_; NID=216=Ni17mzF6uLOBNG4iasK6JP9GjDmN9BbP-VFSNdu6KgFipkAdhdzCVYo9IWOCbkvmHa6HYd7VAaWO40EnGURxQYczydEHbQFatNbk5wDnZwBw0I8aJN8xlpNDynCxs5vHahDdOSFuEt2ppr-BK90W816xk3QOlzDgU1pyHWv0dJqMEVbpSNDIxUCZAJz8GO1oJq5fv1JfJQDYYZ1BJO6EUXww8kdlmGIrNzhmAAvKHUnhu7PKv98OY6EHT39EMC187f1ewAVZV7zlSgcAKNEzgxcFh6PhtMHH6srqOkxkm0E-6oK1l5KBZZSXkvDvDXu_bD-2t8hj0m8-R7hASU5u9AiScP7zjcxumRtpEt1vRA9WHeLCY-EZQ5R8T1A7vpigqpsh9x8O9zOqRkgXcq4R7zL-ww3ohf3chjkQwLX5J9xLnMreSKQ; 1P_JAR=2021-05-29-19; DV=w-LWwaovX_lHMO5HzDUyvBMlILCam9c7hOlmyewo2gAAACAcuD9TMkTrYAAAAGDtw7_cZzBvRwAAAA; UULE=a+cm9sZTogMQpwcm9kdWNlcjogMTIKdGltZXN0YW1wOiAxNjIyMzE3NjY0MTU1MDAwCmxhdGxuZyB7CiAgbGF0aXR1ZGVfZTc6IDM4NDY5NjMyCiAgbG9uZ2l0dWRlX2U3OiAxMTUwMTU2ODAKfQpyYWRpdXM6IDQ3NTA0NDAKcHJvdmVuYW5jZTogNgo=; SIDCC=AJi4QfHkAEuJQkjqHKaVrOSGMBerdz9iiZVsPsE2rw2KWEfGkcMczh3Oo7pwg-Mjmz1EqsE-YrF3; __Secure-3PSIDCC=AJi4QfHUrZjD571gCm5-jqaOQULhDdqmb5ql92leEnpszMcN1eHpL0R-xACOJwmdgoQyIoE1zBBv',
            ],
            'proxy'                 =>  $proxy,
            'form_params'           =>  [
                'async' =>  'translate,sl:' . $sourceLanguage . ',tl:' . $destination . ',st:' . urlencode( $text ) . ',id:1622317680604,qc:true,ac:true,_id:tw-async-translate,_pms:s,_fmt:pc'
            ]
        ]);

        $dom    =   '<div>' . ( ( string ) $request->getBody() ) . '</div>';
        $query  =   new DomQuery( $dom );

        return $query->find( '#tw-answ-target-text' )->text();

So far it works, we only need to figure out the accuracy of the translation.

sudofox commented 3 years ago

Any news on this? I'm translating back and forth between Japanese and English, and the translations I get back are much worse than the ones obtained directly via Google Translate's web interface.

Blair2004 commented 3 years ago

Hi, the solution I've shared so far works for me, but I'm forced to make many requests to Google, which ends in a "too many requests" exception. So I've investigated how the Google Translate extension generates the "tk" query parameter.

It looks like the value is generated from the content. That's why, as @henno mentioned, the whole request is no longer valid if the body is modified. As shown in the image below, I've found the function that generates the token from the string being translated.

screenshot-newtab-2021 06 20-20_20_09

The function itself looks like this.

image

I've only just made this finding. I'll investigate more and see how I can create a similar function in PHP to generate that token. This should be a nice improvement to the library, as we'll also be able to send an array of strings to Google for translation.

sudofox commented 3 years ago

You're awesome!!

Blair2004 commented 3 years ago

Hi, I'm coming with some new updates. In order to use the function that generates the "tk" token, we need a key that is only available in a file provided by Google itself: https://translate.google.com/translate_a/element.js

image

That key should be used with a class that generates the token. I created a sample class.

class TokenGenerator {
    function getKey( $text, $token ) {
        $tokenExploded  =   explode( '.', $token );
        $prefix         =   ( int ) $tokenExploded[0] ?? 0;

        for( 
            $data   =   [],
            $eIndex     =   0,
            $fIndex     =   0;
            $fIndex < strlen( $text ); $fIndex++
        ) {
            $stringPosition     =   $this->charCodeAt( $text, $fIndex );

            if ( 128 > $stringPosition ) {
                $data[$eIndex++]    =   $stringPosition;
            } else {
                if ( 2048 > $stringPosition ) {
                    $data[$eIndex++] = $stringPosition >> 6 | 192;
                } else if ( 
                  55296 == ( $stringPosition & 64512 ) && 
                  $fIndex + 1 < count( $text ) && 
                  56320 == $this->charCodeAt( $text, $fIndex + 1 ) & 64512 
                ) {
                    $stringPosition     =   65536 + ( ( $stringPosition & 1023 ) << 10 ) + $this->chartCodeAt( ++$fIndex ) & 1023;
                    $data[$eIndex++]    =   $stringPosition >> 18 | 240;
                    $data[$eIndex++]    =   $stringPosition >> 12 & 63 | 128;
                } else {
                    $data[$eIndex++]    =   $stringPosition >> 12 | 224;
                    $data[$eIndex++]    =   $stringPosition >> 6 & 63 | 128;
                    $data[$eIndex++]    =   $stringPosition & 63 | 128;
                }
            }
        }

        $text   =   $token;

        for( $e = 0; $e < count( $data ) ; $e++ ) {
            $text   +=  $data[$e];
            $text   =   $this->jrChars( $text, '+-a^+6' );
        }

        $text   =   $this->jrChars( $text, '+-3^+b+-f' );
        $text   ^=  ( int ) $tokenExploded[1] ?? 0;

        if ( 0 > $text ) {
            $text   =   ( ( $text & 2147483647 ) + 2147483648 );
        }

        return ( ( string ) $text %1E6 ) . ( '.' ) . ( $tokenExploded ^ $token );         
    }

    function charCodeAt($string, $offset) {
        $string = mb_substr($string, $offset, 1);
        list(, $ret) = unpack('S', mb_convert_encoding($string, 'UTF-16LE'));
        return $ret;
    }

    function jrChars($a, $b) {
        for ($c = 0; $c < strlen( $b ) - 2; $c += 3) {
            $d = substr( $b, $c + 2);
            $d = "a" <= $d ? $this->charCodeAt( $d, 0 ) - 87 : ( int ) $d;
            $d = "+" == substr( $b, $c + 1) ? $a >> $d : $a << $d;
            $a = "+" == substr( $b, $c ) ? $a + $d & 4294967295 : ( $a ^ $d );
        }

        return $a;
    }
}

$generator  =   new TokenGenerator;
$generator->getKey( 'Hello World', "451185.3571800534" ); // output : 493811.451184

I'll now do tests with Google to see whether it's effective or not.

sudofox commented 3 years ago

I tried my hand at it, and your example class is a bit broken for me (I changed it to one implementing TokenProviderInterface and added the interface method, but it was wacky, especially around charCodeAt), so I spent a bit of time trying to dig up the source from within the gtranslate webapp page. After a bit of hacking around, I was able to produce this: https://gist.github.com/sudofox/3b7c5b75472392e15891537f0dae2325

It's what you see starting here:

Screenshot from 2021-06-22 11-29-39

which is deeply nested inside more uglified evals inside JS objects, not going to track back to where I found it (just searched for one of the magic numbers in your example function to find it)

Relevant part no. 1:

      jp = function(u, S, z, I, D, f, A, K, J, q, Q, x, k) {
        for (f = (I = J = 0, []); J < S.length; J++) q = S.charCodeAt(J), 128 > q ? f[I++] = q : (2048 > q ? f[I++] = (D = q >> 6, -193 - 2 * ~(D | 192) + (~D | 192)) : (55296 == -~q + (~q ^ 64512) + (~q & 64512) && J + 1 < S.length && 56320 == (K = S.charCodeAt(J + 1), (K | 0) + (~K ^ 64512) - (K | -64513)) ? (q = 65536 + ((q & u) << 10) + (x = S.charCodeAt(++J), -2 * ~(x & u) - 1 + ~x + (x & -1024)), f[I++] = q >> 18 | 240, f[I++] = (Q = q >> 12 & 63, 128 + (Q & -129))) : f[I++] = (k = q >> 12, (k | 0) + ~(k & 224) - -225), f[I++] = (A = q >> 6 & 63, z - (~A ^ 128) - (~A & 128))), f[I++] = (q | 0) + (q & -64) - 2 * (q ^ 63) + 2 * (~q & 63) | 128);
        return f
      },

Going to add more comments when I get a sec. Let's solve this together!

Blair2004 commented 3 years ago


Hi, yes, I noticed my class doesn't generate the right token. I tried to convert the original JavaScript functions to PHP, and it looks like I made a mistake somewhere.
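
For comparison, here is a rough sketch of how such a port might handle JavaScript's 32-bit integer behaviour in PHP. It follows the structure of the class posted earlier (the same operation strings and constants) and rests on several assumptions: a 64-bit PHP build, the input being valid UTF-8 so that the string's bytes can be used directly, right shifts being unsigned (JavaScript's >>>), and the suffix being the numeric result XOR the first part of the key. The names toInt32, mix and generateTk are hypothetical, and the output has not been checked against Google's servers.

```php
<?php
declare(strict_types=1);

/** Wrap an integer to a signed 32-bit value, like JavaScript's ToInt32. */
function toInt32(int $x): int
{
    $x &= 0xFFFFFFFF;
    return ($x & 0x80000000) !== 0 ? $x - 0x100000000 : $x;
}

/** Apply a sequence of three-character operations such as "+-a^+6" to $a. */
function mix(int $a, string $ops): int
{
    for ($c = 0; $c < strlen($ops) - 2; $c += 3) {
        $ch    = $ops[$c + 2];
        // Letters encode shift amounts 10 and up ('a' => 10), digits are literal.
        $shift = ord($ch) >= ord('a') ? ord($ch) - 87 : (int) $ch;
        $d     = $ops[$c + 1] === '+'
            ? ($a & 0xFFFFFFFF) >> $shift   // unsigned right shift
            : toInt32($a << $shift);        // left shift, wrapped to 32 bits
        $a     = $ops[$c] === '+' ? toInt32($a + $d) : toInt32($a ^ $d);
    }

    return $a;
}

/** Build a "tk"-style value from the text and a key of the form "123.456". */
function generateTk(string $text, string $key): string
{
    $parts = explode('.', $key);
    $b     = (int) $parts[0];
    $k     = (int) ($parts[1] ?? 0);

    $a = $b;
    // A PHP string is already a byte sequence, so no charCodeAt emulation is needed.
    for ($i = 0, $n = strlen($text); $i < $n; $i++) {
        $a = mix($a + ord($text[$i]), '+-a^+6');
    }
    $a = mix($a, '+-3^+b+-f');
    $a = toInt32($a ^ $k);
    if ($a < 0) {
        $a = ($a & 0x7FFFFFFF) + 0x80000000; // reinterpret as unsigned
    }
    $c = $a % 1000000;

    return $c . '.' . toInt32($c ^ $b);
}

echo generateTk('Hello World', '451185.3571800534'), PHP_EOL;
```

Wrapping every intermediate result to 32 bits is the part that is easy to get wrong, because PHP integers are 64-bit and do not overflow the way JavaScript's bitwise operators do.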

Blair2004 commented 3 years ago

Hi, small update from my end. I haven't been able to make the "tk" token encoder work properly in PHP, so I decided to involve JavaScript. Since the original code is written in JavaScript, the work becomes easier.

Translator (JS)

So, I created a JavaScript class that translates a JSON file into a given language. It uses the Google translation endpoint used by the extension, which provides better translation results.

Major Benefit

What I like about this approach is that you can submit a list of strings and have them translated and returned using just one request. Previously, using the package, I wasn't able to do that, so for a text with 20 paragraphs I was forced to perform 20 requests and, as I mentioned already, I ended up with a "Too Many Requests" exception (maybe I was using it the wrong way).
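
As a small illustration of the batching idea, the extension request shown earlier sends each string as its own q field in the form body. A sketch of building such a body in PHP might look like this. The helper name buildBatchBody is hypothetical, and the body alone is not enough for a valid request, since the tk value appears to depend on the submitted text.

```php
<?php
declare(strict_types=1);

/**
 * Encode several strings as repeated "q" form fields.
 *
 * @param string[] $strings
 */
function buildBatchBody(array $strings): string
{
    // http_build_query() would emit indexed keys (q[0], q[1], ...),
    // so the pairs are joined manually to get plain repeated "q" fields.
    return implode('&', array_map(
        static fn (string $s): string => 'q=' . rawurlencode($s),
        $strings
    ));
}

echo buildBatchBody(['Skip to content', 'WordPress Tavern']), PHP_EOL;
```

rawurlencode() is used because the captured request encodes spaces as %20 rather than +.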

How It works

1 - I load the file provided by Google that contains the key. I then need to create a virtual browser (in Node.js) so that the key can be added to a window variable.
2 - I create a class that uses the functions extracted from the Google Chrome translator extension. It is used to generate the token.
3 - I read the JSON file to translate (passed as a CLI argument) and join its strings to generate the token.
4 - I run the class with all the necessary information and get an array of translated strings as the result.

screenshot-www scrapingbee com-2021 06 26-11_44_46

Now, I'm not sure how we can make this work with this package. I have to highlight how much easier it was to make this work with Node.js, so I believe we need to have JavaScript involved somehow. I'm out of ideas for now. What do you think the possible next steps are?

henno commented 3 years ago

@sudofox How is your progress?

sudofox commented 3 years ago

I hate to break it to you but I did a bit of cost analysis on my project and found that the usage of the paid API fell far under the "free" limit, so I switched to the official one. Good luck though :0

henno commented 3 years ago

I hate to break it to you but I did a bit of cost analysis on my project and found that the usage of the paid API fell far under the "free" limit, so I switched to the official one. Good luck though :0

Same with me. Unfortunately this project is somewhat useless unless this issue is resolved.

henno commented 3 years ago

What happens when you replace

'client' => 'webapp',

with

'client' => 'gtx',

in vendor/stichoza/google-translate-php/src/GoogleTranslate.php?

henno commented 2 years ago

I got the quality issue fixed by changing the client from webapp to gtx. Does anyone know what gtx stands for, and will there be any side effects from changing the client from webapp to gtx?

henno commented 2 years ago

@sudofox I also tried the "official way", but it was so complicated that after spending an hour trying to fix permission issues with Google Cloud, I gave up. Then I found https://github.com/statickidz/php-google-translate-free, which appeared to produce higher quality translations. I snooped in its source code and found this commit, which was made on the same day this issue was opened. I noticed that he had changed the client, tried the same in this project and, what do you know, it worked.

henno commented 2 years ago

@Stichoza, could you change the client and check whether it fixes this issue for you? If it does, please close this issue and release a new version.

carlosvaldesweb commented 2 years ago

Hello @henno, I've changed the client as you mentioned, but the quality is still low. I have seen the statickidz branch, but I couldn't solve it. Could you help me? I've also deleted the other params in dt.

protected $urlParams = [
        'client'   => 'gtx',
        'hl'       => 'en',
        'dt'       => [
            't',   // Translate
        ],
        'sl'       => null, // Source language
        'tl'       => null, // Target language
        'q'        => null, // String to translate
        'ie'       => 'UTF-8', // Input encoding
        'oe'       => 'UTF-8', // Output encoding
        'multires' => 1,
        'otf'      => 0,
        'pc'       => 1,
        'trs'      => 1,
        'ssel'     => 0,
        'tsel'     => 0,
        'kc'       => 1,
        'tk'       => null,
    ];
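
To make the effect of the client parameter concrete, here is a minimal sketch that builds a request URL from a subset of the parameters above. The helper name buildTranslateUrl and the base URL are assumptions for illustration. This is not the library's own code, and it only constructs the URL without sending anything.

```php
<?php
declare(strict_types=1);

/**
 * Build a translate_a/single request URL with the given client value.
 * The base URL is an assumption, not taken from this thread.
 */
function buildTranslateUrl(
    string $text,
    string $source,
    string $target,
    string $client = 'gtx'
): string {
    $params = [
        'client' => $client, // 'gtx' gave better results in this thread than 'webapp'
        'hl'     => 'en',
        'dt'     => 't',     // translation only
        'sl'     => $source, // source language
        'tl'     => $target, // target language
        'q'      => $text,   // string to translate
        'ie'     => 'UTF-8',
        'oe'     => 'UTF-8',
    ];

    return 'https://translate.google.com/translate_a/single?'
        . http_build_query($params, '', '&', PHP_QUERY_RFC3986);
}

echo buildTranslateUrl('Hello there', 'en', 'ru'), PHP_EOL;
```

Passing 'webapp' as the last argument produces the URL variant that this thread associates with the lower quality translations, which makes it easy to compare the two responses side by side.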

henno commented 2 years ago

@arcanaer Did you change webapp to gtx in vendor/stichoza/google-translate-php/src/GoogleTranslate.php?

carlosvaldesweb commented 2 years ago

@henno Yes, sorry, it was an error in my code implementation. I can confirm that if you use my code above, it produces better translations.

Stichoza commented 2 years ago

Thanks @henno! 🥳 It works and I released a new version v4.1.5.

Also added ->setClient() method so if you want to use low quality translation you can type ->setClient('webapp')

henno commented 2 years ago

Hello all,

I noticed that the quality has gone down again. Can anyone verify this?