Animenosekai / translate

A module grouping multiple translation APIs
GNU Affero General Public License v3.0

Not accurate source language autodetection #74

Open joeperpetua opened 1 year ago

joeperpetua commented 1 year ago

Hi! First of all wanted to say that I love the project, have been using it for a while now.

I came across some bizarre behavior that maybe you could check or maybe explain to me (I tried checking the source code for the functions but did not see anything relevant that could be causing this).

In this case, it seems that the source language autodetection is a bit off when giving it short and single words. I reproduced it with Spanish, but I don't know if it does happen in other languages too. In this case, if you give the words "casa" or "hola" for example, it will detect the source language as English instead of Spanish.

For example using the base translator:

Python 3.11.1 (tags/v3.11.1:a7a450f, Dec  6 2022, 19:58:39) [MSC v.1934 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import translatepy
>>> translatepy.Translator().language("casa")
LanguageResult(service=Google, source=casa, result=eng)

Then I tried using the translators explicitly, in this case Reverso and Google, then using the base translator again, and it worked correctly (I guess because of the cache, but I may be wrong):

Python 3.11.1 (tags/v3.11.1:a7a450f, Dec  6 2022, 19:58:39) [MSC v.1934 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import translatepy
>>> translatepy.translators.reverso.ReversoTranslate().language("casa")
LanguageResult(service=Reverso, source=casa, result=spa)
>>> translatepy.translators.google.GoogleTranslate().language("casa")
LanguageResult(service=Google, source=casa, result=spa)
>>> translatepy.Translate().language("casa")
LanguageResult(service=Google, source=casa, result=spa)

But interestingly enough, then, in the same session, using the base translator with the method translate(), the detection was off again:

>>> translatepy.Translate().translate("casa", "en")
TranslationResult(service=Google, source=casa, source_language=eng, destination_language=eng, result=casa)

Any idea why this could be happening? I guess the workaround for now would be to run the GoogleTranslate().language() method first, and then pass its result to the Translator().translate() method to get accurate results, like so:

>>> lang = translatepy.translators.google.GoogleTranslate().language("casa")
>>> translatepy.Translate().translate("casa", "en", lang.result)
TranslationResult(service=Google, source=casa, source_language=spa, destination_language=eng, result=house)

Anyway, I wanted to ask about this and see if there is any reasoning behind it. Sorry for the long message and thanks in advance!

ZhymabekRoman commented 1 year ago

Thanks for reporting this! This is strange: in my case, even the GoogleTranslate class doesn't recognize the language correctly. The problem seems to be on Google's server side:

➜  translate git:(main) ipython3
Python 3.9.2 (default, Feb 28 2021, 17:03:44) 
Type 'copyright', 'credits' or 'license' for more information
IPython 8.8.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import translatepy

In [2]:  translatepy.translators.google.GoogleTranslateV1().language("casa")
Out[2]: LanguageResult(service=Google, source=casa, result=eng)
➜  translate git:(main) ipython3
Python 3.9.2 (default, Feb 28 2021, 17:03:44) 
Type 'copyright', 'credits' or 'license' for more information
IPython 8.8.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import translatepy

In [2]:  translatepy.translators.google.GoogleTranslateV2().language("casa")
Out[2]: LanguageResult(service=Google, source=casa, result=eng)
➜  translate git:(main) ipython3
Python 3.9.2 (default, Feb 28 2021, 17:03:44) 
Type 'copyright', 'credits' or 'license' for more information
IPython 8.8.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import translatepy

In [2]:  translatepy.translators.google.GoogleTranslate().language("casa")
Out[2]: LanguageResult(service=Google, source=casa, result=eng)

joeperpetua commented 1 year ago

Thanks for the response! I experimented a little more, and it does seem that Google Translate is the issue. Also, it seems that the first response will influence the subsequent results. For example, I used GoogleTranslate() first and got result=eng. Then I used Reverso, and the result was the same as the one from Google:

Python 3.11.1 (tags/v3.11.1:a7a450f, Dec  6 2022, 19:58:39) [MSC v.1934 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import translatepy
>>> translatepy.translators.google.GoogleTranslate().language("casa")
LanguageResult(service=Google, source=casa, result=eng)
>>> translatepy.translators.reverso.ReversoTranslate().language("casa")
LanguageResult(service=Reverso, source=casa, result=eng)

But, if you use Reverso first, then the result will be correct when using Google Translate:

Python 3.11.1 (tags/v3.11.1:a7a450f, Dec  6 2022, 19:58:39) [MSC v.1934 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import translatepy
>>> translatepy.translators.reverso.ReversoTranslate().language("casa")
LanguageResult(service=Reverso, source=casa, result=spa)
>>> translatepy.translators.google.GoogleTranslate().language("casa")
LanguageResult(service=Google, source=casa, result=spa)

Could this be related to the cache mechanism?

Animenosekai commented 1 year ago

(I guess because of the cache, but I may be wrong)

Yes, I would guess the same !

But interestingly enough, then, in the same session, using the base translator with the method translate(), the detection was off again

This is normal, because some translators, such as Google Translate, already return the source language from their translation endpoint, while others need to call the language endpoint first.

So, even if you called the language endpoint first with Google Translate, the source language would be the one returned by the translation endpoint.

The weirdest thing is that Google Translate returned Spanish though.

Looking at the official website, we see that indeed the detected language is English

[Screenshot: the Google Translate website detecting "casa" as English]

Animenosekai commented 1 year ago

Also, it seems that the first response will influence the subsequent results

Now this is weird, because it shouldn't lol

This is the part where the GET cache is returned

https://github.com/Animenosekai/translate/blob/490767c8ed89466cf71a9d76ccff5dfd63dcd51c/translatepy/utils/request.py#L179-L181

For the translator cache, here is the part where it gets the cache

https://github.com/Animenosekai/translate/blob/490767c8ed89466cf71a9d76ccff5dfd63dcd51c/translatepy/translators/base.py#L318-L320

But that's weird because we clearly see that you are creating two different instances of the Translator class

>>> translatepy.translators.reverso.ReversoTranslate().language("casa")
LanguageResult(service=Reverso, source=casa, result=spa)
>>> translatepy.translators.google.GoogleTranslate().language("casa")
LanguageResult(service=Google, source=casa, result=spa)

joeperpetua commented 1 year ago

Well, just found a very interesting behavior (or bug) from Google Translate. It seems that it will detect a different language depending on the language of your Google account (GA). For example:

GA - English | detects English
GA - Spanish | detects Spanish
GA - French and German | detect Portuguese

From this, I guess that the best approach would be to just clean the cache on the production server and then go with Reverso to get the language and pass it explicitly.
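
The workaround described above (detect the language with one service, then pass it explicitly to the translation call) could be sketched roughly as follows. The helper name `translate_with_detected_source` and the two stub classes are hypothetical and exist only so the snippet runs without network access. In real use the detector would presumably be `ReversoTranslate()` and the translator `Translate()`, called the same way as in the sessions earlier in this thread.

```python
from types import SimpleNamespace


def translate_with_detected_source(detector, translator, text, destination):
    """Detect the source language with one service, then pass it explicitly."""
    detected = detector.language(text)
    # The third positional argument is the source language, mirroring the
    # translate("casa", "en", lang.result) call shown earlier in the thread.
    return translator.translate(text, destination, detected.result)


class StubDetector:
    """Hypothetical stand-in for a detector such as ReversoTranslate()."""

    def language(self, text):
        return SimpleNamespace(result="spa")


class StubTranslator:
    """Hypothetical stand-in for Translate(); it only knows one word."""

    def translate(self, text, destination_language, source_language="auto"):
        translated = "house" if source_language == "spa" else text
        return SimpleNamespace(
            source_language=source_language,
            destination_language=destination_language,
            result=translated,
        )


outcome = translate_with_detected_source(
    StubDetector(), StubTranslator(), "casa", "eng"
)
print(outcome.source_language, outcome.result)
```

Passing the detector and translator in as arguments keeps the detection step independent of whichever service ends up doing the translation.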

Animenosekai commented 1 year ago

Well, just found a very interesting behavior (or bug) from Google Translate. It seems that it will detect a different language depending on the language of your Google account. For example:

Wow now that's interesting...

I guess it might be a feature to better guess the expected result.

Animenosekai commented 1 year ago

But then it might change the result based on the service URL used 🤔

Animenosekai commented 1 year ago

Just confirmed it:

>>> from translatepy.translators.google import GoogleTranslate
>>> g = GoogleTranslate(service_url="translate.google.es")
>>> g.language("casa")
LanguageResult(service=Google, source=casa, result=spa)
>>> g = GoogleTranslate(service_url="translate.google.fr")
>>> g.language("casa")
LanguageResult(service=Google, source=casa, result=spa)
>>> g.clean_cache()
>>> g.language("casa")
LanguageResult(service=Google, source=casa, result=por)

And yes something is happening with the caches

joeperpetua commented 1 year ago

But that's weird because we clearly see that you are creating two different instances of the Translator class

Well, that is interesting indeed, I would have totally blamed it on the cache to be honest lol

I guess it might be a feature to better guess the expected result. But then it might change the result based on the service URL used 🤔

Yeah, but I think it kinda makes sense for words that are the same in different languages. For example, casa is the same in Spanish, Portuguese and Italian, so if your GA is set to Italian, the detection will go with Italian.

joeperpetua commented 1 year ago

And yes something is happening with the caches

Well, that is something lol, I tried checking the source code before, but my Python skills are not that sharp 😅 maybe you have a better eye to catch what's going on lol

ZhymabekRoman commented 1 year ago

And yes something is happening with the caches

It's not a bug, it's a feature. When I designed the V2 translatepy architecture, I made a single cache instance available to all BaseTranslate class instances. In practice, that doesn't seem to have been a good idea. If required, I can make a PR to fix this and integrate the new LRU cache logic (https://github.com/Animenosekai/translate/issues/58).

https://github.com/Animenosekai/translate/blob/490767c8ed89466cf71a9d76ccff5dfd63dcd51c/translatepy/translators/base.py#L51-L62

The caches are initialized as class attributes, not instance attributes. More info: https://stackoverflow.com/a/207128/13452914
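
For readers unfamiliar with the class-attribute behavior mentioned here, a minimal self-contained illustration follows. It is not translatepy's actual code: the class names and the fixed "detected" answers are hypothetical, chosen only to reproduce the pattern seen in this thread, where the first service to answer determines what every later instance returns.

```python
class SharedCacheTranslator:
    # Class attribute: a single dict object shared by every instance.
    cache = {}

    def __init__(self, name, detected):
        self.name = name
        self.detected = detected  # what this fake service would answer

    def language(self, text):
        # Looks up self.cache, which resolves to the shared class-level dict.
        if text not in self.cache:
            self.cache[text] = self.detected
        return self.cache[text]


class PerInstanceCacheTranslator(SharedCacheTranslator):
    def __init__(self, name, detected):
        super().__init__(name, detected)
        # Instance attribute: shadows the class attribute with a fresh dict.
        self.cache = {}


google = SharedCacheTranslator("Google", "eng")
reverso = SharedCacheTranslator("Reverso", "spa")
shared_results = (google.language("casa"), reverso.language("casa"))

google = PerInstanceCacheTranslator("Google", "eng")
reverso = PerInstanceCacheTranslator("Reverso", "spa")
isolated_results = (google.language("casa"), reverso.language("casa"))

print(shared_results)    # both answers come from the first cached entry
print(isolated_results)  # each instance keeps its own answer
```

With the shared dict, the second "service" never gets asked, because the entry written by the first one is already there. Assigning the dict inside `__init__` gives each instance its own cache.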

Animenosekai commented 1 year ago

When I designed the V2 translatepy architecture, I made a single cache instance available to all BaseTranslate class instances.

Yes, I think this should be changed because people using translators separately expect different results from each instance.

Moreover, if they want a shared cache, they might just use the Translate class.

Also yea you can PR the new LRU logic anytime you want !
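
As a rough idea of what a per-instance LRU cache could look like, here is a small sketch built on `collections.OrderedDict`. It is only an assumption about the general shape and not the design from the linked issue or the eventual PR. `LRUCache` and `CachedDetector` are hypothetical names, and the lambdas stand in for real detection calls.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny least-recently-used cache with a fixed maximum size."""

    def __init__(self, max_size=128):
        self.max_size = max_size
        self._entries = OrderedDict()

    def get(self, key, default=None):
        if key not in self._entries:
            return default
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        if len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # evict the oldest entry

    def clear(self):
        self._entries.clear()


class CachedDetector:
    """Hypothetical wrapper: every instance owns its own cache."""

    def __init__(self, detect, max_size=128):
        self._detect = detect
        self.cache = LRUCache(max_size)  # created per instance, not per class

    def language(self, text):
        hit = self.cache.get(text)
        if hit is None:
            hit = self._detect(text)
            self.cache.put(text, hit)
        return hit


es = CachedDetector(lambda text: "spa", max_size=2)
fr = CachedDetector(lambda text: "por", max_size=2)
results = (es.language("casa"), fr.language("casa"))

# Two more lookups push "casa" out of the size-2 cache of `es`.
es.language("hola")
es.language("gato")
evicted = es.cache.get("casa") is None
print(results, evicted)
```

Because the cache is created in `__init__`, two detectors configured differently no longer see each other's entries, and the size limit keeps a long-running process from growing the cache without bound.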

joeperpetua commented 1 year ago

Thank you all guys for the help 🙌🙌

ZhymabekRoman commented 1 year ago

New PR done: https://github.com/Animenosekai/translate/pull/76

➜  translate git:(main) ipython
Python 3.9.2 (default, Feb 28 2021, 17:03:44) 
Type 'copyright', 'credits' or 'license' for more information
IPython 8.8.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from translatepy.translators.google import GoogleTranslate

In [2]: g = GoogleTranslate(service_url="translate.google.es")

In [3]: g.language("casa")
Out[3]: LanguageResult(service=Google, source=casa, result=spa)

In [4]: g = GoogleTranslate(service_url="translate.google.fr")

In [5]: g.language("casa")
Out[5]: LanguageResult(service=Google, source=casa, result=por)