UlionTse / translators

🌏🌍🌎Translators🌎🌍🌏 is a library that aims to bring free, multiple, enjoyable translations to individuals and students in Python.
https://pypi.org/project/translators/
GNU General Public License v3.0
1.66k stars 193 forks

[Feature]: Normalized translation return object/dict #139

Open jbscout opened 1 year ago

jbscout commented 1 year ago

Expected behavior

It would be very useful if there was a way to get a normalized return object/dict from a translation request.

I like that translate_text() returns a dict of the raw data from the translator. The problem is that each translator returns a dict that is vastly different from every other translator's dict format.

Bing returns:

{'detectedLanguage': {'language': 'da', 'score': 1.0}, 'translations': [{'text': 'Letter from the Danish Environmental Protection Agency (2023_05)', 'to': 'en', 'sentLen': {'srcSentLen': [33], 'transSentLen': [64]}}]}

Google returns:

{'data': [[None, None, 'da', [[[0, [[[None, 33]], [True]]]], 33], [['Brev fra Miljøstyrelsen (2023_05)', None, None, 33]], None, ['Brev fra Miljøstyrelsen (2023_05)', 'auto', 'en', True]], [[[None, None, None, None, None, [['Letter from the Danish Environmental Protection Agency (2023_05)', None, None, None, [['Letter from the Danish Environmental Protection Agency (2023_05)', [5], []], ['Letter from the Danish Environmental Protection Agency (2023_05)', [11], []]]]], None, None, None, []]], 'en', 1, 'da', ['Brev fra Miljøstyrelsen (2023_05)', 'auto', 'en', True]], 'da']}

This makes it very difficult to parse the returned value of translate_text() if you switch translators, or if they decide to change their return format.

As the project's goal is to make translation agnostic of which translator I use, it would be nice if the project's API provided me with a consistently formatted return value (either via translate_text() or a new function). The API could parse and map the translator's returned dict into a normalized dict.

What I am looking for is translate_text() to return a dict, regardless of which translation engine was used, with the keys shown under Expected Output below.

One way to do this is a new function, so that translate_text() remains backwards compatible. Alternatively, add an input parameter to the **kwargs (e.g., :param if_normalize_dict: bool, default False) that switches the returned dict from the current format to the normalized format.
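A minimal sketch of the parameter option, written as a wrapper rather than a change to the library. The names translate_normalized, normalize_bing, NORMALIZERS and fetch_raw are hypothetical and not part of translators; the raw result is obtained through a caller-supplied function so the sketch makes no assumption about how translate_text() is invoked. Only the Bing mapping is shown, and its shape is taken from the Bing sample above.

```python
from typing import Any, Callable, Dict

# Hypothetical type: a caller-supplied function (query_text, translator) -> raw dict.
RawFetcher = Callable[[str, str], Dict[str, Any]]


def normalize_bing(raw: Dict[str, Any], query_text: str) -> Dict[str, Any]:
    """Map a Bing raw dict (shape assumed from the sample in this issue) to flat keys."""
    detected = raw.get("detectedLanguage", {})
    first = raw["translations"][0]
    return {
        "detectedLanguage": detected.get("language"),
        "detectedLanguage_score": detected.get("score"),
        "targetLanguage": first.get("to"),
        "originalText": query_text,
        "translatedText": first.get("text"),
        "translatorUsed": "bing",
        "rawReturnedDict": raw,  # keep the untouched payload for callers that need it
    }


# Hypothetical registry: one normalizer per translator name.
NORMALIZERS: Dict[str, Callable[[Dict[str, Any], str], Dict[str, Any]]] = {
    "bing": normalize_bing,
}


def translate_normalized(
    fetch_raw: RawFetcher,
    query_text: str,
    translator: str = "bing",
    if_normalize_dict: bool = False,
) -> Dict[str, Any]:
    """Return the raw dict by default, or the normalized dict when requested."""
    raw = fetch_raw(query_text, translator)
    if not if_normalize_dict:
        return raw  # backwards-compatible path: nothing changes
    normalizer = NORMALIZERS.get(translator)
    if normalizer is None:
        raise NotImplementedError(f"no normalizer registered for {translator!r}")
    return normalizer(raw, query_text)
```

With the flag left at its default the raw dict passes through unchanged, so existing callers would not be affected. In practice fetch_raw could be a thin wrapper around translate_text().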

Another option is a set of functions that each return one piece of this information from the most recent translate_text() call.

For example, detectLanguage(), translatedText(), translatorUsed(), etc.
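A rough sketch of what such accessors could look like, assuming the normalized keys from the Expected Output section. The class LastTranslation and its record() method are hypothetical; the accessor names follow the ones suggested above. State is kept on an object rather than at module level so that two callers do not overwrite each other's last result.

```python
from typing import Any, Dict, Optional


class LastTranslation:
    """Hypothetical holder for the most recent normalized translation result."""

    def __init__(self) -> None:
        self._last: Optional[Dict[str, Any]] = None

    def record(self, normalized: Dict[str, Any]) -> None:
        # Store a shallow copy so later changes by the caller do not leak in.
        self._last = dict(normalized)

    def _require(self) -> Dict[str, Any]:
        if self._last is None:
            raise RuntimeError("no translation has been recorded yet")
        return self._last

    def detectLanguage(self) -> Optional[str]:
        return self._require().get("detectedLanguage")

    def translatedText(self) -> Optional[str]:
        return self._require().get("translatedText")

    def translatorUsed(self) -> Optional[str]:
        return self._require().get("translatorUsed")
```

Each accessor raises if nothing has been recorded yet, which avoids silently returning stale or empty values.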

Thank you

Expected APP Version

next newest version

Expected Python Version

>=3.8 (Default)

Expected Runtime Environment

NoArch (Default)

Country/Region

Denmark

Expected Output

{'detectedLanguage': 'da', 'detectedLanguage_score': 1.0, 'targetLanguage': 'en', 'originalText':'Brev fra Miljøstyrelsen (2023_05)', 'translatedText':'Letter from the Danish Environmental Protection Agency (2023_05)', 'translatorUsed':'bing', 'rawReturnedDict': {'detectedLanguage': {'language': 'da', 'score': 1.0}, 'translations': [{'text': 'Letter from the Danish Environmental Protection Agency (2023_05)', 'to': 'en', 'sentLen': {'srcSentLen': [33], 'transSentLen': [64]}}]}}

UlionTse commented 1 year ago

@jbscout

Good advice. I have thought about similar questions myself, but what is the application scenario for so many outputs, and isn't part of the output just a repetition of the input? Where should my maintenance focus be, and what is the core? My own answer is that the focus and core is accurate translation and more translation services, not derivative features. I even wanted to cut is_detail_result=True at one point. It has also been proposed that the library output the actual from_language instead of auto, but there are specialized libraries for predicting which language a piece of text is in; they do it with high accuracy and little difficulty, so translation services are not needed for this. Everyone has different personal needs, and it is more important for this library to provide stable core functionality. Thanks.

ManuelSchneid3r commented 1 year ago

(Also answering your comment on #140) I see your point (especially about dropping is_detail_result, which is useless since the output is unpredictable), but then again the language detection of other services/libraries is probably not nearly as good as that of the available translation services, which are most likely based on deep recurrent neural networks that do an excellent job at precisely this task. I am not sure how you extract the translated text, but if it is a hardcoded accessor, adding the detected source language wouldn't be much of a hassle, provided it is there at all, and I assume that most translators provide it.
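To illustrate the hardcoded-accessor idea, here is a small sketch, not the library's actual internals. The paths are read off the two raw samples quoted in the first post (Bing and Google) and may not hold for other responses or future formats; DETECTED_LANGUAGE_PATHS, get_path and detected_language are hypothetical names.

```python
from typing import Any, Dict, Optional, Sequence, Union

PathStep = Union[str, int]

# Assumed locations of the detected source language, taken from the samples
# in this issue: Bing nests it in a dict, Google at data[0][2].
DETECTED_LANGUAGE_PATHS: Dict[str, Sequence[PathStep]] = {
    "bing": ("detectedLanguage", "language"),
    "google": ("data", 0, 2),
}


def get_path(raw: Any, path: Sequence[PathStep], default: Any = None) -> Any:
    """Walk nested dicts/lists along path; return default if any step is missing."""
    node = raw
    for step in path:
        try:
            node = node[step]
        except (KeyError, IndexError, TypeError):
            return default
    return node


def detected_language(raw: Any, translator: str) -> Optional[str]:
    """Return the detected source language, or None if it cannot be found."""
    path = DETECTED_LANGUAGE_PATHS.get(translator)
    if path is None:
        return None
    return get_path(raw, path)
```

Returning None instead of raising means a changed response format degrades to "unknown language" rather than breaking the translation itself.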

Some thoughts on the proposed return data:

[Three screenshots attached: 2023-09-01 13:55:08, 13:55:16, 13:55:24]

As you can see there is a need.

Regards