[Feature]: Normalized translation return object/dict

jbscout commented 1 year ago

Expected to happen

It would be very useful if there was a way to get a normalized return object/dict from a translation request.

I like that translate_text() returns a dict of the raw data from the translator. The problem is that each translator returns a dict that is vastly different from every other translator's dict format.

Bing returns:

{'detectedLanguage': {'language': 'da', 'score': 1.0}, 'translations': [{'text': 'Letter from the Danish Environmental Protection Agency (2023_05)', 'to': 'en', 'sentLen': {'srcSentLen': [33], 'transSentLen': [64]}}]}

Google returns:

{'data': [[None, None, 'da', [[[0, [[[None, 33]], [True]]]], 33], [['Brev fra Miljøstyrelsen (2023_05)', None, None, 33]], None, ['Brev fra Miljøstyrelsen (2023_05)', 'auto', 'en', True]], [[[None, None, None, None, None, [['Letter from the Danish Environmental Protection Agency (2023_05)', None, None, None, [['Letter from the Danish Environmental Protection Agency (2023_05)', [5], []], ['Letter from the Danish Environmental Protection Agency (2023_05)', [11], []]]]], None, None, None, []]], 'en', 1, 'da', ['Brev fra Miljøstyrelsen (2023_05)', 'auto', 'en', True]], 'da']}

This makes it very difficult to parse the returned value of translate_text() if you switch translators, or if they decide to change their return format.

As the project's goal is to make translation agnostic of which translator I use, it would be nice if the project's API provided me with a consistently formatted return value (either via translate_text() or a new function). The API could parse and map the translator's returned dict into a normalized dict.

What I am looking for is for translate_text() to return a dict with the following keys, regardless of which translation engine was used:

detectedLanguage (the detected, not "auto")
detectedLanguage_score
targetLanguage
originalText
translatedText
translatorUsed
rawReturnedDict (maybe, which would be the dict currently returned from translate_text())

Maybe make a new function that does this, so that translate_text() remains backwards compatible. Or add an input parameter to the **kwargs (e.g., :param if_normalize_dict: bool, default False) that switches the returned dict from the current format to the normalized one.
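A minimal sketch of what such a normalizing step could look like. Every name here (normalize_result, EXTRACTORS, _from_bing, _from_google) is hypothetical and not part of the library, and the index paths are simply read off the two sample payloads above, so they are assumptions rather than a documented contract.

```python
from typing import Any, Dict

# Hypothetical helpers: none of these names exist in the library.
# The index paths are read off the two sample payloads above and would
# break as soon as a service changes its response format.


def _from_bing(raw: Dict[str, Any]) -> Dict[str, Any]:
    first = raw["translations"][0]
    return {
        "detectedLanguage": raw["detectedLanguage"]["language"],
        "detectedLanguage_score": raw["detectedLanguage"]["score"],
        "targetLanguage": first["to"],
        "translatedText": first["text"],
    }


def _from_google(raw: Dict[str, Any]) -> Dict[str, Any]:
    data = raw["data"]
    return {
        "detectedLanguage": data[2],
        # Assumption: no field in the sample is clearly a detection score.
        "detectedLanguage_score": None,
        "targetLanguage": data[1][1],
        "translatedText": data[1][0][0][5][0][0],
    }


# One extractor per translator; supporting a new engine means adding one entry.
EXTRACTORS = {"bing": _from_bing, "google": _from_google}


def normalize_result(raw: Dict[str, Any], translator: str, original_text: str) -> Dict[str, Any]:
    """Map one translator's raw dict onto the proposed common keys."""
    normalized = EXTRACTORS[translator](raw)
    normalized["originalText"] = original_text
    normalized["translatorUsed"] = translator
    normalized["rawReturnedDict"] = raw
    return normalized


# Usage with the Bing payload quoted above.
bing_raw = {
    "detectedLanguage": {"language": "da", "score": 1.0},
    "translations": [{
        "text": "Letter from the Danish Environmental Protection Agency (2023_05)",
        "to": "en",
        "sentLen": {"srcSentLen": [33], "transSentLen": [64]},
    }],
}
result = normalize_result(bing_raw, "bing", "Brev fra Miljøstyrelsen (2023_05)")
print(result["detectedLanguage"], result["translatedText"])
```

Keeping the per-translator knowledge in small extractor functions means a format change on one service only touches one function, and callers keep reading the same keys.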

Another option is a set of functions that each return one piece of this information from the most recent translate_text() call.

For example, detectLanguage(), translatedText(), translatorUsed(), etc.
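As a rough illustration of the accessor-style alternative, here is a sketch of a small holder object that remembers the most recent normalized result. The class LastTranslation and its method names are hypothetical; nothing like it exists in the library today.

```python
from typing import Any, Dict, Optional


class LastTranslation:
    """Hypothetical holder for the most recent normalized result."""

    def __init__(self) -> None:
        self._last: Optional[Dict[str, Any]] = None

    def record(self, normalized: Dict[str, Any]) -> None:
        # Would be called once after every translation request.
        self._last = normalized

    def _get(self, key: str) -> Any:
        if self._last is None:
            raise RuntimeError("no translation has been recorded yet")
        return self._last[key]

    def detected_language(self) -> str:
        return self._get("detectedLanguage")

    def translated_text(self) -> str:
        return self._get("translatedText")

    def translator_used(self) -> str:
        return self._get("translatorUsed")


last = LastTranslation()
last.record({
    "detectedLanguage": "da",
    "translatedText": "Letter from the Danish Environmental Protection Agency (2023_05)",
    "translatorUsed": "bing",
})
print(last.detected_language(), last.translator_used())
```

One caveat with this design: "last call" state is shared, so concurrent requests could overwrite each other's result. Returning a normalized dict per call avoids that.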

Thank you

Expected APP Version

next newest version

Expected Python Version

>=3.8 (Default)

Expected Runtime Environment

NoArch (Default)

Country/Region

Denmark

Expected Output

{'detectedLanguage': 'da', 'detectedLanguage_score': 1.0, 'targetLanguage': 'en', 'originalText':'Brev fra Miljøstyrelsen (2023_05)', 'translatedText':'Letter from the Danish Environmental Protection Agency (2023_05)', 'translatorUsed':'bing', 'rawReturnedDict': {'detectedLanguage': {'language': 'da', 'score': 1.0}, 'translations': [{'text': 'Letter from the Danish Environmental Protection Agency (2023_05)', 'to': 'en', 'sentLen': {'srcSentLen': [33], 'transSentLen': [64]}}]}}

Code of Conduct

[X] I agree to follow this project's Code of Conduct

UlionTse commented 1 year ago

@jbscout

Good suggestion. I have thought about similar questions myself, but what is the use case for so many output fields, and aren't several of them just a repetition of the input? I also have to ask where my maintenance focus lies and what the core of the library is. My answer is that the focus is on accurate translation and on supporting more translation services, not on derivative features. At one point I even wanted to remove is_detail_result=True. It has also been proposed to return the detected from_language instead of "auto", but language detection is already handled by specialized libraries, with high accuracy and little effort, so it does not need to come from the translation services. Everyone has different needs, and it is more important for this library to provide stable core functionality. Thanks.

ManuelSchneid3r commented 1 year ago

(Also answering your comment on #140.) I see your point, especially about dropping is_detail_result, which is useless because its output is unpredictable. Then again, the language detection of other services and libraries is probably nowhere near as good as that of the available translation services, which are most likely based on deep recurrent neural networks and do an excellent job at precisely this task. I am not sure how you extract the translated text, but if it is a hardcoded accessor, adding the detected source language would not be much of a hassle, provided it is there at all; I assume most translators provide it.
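To illustrate how small such an accessor could be, here is a sketch that looks up the detected language along a fixed path and returns None when the field is missing. The helper names (dig, detected_language, DETECTED_LANGUAGE_PATHS) are made up for this example, and the paths are assumptions taken from the Bing and Google sample payloads in the original post.

```python
from typing import Any, Optional, Sequence, Union

Key = Union[str, int]

# Assumed paths, read off the sample payloads in this issue.
# They are not a documented contract of either service.
DETECTED_LANGUAGE_PATHS = {
    "bing": ("detectedLanguage", "language"),
    "google": ("data", 2),
}


def dig(obj: Any, path: Sequence[Key]) -> Optional[Any]:
    """Follow a sequence of dict keys / list indices; return None if any step fails."""
    for key in path:
        try:
            obj = obj[key]
        except (KeyError, IndexError, TypeError):
            return None
    return obj


def detected_language(raw: Any, translator: str) -> Optional[str]:
    """Return the detected source language, or None if the payload lacks it."""
    path = DETECTED_LANGUAGE_PATHS.get(translator)
    return None if path is None else dig(raw, path)


print(detected_language({"detectedLanguage": {"language": "da", "score": 1.0}}, "bing"))
```

Returning None instead of raising covers the "if it is there at all" case: a translator that does not report the language, or one whose format has changed, simply yields no value.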

Some thoughts on the proposed return data:

detectedLanguage: You asked for a use case. It is obvious to me that one of the major uses of this library (especially the auto feature) will be to provide translation services directly to users. If the language detection is reliable enough, specifying the source language is redundant, and skipping it saves time. As a concrete example, see my keyboard launcher plugin:

As you can see, there is a need.

detectedLanguage_score: This might also be useful if displayed to the user, but unless it comes with an array of translations, each with its own score, I don't see much gain in it. Besides, I don't know whether translation engines provide it at all.
targetLanguage, originalText, translatedText, translatorUsed: These are redundant. The client knows what it passes to the function.
rawReturnedDict (maybe, which would be the dict currently returned from translate_text()): As mentioned above, the structure of this dict is unpredictable and, as such, useless to the client.

Regards

UlionTse / translators