d4n3436 / GTranslate

A collection of free translation APIs (Google Translate, Bing Translator, Microsoft Translator and Yandex.Translate).
MIT License

Add 19 additional languages #3

Closed David-Maisonave closed 2 years ago

David-Maisonave commented 2 years ago

This pull request is only for LanguageDictionary.cs and AggregateTranslator.cs. I'm sorry, but I'm not sure how to remove the other two files from the Pull Request.

Changes to the LanguageDictionary.cs include the following:

I added an extra try catch on AggregateTranslator.cs, because in some cases an exception was thrown inside the catch code.

FYI: I changed the csproj file because it wouldn't compile with my project while it still referenced .NET 2.x. Feel free to ignore those changes.

d4n3436 commented 2 years ago

Hello, and thanks for contributing. Could you please edit the code so the only changes are made to the _languages dictionary? The extra comments are unnecessary too. Also, you can undo the changes to the other files through your code editor.

19 Additional languages

Could you please provide your sources for these new languages? I obtained the languages from official documentation and inspecting the code from the translator pages, and I couldn't find any new languages. I tested some of the new languages and some work, some don't, and some are just aliases like es-MX or en-GB. This can be determined because the API returns the language codes as es and en. Please make sure all the new languages are working and that their NativeName properties contain the native name and not the English name. Preferably use Wikipedia to get the native names (for consistency with the other languages).

Class to allow external exposure to the language list.

The language list can already be accessed through the Language.LanguageDictionary property. It's a static property so the list is only created once.

The language list is sorted, so as to more easily identify any duplicates, and it makes it easier to find a language on the list.

The language list was already sorted by language name but I'm OK with the new order.

I added an extra try catch on AggregateTranslator.cs, because in some cases an exception was thrown inside the catch code.

There's no way an exception could be thrown there. The code is just adding Exception objects to the list. Could you provide a sample code that produces an exception there?

David-Maisonave commented 2 years ago

Hi,

The extra comments are unnecessary too.

I'll remove the extra comments.

Could you please edit the code so the only changes are made to the _languages dictionary?

I made the change to AggregateTranslator.cs because the logic adding an exception to a list was throwing an exception. It kept happening whenever HTTP error 429 occurred. If you're sure you don't want that change, I can remove it until you pull the changes.

I see the LanguageServiceDetails class as an essential class to the GTranslate library. There has to be a way for any program using GTranslate to be able to tell what languages are supported, and which language tags to use. By exposing the supported language details, there's no need to have the "alias" logic, because the calling application doesn't have to just blindly send a request. Moreover, in my testing, this alias logic caused multiple problems until I got rid of it.

Also, you can undo the changes to the other files through your code editor.

I'm going to use my branch of GTranslate to keep the changes that don't make it into the main branch, because I need these changes for my project. If you're not able to select which files to take, I'll temporarily remove the changes in the other files until you pull in the changes. I can also temporarily put back the alias code as well.

Could you please provide your sources for these new languages?

I used the name of the languages from the following link:

https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes

Individual Translating Service Source:

Recently Google added 24 languages:

However, I could not find a good list showing what language tags they're using for the new languages. I was only able to get a little over half of them to work.

Microsoft Service Pack and Language Interface Pack

FYI: I lived in the Philippines for 4 years (1983-1987) when I was stationed at Clark AB, Philippines. Back then, a person saying that the locals spoke Filipino was like saying Mexicans spoke Mexican. So I was a little annoyed when I saw Filipino listed in the MS Language Pack and on Bing Translator. But I very recently did a Google search on Tagalog vs. Filipino, and much to my surprise they are considered two different languages. https://learningfilipino.com/blog/difference-between-tagalog-and-filipino/

However, Filipino is derived from Tagalog.

With that said, I think it's a good idea to keep both languages on the list, but with each one referencing the other.

David-Maisonave commented 2 years ago

Since my last post was very long, I'm answering in multiple posts.

I tested some of the new languages and some work, some don't, and some are just aliases like es-MX or en-GB. This can be determined because the API returns the language codes as es and en.

Please keep in mind that I removed the alias logic when testing. In my code, I'm never going to use the alias logic, and IMHO, it should be removed, or the API should have an option to bypass the alias logic. Every time a new language is added, the alias logic will always be a concern. I tested all the languages that I added, and they worked for me. I also speak Spanish, so it was easy to verify es-MX.

Can you give me some details on what's not working?

I recommend you try the ABetterTranslator. It has an installer, and there's a tab you can use to test an individual language. It works on the fly, while the user is typing in the text. I'm going to update the program today, with the new version having a feature to save a set of languages.

Please make sure all the new languages are working and that their NativeName properties contain the native name and not the English name. Preferably use Wikipedia to get the native names (for consistency with the other languages).

I didn't test the languages using either the English name or the native name. My code only uses the language tag when sending a translation request. The English names and the native names are only used in the GUI.

The language list can already be accessed through the Language.LanguageDictionary property. It's a static property so the list is only created once.

I did not realize that! I've taken out that class and associated logic.

I'm mainly a C++ programmer, so there are some aspects of C# which are still new to me. Static classes are something I learned about only very recently while trying to address a different issue.

There's no way an exception could be thrown there. The code is just adding Exception objects to the list. Could you provide a sample code that produces an exception there?

I also thought the same thing. I believe some code corruption is triggering the issue. I'll see if I can reproduce the issue, and get more details.

David-Maisonave commented 2 years ago

Sorry, but there's something I should have mentioned in my last post. I used the ABetterTranslator program on the program itself to make it multilingual. It was working just fine until yesterday, when more than a dozen new strings were added. That gave it a total of 200 strings to translate. At that point 22 languages failed to translate. These 22 languages include languages that were already in GTranslate.

The program still works on a Resx file that has 77 strings, so I believe the failure has to do with the quantity of strings and the assigned translator.

What all the failed languages have in common is that they explicitly have only Google as the translator. If you look at the screenshot below, all the languages except for Somali are listed as having only Google as the translator.

[Screenshot: The22FailedGoogleOnlyLanguages]

Now, yesterday Somali was failing, and it only had Google as the translator. Today, I merged the code with the latest changes on GTranslate, which changed Somali to having multiple translators.

After that change, Somali now works.

I originally thought the Google Translator was failing, but now I think there has to be something else within the GTranslate code that is triggering the issue.

Corsican is the first language alphabetically that was already in GTranslate, before my changes. So I tried to reproduce the issue with that language. But I can't seem to reproduce the issue manually. Here's the error I get when it runs through the normal process: GTranslate_..:toLang(co):fromLang(en):Translator=(AggregateTranslator): No translator provided a valid result. (Response status code does not indicate success: 429 (Too Many Requests).) (Unable to get the data from the response.)

Is there anything unique about languages having only Google Translate that could trigger this issue?

d4n3436 commented 2 years ago

Please keep in mind that I removed the alias logic when testing. In my code, I'm never going to use the alias logic, and IMHO, it should be removed, or the API should have an option to bypass the alias logic.

The dictionary of aliases was created to make GTranslate easier to use by recognizing that a specific string belongs to a language name, alternative name or native name and then obtaining its ISO code. Removing the dictionary of aliases would be bad for usability and performance because instead of making 2 lookups the code would have to check all entries in the dictionary and search for a match in 3 properties.
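
The two-lookup path described above can be sketched roughly as follows. This is a minimal illustration, not GTranslate's actual code: `LanguageInfo`, `LanguageLookup`, `TryGetLanguage` and the sample entries are hypothetical names chosen for the example, and it assumes C# 9 on .NET 5 or later.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the library's language entry; not GTranslate's real type.
public sealed record LanguageInfo(string Name, string NativeName, string Iso6391, string Iso6393);

public static class LanguageLookup
{
    // Primary table keyed by ISO 639-1 code.
    private static readonly Dictionary<string, LanguageInfo> Languages =
        new(StringComparer.OrdinalIgnoreCase)
        {
            ["es"] = new("Spanish", "Español", "es", "spa"),
            ["sa"] = new("Sanskrit", "संस्कृतम्", "sa", "san"),
        };

    // Alias table: names, native names and alternative codes all map to a primary key.
    private static readonly Dictionary<string, string> Aliases =
        new(StringComparer.OrdinalIgnoreCase)
        {
            ["Spanish"] = "es",
            ["Español"] = "es",
            ["spa"] = "es",
            ["Sanskrit"] = "sa",
            ["संस्कृतम्"] = "sa",
            ["san"] = "sa",
        };

    public static bool TryGetLanguage(string input, out LanguageInfo? language)
    {
        // Lookup 1: resolve an alias to its primary code (or keep the input as-is).
        var code = Aliases.TryGetValue(input, out var resolved) ? resolved : input;

        // Lookup 2: fetch the entry by primary code.
        return Languages.TryGetValue(code, out language);
    }
}
```

Without the alias table, resolving "Español" would mean scanning every entry and comparing the name, native name and code fields, which is the cost the comment above refers to.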

I tested all the languages that I added, and they worked for me. I also speak Spanish, so it was easy to verify es-MX.

I know, but I was referring to the fact that the API itself returns an es code when translating text to es-MX indicating that the API treats es-MX and es the same, thus making the addition of es-MX unnecessary.

I didn't test the languages using either the English name or the native name. My code only uses the language tag when sending a translation request. The English names and the native names are only used in the GUI.

I was referring to updating the new languages so that their NativeName properties actually contain the native names and not the English names.

An example:

["sa"] = new("Sanskrit", "Sanskrit", "sa", "san", TranslationServices.Google),

The second parameter should be the native name, which is "संस्कृतम्".
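
One way to catch entries like this before submitting is to flag every entry whose native name is identical to its English name. The sketch below is only an illustration: `NativeNameCheck` and `FindSuspectEntries` are hypothetical names, and the entries are passed as plain tuples rather than GTranslate's own types.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NativeNameCheck
{
    // Returns the codes of entries whose native name equals the English name.
    // These are candidates for review, not certain errors: for English itself
    // the two names are legitimately the same.
    public static List<string> FindSuspectEntries(
        IEnumerable<(string Code, string Name, string NativeName)> entries)
    {
        return entries
            .Where(e => string.Equals(e.Name, e.NativeName, StringComparison.OrdinalIgnoreCase))
            .Select(e => e.Code)
            .ToList();
    }
}
```

Running it over the new entries would list "sa" for the line above, since both name fields are "Sanskrit".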

Is there anything unique about languages having only Google Translate that could trigger this issue?

That's happening because Google is IP banning you for making too many requests in a short period of time. The 429 error code reflects that, and that affects both GoogleTranslator and GoogleTranslator2 classes. Since all available translators have thrown an exception, the AggregateTranslator will throw an AggregateException because it can't continue.

The second exception (Unable to get the data from the response.) comes from GoogleTranslator2. That happens when the translator can't parse the response because the string that is supposed to contain the (actual) translation response is null.

David-Maisonave commented 2 years ago

I know, but I was referring to the fact that the API itself returns an es code when translating text to es-MX indicating that the API treats es-MX and es the same, thus making the addition of es-MX unnecessary.

How would the API do this without the alias code?

The second parameter should be the native name, which is "संस्कृतम्".

I understand now. But where are you getting the native names from? I did a quick Google search, and I didn't see a list of ISO 639-1 codes that includes the native names.

The second exception (Unable to get the data from the response.) comes from GoogleTranslator2.

When does the code use GoogleTranslator vs. GoogleTranslator2?

That's happening because Google is IP banning you for making too many requests in a short period of time.

That's what I originally thought. But that wouldn't explain why Somali failed when only the Google service was assigned, and ran successfully when Bing and Microsoft were added.

That is unless the code is set up to jump to another translation service when one service fails. Maybe I missed it, but I didn't see that type of logic in GTranslate.

In any case, I did find a workaround for this error. I had an upper limit of 10,000 characters per translation request. Every language started working after I lowered the upper limit to 5,000. The code still translates the same amount of data, but when the limit is reached, it breaks up the translation into multiple translation requests. It now makes twice as many translation requests with fewer characters in each request.

This further proves that the issue is not related to having too many requests, and instead it's being triggered by the number of characters in a single translation request.

But I'm still not sure why 10,000 is a good limit for almost all the languages except for those only having Google as the translating service.
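
The workaround described above, splitting the work into requests that stay under a character limit, could look roughly like this. It is a sketch under assumptions rather than the actual ABetterTranslator code: `RequestBatcher` and `SplitIntoBatches` are hypothetical names, and the batching counts only the characters of the strings themselves, ignoring any separators a real request might add.

```csharp
using System;
using System.Collections.Generic;

public static class RequestBatcher
{
    // Groups strings into batches whose combined length stays within maxChars.
    public static List<List<string>> SplitIntoBatches(IEnumerable<string> strings, int maxChars)
    {
        if (maxChars <= 0)
            throw new ArgumentOutOfRangeException(nameof(maxChars));

        var batches = new List<List<string>>();
        var current = new List<string>();
        var currentLength = 0;

        foreach (var s in strings)
        {
            // Start a new batch when adding this string would exceed the limit.
            if (current.Count > 0 && currentLength + s.Length > maxChars)
            {
                batches.Add(current);
                current = new List<string>();
                currentLength = 0;
            }

            // A single string longer than maxChars still gets a batch of its own;
            // splitting inside one string is left out of this sketch.
            current.Add(s);
            currentLength += s.Length;
        }

        if (current.Count > 0)
            batches.Add(current);

        return batches;
    }
}
```

Each batch would then be sent as one translation request, with `maxChars` set to 5,000 as in the workaround.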

David-Maisonave commented 2 years ago

I did find a link with native names: https://omniglot.com/language/names.htm And it has an Excel spreadsheet download link.

I'll update the code with the native names.

d4n3436 commented 2 years ago

How would the API do this without the alias code?

I don't understand what you mean.

When does the code use GoogleTranslator vs. GoogleTranslator2? That is unless the code is set up to jump to another translation service when one service fails.

That's what AggregateTranslator does. It uses the first available translator and if that translator throws an exception then it uses the next one and so on. If there are no translators left, it throws an AggregateException containing all the exceptions.

You can see the order of the translators in AggregateTranslator.cs:

public AggregateTranslator()
    : this(new GoogleTranslator(), new GoogleTranslator2(), new MicrosoftTranslator(), new YandexTranslator(), new BingTranslator())
{
}
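
The fallback behaviour described above can be illustrated with a small standalone sketch. This is not the AggregateTranslator source: `FallbackSketch` and `TranslateWithFallbackAsync` are hypothetical names, and each translator is modelled as a plain delegate taking the text and target language so the example does not depend on GTranslate's real interfaces.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class FallbackSketch
{
    // Tries each translator delegate in order and returns the first successful result.
    public static async Task<string> TranslateWithFallbackAsync(
        IEnumerable<Func<string, string, Task<string>>> translators,
        string text,
        string toLanguage)
    {
        var errors = new List<Exception>();

        foreach (var translate in translators)
        {
            try
            {
                return await translate(text, toLanguage).ConfigureAwait(false);
            }
            catch (Exception ex)
            {
                // Remember why this translator failed, then move on to the next one.
                errors.Add(ex);
            }
        }

        // Every translator failed (or none were supplied).
        throw new AggregateException("No translator provided a valid result.", errors);
    }
}
```

Under this model, a language served only by Google has nothing left to fall back on once both Google translators fail, while a language with more services assigned can still succeed through a later translator.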

In any case, I did find a workaround for this error. I had an upper limit of 10,000 characters per translation request. Every language started working after I lowered the upper limit to 5,000.

Both GoogleTranslator and GoogleTranslator2 have a limit of 5000 characters. If that limit is exceeded, the API truncates the text. YandexTranslator has a limit of 10000 characters. BingTranslator has a limit of 1000 characters and MicrosoftTranslator has a limit of 50000 characters.

David-Maisonave commented 2 years ago

The following line has the wrong label:

["zh-TW"] = new("Chinese (Traditional)", "繁體中文 (繁體)", "zh-TW", "zho-TW", TranslationServices.Google | TranslationServices.Bing | TranslationServices.Microsoft),

That should be Chinese (Traditional, Taiwan) or Chinese (Taiwan). That was bothering me for a while, but I didn't know what was the correct code for Chinese (Traditional) until you posted the following link: https://api.cognitive.microsofttranslator.com/languages?api-version=3.0&scope=translation

That line should be changed to the following two lines:

["zh-Hant"] = new("Chinese (Traditional)", "繁體中文 (繁體)", "zh-Hant", "zh-Hant", TranslationServices.Microsoft),
["zh-TW"] = new("Chinese (Taiwan)", "中文（臺灣）", "zh-TW", "zho-TW", TranslationServices.Google | TranslationServices.Bing | TranslationServices.Microsoft),

I ran a test, and all three of the languages selected in the snapshot below give different results for their associated language.

[Screenshot: ChineseTraditional]

d4n3436 commented 2 years ago

zh-Hant and zh-Hans are language tags built from ISO 15924 script codes (Hant and Hans), and only BingTranslator/MicrosoftTranslator use them. zh-Hant is equivalent to zh-TW and zh-Hans is equivalent to zh-CN, and there are already aliases for them, so they redirect to zh-TW and zh-CN.

David-Maisonave commented 2 years ago

FYI: Just to give you a heads-up, I plan to have my own variation of GTranslate which will have languages that don't make it on this update. I'm also going to add the following changes.

d4n3436 commented 2 years ago

This is FOSS software, feel free to fork the project and adjust it to your needs. Be sure to follow the conditions of the MIT license.

This would've been merged much faster if it wasn't for the changes to existing languages and the addition of non-existing/redundant languages, but I appreciate the contribution nonetheless.