OiWorld / MindTheWord

An extension for Google Chrome that helps people learn new languages while they browse the web.

Reverse-translate from user-defined translation lists #51

Open chris838 opened 8 years ago

chris838 commented 8 years ago

After testing the app with the new user-defined translations feature for a week or two, it has become pretty clear to me which additional feature would bring the largest improvement for the smallest amount of effort (for me personally, at least).

The main problem is that the user-defined translation map gives very crude, context-blind translations that often appear confusing. This disrupts my ability to fluently read and comprehend the text, normally to the point where I have to disable the extension out of frustration.

My proposed solution would be to "reverse" translate: use Yandex or Google to translate the entire page into the target language; select user-defined words and phrases within the translated page; revert the remaining text back into the source language whilst keeping the selected text in the target language.
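A minimal sketch of a simplified variant of this idea, working one sentence at a time: translate the whole sentence, then swap in a user-defined phrase only if the machine translation actually uses the user's translation in that context. The names `reverse_translate`, `user_map`, and `fake_translate` are hypothetical, and the translator is a canned stand-in rather than a real Google or Yandex call. It does not attempt to reorder words.

```python
from typing import Callable, Dict


def reverse_translate(
    sentence: str,
    user_map: Dict[str, str],
    translate: Callable[[str], str],
) -> str:
    """Swap in a user-defined translation only when the full-sentence
    machine translation contains that same translation."""
    translated = translate(sentence)  # whole sentence, so context is kept
    result = sentence
    # Try longer source phrases first so they win over shorter ones.
    for source in sorted(user_map, key=len, reverse=True):
        target = user_map[source]
        # Plain substring checks: a deliberate simplification.
        if source in result and target in translated:
            result = result.replace(source, target)
    return result


# Hypothetical stand-in for a translation service (returns canned text).
def fake_translate(text: str) -> str:
    return "成千上万的人聚集在安卡拉的中心。"


user_map = {"centre": "中心", "bank": "银行"}
mixed = reverse_translate(
    "Thousands of people gathered in the centre of Ankara.",
    user_map,
    fake_translate,
)
print(mixed)  # Thousands of people gathered in the 中心 of Ankara.
```

Because the check is a plain substring match, this sketch cannot tell which occurrence of a word the translator mapped to which output, so it only filters out translations that do not fit the context at all.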

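I haven't really started scoping this out yet, but obvious problems I can already foresee:

- Does translating entire pages in this way constitute fair usage of Google/Yandex? Are we likely to exceed free translation limits or have our API key blocked?
- Does Google/Yandex give us access to the mapping between source and target text that we would need?
- How do we resolve differences in grammar and word ordering?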

chris838 commented 8 years ago

Expanding upon the last problem mentioned above, here's some example English text translated into Chinese:

Original: Thousands of people gathered in the centre of Ankara.
Google: 成千上万的人聚集在安卡拉的中心。
Yandex: 数千人聚集的中心安卡拉。

Notice how Google correctly places 的中心 (meaning in the centre of) at the end of the sentence. In an ideal world, how would we best present this as a combination of English and Chinese for the immersive learner? Here would be my best attempt:

Thousands of people gathered in Ankara 的中心.

However without true understanding of either language, the best the app would be able to do would probably be one of the following:

Google: Thousands of people gathered Ankara 的中心.
Yandex: Thousands of people gathered 的中心 Ankara.

Both of these seem acceptable here, but it would be interesting to see other cases where the resulting translation is nonsensical. We should bear in mind that Google/Yandex translation isn't perfect either.

ceilican commented 8 years ago

> Does translating entire pages in this way constitute fair usage of Google/Yandex? Are we likely to exceed free translation limits or have our API key blocked?

Yandex has a 10 million character per month limit. If we assume that a page has an average of 5000 characters, then the limit is 2000 pages per month, or about 66 pages per day. For some users, this limit might be ok. For others, not.
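The estimate above can be reproduced with a few lines. The 10 million character limit is the figure quoted in this comment; the 5000-character page size and the 30-day month are assumptions, and the function names are only illustrative.

```python
MONTHLY_CHAR_LIMIT = 10_000_000  # Yandex limit quoted in the comment
AVG_CHARS_PER_PAGE = 5_000       # assumed average page size
DAYS_PER_MONTH = 30              # assumed month length


def pages_per_month(limit: int = MONTHLY_CHAR_LIMIT,
                    per_page: int = AVG_CHARS_PER_PAGE) -> int:
    """Whole pages that fit into the monthly character budget."""
    return limit // per_page


def pages_per_day(limit: int = MONTHLY_CHAR_LIMIT,
                  per_page: int = AVG_CHARS_PER_PAGE,
                  days: int = DAYS_PER_MONTH) -> int:
    """Whole pages per day if the budget is spread evenly."""
    return pages_per_month(limit, per_page) // days


print(pages_per_month())  # 2000
print(pages_per_day())    # 66
```

Heavier pages shrink the budget quickly: at 20000 characters per page the same limit allows only 500 pages per month.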

> Does Google/Yandex give us access to the mapping between source and target text that we would need?

No, unfortunately they don't. In the case of Google, when we request the translation of a single word, they provide additional grammatical information about that word. In the case of Yandex, I am not sure. In both cases, no grammatical information is provided when the translation of a longer sentence is requested.

Since making a separate request for every word turned out to be too slow (and is also forbidden by Google), MindTheWord concatenates many words into a single string, with words separated by dots. In the received translation, the translated words are also separated by dots, which are used to split the string back into separate translated words. This is obviously a hacky solution, and it compromises the translation quality, because Yandex and Google use statistical machine translation and try to understand the concatenated words as if they formed an actual sentence.
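A rough sketch of the dot-separated batching described above. This is not MindTheWord's actual code: the exact separator, the function names, and the `fake_translate` stand-in are all assumptions made for illustration.

```python
from typing import Callable, List

SEPARATOR = " . "  # assumed separator; the real one may differ


def translate_batch(words: List[str],
                    translate: Callable[[str], str]) -> List[str]:
    """Translate many words with one request by joining them with dots."""
    joined = SEPARATOR.join(words)
    translated = translate(joined)
    # Split on dots and drop empty pieces (e.g. from a trailing dot).
    parts = [p.strip() for p in translated.split(".") if p.strip()]
    if len(parts) != len(words):
        # The service merged or split items, so the mapping is unreliable.
        raise ValueError("translated batch does not line up with the input")
    return parts


# Hypothetical stand-in for the translation service (word-by-word lookup).
def fake_translate(text: str) -> str:
    lookup = {"house": "casa", "dog": "perro", "book": "libro"}
    return SEPARATOR.join(lookup.get(w, w) for w in text.split(SEPARATOR))


print(translate_batch(["house", "dog", "book"], fake_translate))
# ['casa', 'perro', 'libro']
```

The length check matters because a real service may treat the joined string as a sentence and merge or reorder items, which is exactly the quality problem mentioned above.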

> How do we resolve differences in grammar and word ordering?

I think that is a hard and yet unsolved research problem.

Clearly, MindTheWord is more useful for languages with similar grammatical structure (e.g. translating between European languages). For languages with different grammatical structures, the user must accept that MindTheWord (as it is now) will only be helpful for memorizing words, but not for learning grammar.

However, this thread of comments gave me an idea: what if we had an option that allowed MindTheWord to translate whole sentences instead of isolated words? This option would be useful for users who already have a sufficiently good vocabulary and who are interested in grammatically well-formed sentences. By translating whole sentences, we can rely on Google/Yandex to correctly change the word order for us. What do you think of this idea? It is not exactly what you want, but it seems simpler to implement.
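A minimal sketch of such an option, assuming a naive regex-based sentence splitter and a rule of translating every n-th sentence. The names `translate_some_sentences` and `every` are hypothetical, and the translator passed in would be a real Google/Yandex call in practice.

```python
import re
from typing import Callable

# Naive splitter: break after ., ! or ? followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def translate_some_sentences(text: str,
                             translate: Callable[[str], str],
                             every: int = 3) -> str:
    """Translate every `every`-th sentence as a whole; keep the rest."""
    sentences = SENTENCE_END.split(text.strip())
    out = []
    for i, sentence in enumerate(sentences):
        if i % every == every - 1:
            out.append(translate(sentence))  # whole sentence, order intact
        else:
            out.append(sentence)
    return " ".join(out)


# str.upper stands in for a translator so the effect is easy to see.
print(translate_some_sentences("One. Two. Three. Four.", str.upper, every=2))
# One. TWO. Three. FOUR.
```

The fixed interval is only one possible selection rule; choosing sentences by length or by vocabulary the user already knows would fit the same structure.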

ceilican commented 8 years ago

> However without true understanding of either language, the best the app would be able to do...

Exactly. Google/Yandex do not give us sufficient information about the grammatical structure of the language. They do not give us the true understanding of the languages that we would need for achieving that. Doing this ourselves for all supported languages would be too much work. I am not aware of a usable free service capable of receiving a natural language sentence as input and returning a parse tree for that sentence, especially not for a wide range of languages. This topic in natural language processing is still a very active research area. If you would like to do research on that, let me know, and I can try to put you in touch with people who are doing that.

chris838 commented 8 years ago

Agree with almost everything above and I think your idea of translating whole sentences could prove to be a possible compromise (although it might be hard to limit sentence translation to a manageable difficulty level). I'm still considering the best way forward (low on free time currently).

One thing I would point out (in case you weren't aware) - the Google Translate web page does preserve some kind of mapping between input and output text. Hovering over the output text you can see groups of words highlighted, with corresponding highlighting visible in the input text.

ceilican commented 8 years ago

> One thing I would point out (in case you weren't aware) - the Google Translate web page does preserve some kind of mapping between input and output text. Hovering over the output text you can see groups of words highlighted, with corresponding highlighting visible in the input text.

Yes, I am aware, but somehow I forgot to take it into account in my previous comments. Thank you for reminding me.