42wim / matterbridge

bridge between mattermost, IRC, gitter, xmpp, slack, discord, telegram, rocketchat, twitch, ssh-chat, zulip, whatsapp, keybase, matrix, microsoft teams, nextcloud, mumble, vk and more with REST API (mattermost not required!)
Apache License 2.0

Allow transformation of messages via Google Translate #499

Open patcon opened 6 years ago

patcon commented 6 years ago

Describe the solution you'd like

I am interested in whether matterbridge could be used to help unfragment international communities for which language barriers make collaboration difficult. It would be wonderful if matterbridge could be used to create translatable gateways between tools, or even between different channels within the same tool.

So for example, the g0v-tw Slack could bridge its #general channel with general-en, and messages could be translated into each end's target language.

I can imagine a generalized feature request being for "transformation of messages within gateways in a directional fashion". But the specific request would be to discuss allowing translation via Google Translate's API.

I could imagine this involving a pluggable system that also powered the reformatting of markup in between platforms.

Describe alternatives you've considered

Forking the project.

Additional context

None.

Thanks for your consideration!

42wim commented 6 years ago

Well, it's an interesting idea. But right now, I don't have time to work on this myself. If you know/have someone who can work on this, feel free to open a PR and I can give some pointers if necessary.

patcon commented 6 years ago

Ok, started investigating how a proof-of-concept might work.

Eager to hear your feedback! Thanks in advance for any attention.

(As I'm sure you recall, I'm not a golang programmer, so this might be messy, but happy to take a stab at it 😃 )

Configuration

This seems like an appropriate way to configure it:

# config.toml
[[gateway]]
name="g0v-tw.translation"
enable=true
translate=true
  [[gateway.inout]]
  account="slack.g0v-tw"
  channel="general"
    [gateway.inout.options]
    locale="zh-TW"
  [[gateway.inout]]
  account="slack.g0v-tw"
  channel="general-en"
    [gateway.inout.options]
    locale="en"

We could forgo the translate toggle, and simply assume that a locale plus a Google Translate key means that incoming messages should be translated. (I prefer this approach tbh.)
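That implicit rule could be sketched in Go roughly as follows. The names `inOutOptions` and `shouldTranslate` are hypothetical and not taken from matterbridge; the sketch only assumes that the API key and the destination channel's locale are available as strings.

```go
package main

// inOutOptions is a hypothetical holder for the per-channel settings
// from the [gateway.inout.options] table in the config above.
type inOutOptions struct {
	Locale string // e.g. "zh-TW" or "en"; empty means no locale was configured
}

// shouldTranslate reports whether a message headed for dest should be
// translated: only when a Google Translate API key is present and the
// destination channel declares a locale. No separate toggle is consulted.
func shouldTranslate(apiKey string, dest inOutOptions) bool {
	return apiKey != "" && dest.Locale != ""
}
```

With this rule, a gateway that has no key, or a channel that has no locale, behaves exactly as it does today.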

Implementation

I would then add code either to:

API

In order to translate messages, we would use the following API endpoint:

POST https://translation.googleapis.com/language/translate/v2

Params:
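As a rough Go sketch of such a call: this assumes the v2 endpoint accepts a JSON body with `q`, `target` and `format` fields, takes the API key as a `key` query parameter, and answers with `data.translations[].translatedText`. The helper names (`newTranslateRequest`, `parseTranslation`, `translate`) are made up for illustration and are not part of matterbridge.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/url"
)

const translateEndpoint = "https://translation.googleapis.com/language/translate/v2"

// translateRequest is the JSON body sent to the endpoint (assumed field names).
type translateRequest struct {
	Q      string `json:"q"`      // text to translate
	Target string `json:"target"` // target language, e.g. "zh-TW"
	Format string `json:"format"` // "html" so that markup in the text is respected
}

// translateResponse mirrors only the part of the response used here.
type translateResponse struct {
	Data struct {
		Translations []struct {
			TranslatedText string `json:"translatedText"`
		} `json:"translations"`
	} `json:"data"`
}

// newTranslateRequest builds the POST request without sending it.
func newTranslateRequest(endpoint, apiKey, text, target string) (*http.Request, error) {
	payload, err := json.Marshal(translateRequest{Q: text, Target: target, Format: "html"})
	if err != nil {
		return nil, err
	}
	reqURL := endpoint + "?key=" + url.QueryEscape(apiKey)
	req, err := http.NewRequest(http.MethodPost, reqURL, bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

// parseTranslation extracts the first translated string from a response body.
func parseTranslation(body []byte) (string, error) {
	var resp translateResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return "", err
	}
	if len(resp.Data.Translations) == 0 {
		return "", errors.New("translate: response contained no translations")
	}
	return resp.Data.Translations[0].TranslatedText, nil
}

// translate sends text to the endpoint and returns the translation.
func translate(client *http.Client, apiKey, text, target string) (string, error) {
	req, err := newTranslateRequest(translateEndpoint, apiKey, text, target)
	if err != nil {
		return "", err
	}
	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("translate: unexpected status %s", resp.Status)
	}
	return parseTranslation(body)
}
```

A caller would use something like `translate(http.DefaultClient, key, msg.Text, "zh-TW")`. No source language is sent, so the service is left to auto-detect it.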

Transformation

Since Google Translate goes a little overboard, we'll want to mark some features of messages as non-translatable. Specifically @usernames, #channels, URLs, and `code snippets`. We'll do this by regex'ing for each, and wrapping them in <span translate="no"></span> tags. These tags will remain in the translated strings, but can be stripped out again afterwards.
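A minimal Go sketch of that wrapping step, under some assumptions: the patterns below are deliberately simple guesses at what counts as a username, channel, URL or code snippet, and the function names `protect` and `unprotect` are hypothetical. A single combined regex is used so that each token is wrapped once, even when a `#` or `@` appears inside a URL or a code snippet.

```go
package main

import (
	"regexp"
	"strings"
)

const (
	openTag  = `<span translate="no">`
	closeTag = `</span>`
)

// noTranslate matches, in order of preference at any given position:
// inline code in backticks, http(s) URLs, and @usernames / #channels.
// These patterns are illustrative and will not cover every platform's syntax.
var noTranslate = regexp.MustCompile("`[^`]+`|https?://\\S+|[@#][\\p{L}\\p{N}_-]+")

// protect wraps every non-translatable token in a translate="no" span.
func protect(text string) string {
	return noTranslate.ReplaceAllStringFunc(text, func(token string) string {
		return openTag + token + closeTag
	})
}

// unprotect removes the span tags again once the translation comes back.
func unprotect(text string) string {
	return strings.NewReplacer(openTag, "", closeTag, "").Replace(text)
}
```

The intended flow is `unprotect(translated)` where `translated` is the service's answer to `protect(original)`. The sketch does not HTML-escape the rest of the message, which a real implementation would probably need to do before sending text as HTML.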

A proof-of-concept may also need to generally strip markdown in order for Google Translate to work well. We can subsequently (after PoC) bring this back by converting markdown into HTML (which Google Translate can handle), and then back to markdown.

Attribution

Apparently, we must also add "Powered by Google Translate" to messages.

42wim commented 6 years ago

Looks good! Calling a new function in modifyMessage would be the best. You'll also need an API key for the Google Translate endpoint.

patcon commented 6 years ago

Thanks for vetting and encouragement :) ~modifyMessage already exists, but I'll just choose any old name for now and happy to change later~

patcon commented 6 years ago

> Calling a new function in modifyMessage would be the best.

Ok, after digging around a bit, I'm confused by this suggestion, and was hoping you could help me understand :)

https://github.com/42wim/matterbridge/blob/296428d53e4febb5a82082d3c61628fbd396fd13/gateway/router.go#L97-L108

It appears that modifyMessage, called from handleReceive, is driven by gateway-level config and runs on every message that comes out of the gateway, so its changes apply to all destination channels. handleMessage, by contrast, seems to be called once for each of the potentially many messages generated for the other channels.

Is this correct? If so, then the latter is where the new functionality must be added, because each message must be translated differently based on the Locale settings of the channel it's being dropped into.

Any clarification of my understanding is appreciated! Thanks @42wim!

42wim commented 6 years ago

Yes, your suggestion is correct. (sorry for sending you in the wrong direction)

patcon commented 6 years ago

[Screenshot, 2018-10-08 at 4:53:01 pm]

Yay! Bare working PoC!

patcon commented 6 years ago

This still needs some work, but I just wanted to say that this is WORKING SO WELL! I can already feel it slightly changing how our community is able to communicate, and it's pretty neat!

[Screenshot, 2018-10-11 at 6:17:04 pm]

Thank you so so so SO much for this tool @42wim :)))

42wim commented 6 years ago

Looks pretty cool, good job!

patcon commented 6 years ago

Hey @42wim! Hopefully a quick question:

I'm running into a bit of trouble with the fact that I'm totally transforming the text, and I don't think this was the original intention. It seems that msg.Text is being passed between all bridges, and not as separate instances of the original message. This is unexpected, as I'd like each channel to receive a translation of the original post in the original language. But it keeps mutating.

It took me a while to notice this, as Google Translate only cares about the target language, and auto-detects the origin language, whatever it may be.

So assume I have a gateway with 4 rooms and 4 languages: English, Korean, Chinese, and Japanese. It seems that handleMessage() for the first bridge might transform the Text from English to Chinese, then that Chinese text is transformed into Korean, then Korean into Japanese. This tends to garble the original message, as it sends it through 3 layers of translation.

Does it seem that my understanding is correct? If so, can you think of a way around this? (My first thought would be to store the original message in a field on the Gateway struct, but that's probably wrong.)
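To make the suspected behaviour concrete, here is a small self-contained Go sketch. It is not matterbridge code: `message`, `translateFunc` and both fan-out functions are hypothetical stand-ins, and the translator is passed in as a plain function so the effect can be seen without calling any API.

```go
package main

// translateFunc is a hypothetical stand-in for a Google Translate call.
type translateFunc func(text, targetLocale string) string

// message is a pared-down stand-in for a bridged chat message.
type message struct {
	Text string
}

// fanOutChained reproduces the problem: msg.Text is overwritten on each
// pass, so every destination after the first receives a translation of
// the previous translation rather than of the original post.
func fanOutChained(msg message, locales []string, tr translateFunc) []string {
	out := make([]string, 0, len(locales))
	for _, locale := range locales {
		msg.Text = tr(msg.Text, locale) // mutates the text that the next pass reads
		out = append(out, msg.Text)
	}
	return out
}

// fanOutFromOriginal keeps the original text aside and translates from it
// for every destination, so each channel gets a single layer of translation.
func fanOutFromOriginal(msg message, locales []string, tr translateFunc) []string {
	original := msg.Text
	out := make([]string, 0, len(locales))
	for _, locale := range locales {
		out = append(out, tr(original, locale))
	}
	return out
}
```

With a fake translator that returns `locale + "(" + text + ")"`, the chained version yields `zh(hello)`, `ko(zh(hello))`, `ja(ko(zh(hello)))` for the locales zh, ko, ja, while the second version yields `zh(hello)`, `ko(hello)`, `ja(hello)`.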

As always, thanks for any assistance! :)

patcon commented 6 years ago

Never mind. Figured it out! Hadn't seen the origmsg var 🤦‍♂️