Natural Language Processing for Language Translation

limazix commented 4 years ago

Is your feature request related to a problem? Please describe.

It is not clear if the current method used for language translation is the best approach.

Note: Is there any document or explanation of how translation is currently handled?

Describe the solution you'd like

Lately, deep learning techniques have been giving excellent results for translation. The best-known approach is Seq2Seq (sequence-to-sequence), in which a model is trained on pairs of equivalent sentences in the two languages. Once trained, the model can transform a sentence in one language into the other.
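To make the "pairs of sentences" idea concrete, here is a minimal sketch of how sentence pairs are often turned into training examples before they reach a Seq2Seq model. It is only an illustration under assumptions: the special tokens, the whitespace tokenizer, the helper names (`build_vocab`, `encode`), and the toy English/Portuguese pairs are all made up for this example and do not come from the project's dataset or code.

```python
from collections import Counter

# Hypothetical special tokens; names and ids are assumptions for this sketch.
PAD, SOS, EOS, UNK = "<pad>", "<sos>", "<eos>", "<unk>"


def build_vocab(sentences, min_freq=1):
    """Map every token seen at least min_freq times to an integer id."""
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    vocab = {PAD: 0, SOS: 1, EOS: 2, UNK: 3}
    for tok, freq in sorted(counts.items()):
        if freq >= min_freq:
            vocab[tok] = len(vocab)
    return vocab


def encode(sentence, vocab):
    """Convert a sentence to ids, wrapped in start/end markers."""
    ids = [vocab.get(tok, vocab[UNK]) for tok in sentence.lower().split()]
    return [vocab[SOS]] + ids + [vocab[EOS]]


# Made-up (source, target) pairs, used only to show the data shape.
pairs = [
    ("good morning", "bom dia"),
    ("good night", "boa noite"),
]

# Each language gets its own vocabulary.
src_vocab = build_vocab([src for src, _ in pairs])
tgt_vocab = build_vocab([tgt for _, tgt in pairs])

# The encoder would consume the first list of each tuple;
# the decoder would be trained to produce the second.
training_examples = [
    (encode(src, src_vocab), encode(tgt, tgt_vocab)) for src, tgt in pairs
]

for src_ids, tgt_ids in training_examples:
    print(src_ids, "->", tgt_ids)
```

Words the vocabulary has never seen map to the `<unk>` id, which is one reason a larger and more varied dataset tends to matter for accuracy.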

Describe alternatives you've considered

There are multiple alternatives to implement Seq2Seq:

NLTK - https://www.nltk.org/api/nltk.translate.html
PyTorch + Torchtext - https://pytorch.org/tutorials/beginner/torchtext_translation_tutorial.html
PyTorch with Attention - https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html

There are two problems: how to make the model available, and how to get enough data to reach high accuracy.

As for models, it's crucial to manage them properly through scheduled training, continuous evaluation, and improvement. I believe we can use Watson Data Studio to build and deploy the model, and Watson Machine Learning to expose it.
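One possible way to make the evaluation step measurable is to score model output against reference translations after each training run. The sketch below computes a sentence-level BLEU score in plain Python. Choosing BLEU is an assumption made for illustration, not something decided in this issue, and the function names (`ngrams`, `bleu`) and the example sentences are hypothetical. It uses a single reference and no smoothing, so it returns 0.0 whenever any n-gram order has no overlap.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against one reference, without smoothing."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return bp * math.exp(sum(log_precisions) / max_n)


# Made-up sentences, only to show how the function is called.
score = bleu("the cat sat", "the cat ran", max_n=2)
print(round(score, 4))
```

Tracking a score like this on a fixed held-out set after each scheduled training run would show whether a new model version is actually better than the one currently deployed.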

Regarding data, I suggest opening another issue to define strategies for collecting it.

Additional context

Infrastructure Overview

filipecorrea commented 4 years ago

@limazix I completely agree with you, and we should focus on the problems you listed to start this implementation. Can you help me check whether existing issues already cover them, and open new ones if not?

I just opened one for myself to document the existing translation process.

limazix commented 4 years ago

@filipecorrea, is there any translation sample data? I'm finally able to finish the POC, but I need a short-to-medium data sample to train the model.

filipecorrea commented 4 years ago

@limazix, there's a short-to-medium LIBRAS / PT-BR data sample at https://github.com/IBM/libras/tree/hkbase/data.

Do you need ASL / EN-US? How many sentences? I can ask our collaborators to create that.

limazix commented 4 years ago

@filipecorrea I believe that this dataset will be enough for testing, but which file should I use? How is it organized?

filipecorrea commented 4 years ago

I'll send you a data sample in IBM's Slack.

Natural Language Processing for Language Translation #28