Thank you for your feedback.
This project is not in active development now and we do not have immediate plans to resume work on this.
On the other hand, I'd like to be clear that we will most probably never consider development in TensorFlow :) We prefer Keras and PyTorch.
@Hrant-Khachatrian thanks, it makes sense. By the way, it's a great project! Closing...
@loretoparisi By the way, a collaborator and I are working on https://github.com/deeplanguageclass/fairseq-transliteration for translit, by modifying the FAIR PyTorch seq2seq implementation to work at the character level.
For pedagogical purposes I made it work in Jupyter and on Google Colab, which has a GPU option: https://github.com/deeplanguageclass/fairseq-transliteration.ipynb.
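The char-level part is mostly a data-formatting question; the conversion we do is roughly like the minimal sketch below (the word-boundary marker and file names here are only illustrative, not the exact code in our repo):

```python
# Convert word-level parallel lines into char-level token sequences for seq2seq training.
# The "▁" boundary marker and the file names are illustrative, not our repo's exact choices.

WORD_BOUNDARY = "▁"

def to_char_level(line: str) -> str:
    """Turn "barev dzez" into "b a r e v ▁ d z e z"."""
    tokens = []
    for word in line.strip().split():
        tokens.extend(list(word))
        tokens.append(WORD_BOUNDARY)
    return " ".join(tokens[:-1])  # drop the trailing boundary marker

if __name__ == "__main__":
    with open("train.src") as fin, open("train.char.src", "w") as fout:
        for line in fin:
            fout.write(to_char_level(line) + "\n")
```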
@bittlingmayer @Hrant-Khachatrian Thank you guys! My aim is to generalize transliteration to Japanese and some Indian languages. The problem so far is adapting this repo to another language. For Japanese I have the model from kakasi, but I do not have the rules in the format needed by translit-rnn, while for the Indian languages (like Hindi, Tamil, Telugu, etc.) I do not have the rules at all. @bittlingmayer I can see that the repo is a work in progress and the docs are not specific to the transliterator; could you point me to a working example to start training on a new vocabulary?
@loretoparisi We have not changed the documentation in the forked repo yet, but there is documentation here: https://deeplanguageclass.github.io/lab/fairseq-transliteration/.
It has links to an iPython Notebook in the same GitHub project. It ran on Google Colab, which is good for students because it has a GPU, but I do not recommend it for more than a toy.
The real pain is getting the data, and in the right format.
So we start with rows of monolingual target-language data and use fairseq-transliteration-data/generate.js to generate the source side, the same as in this repo.
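Conceptually the generation step looks something like this (a rough Python sketch only; the real generate.js is JavaScript and its rule table is far larger, so the mapping here is purely illustrative):

```python
import random

# Conceptual sketch of generating the noisy romanized source side from
# target-language text. The rule table below is purely illustrative.
RULES = {
    "ш": ["sh", "w", "6"],   # informal romanizations are one-to-many
    "ч": ["ch", "4"],
    "и": ["i", "u"],
    "а": ["a"],
}

def romanize(line: str) -> str:
    """Generate one noisy romanization of a target-language line."""
    out = []
    for ch in line:
        choices = RULES.get(ch.lower(), [ch])  # pass unknown characters through
        out.append(random.choice(choices))
    return "".join(out)

# Each call can give a different spelling, which is exactly the lossy and
# inconsistent behaviour we want the model to learn to invert.
print(romanize("чаша"))  # e.g. "chasha", "4awa", ...
```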
Then we have the simple human-readable parallel files, but Fairseq can't use them in that format.
So then there are all those steps with the byte-pair encodings before we can even start training.
I would love to see those steps reduced; given parallel data, it really should be just a one-liner, like fastText.
Also note that we should probably use reverse=True. (More on that: https://www.reddit.com/r/LanguageTechnology/comments/9cks6n/intuition_on_the_trick_of_reversing_input/.)
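To be clear, reverse=True would only touch the source side; the target stays in natural order. A tiny sketch:

```python
# Reverse only the *source* sequence, e.g. the char-level line "b a r e v"
# becomes "v e r a b"; the target side is left unchanged.

def reverse_source(line: str) -> str:
    return " ".join(reversed(line.strip().split()))

print(reverse_source("b a r e v"))  # -> "v e r a b"
```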
@bittlingmayer thanks, this is a good starting point, and I think seq2seq modeling is a great approach. I have recently come across indic-trans, which has great coverage of Indian languages and which I'm working with right now. As you said, transliteration rules are the basis, and so is the availability of parallel corpus files. I have also just published a simple Dockerfile for this framework here, where I also have a standard fairseq version. At some point I would like to use one architecture for all languages, rather than different approaches (kakasi, pinyin, indic-trans, g2p-seq2seq, translit-rnn, etc.), but it seems to be a very hard task!
Maybe providing an example of a different set of transliteration rules could help with using your fairseq customization, like the Japanese transliteration rules used by NodeNatural here.
Some questions. What differences do you see between your implementation and a TensorFlow T2T seq2seq approach (like in the g2p-seq2seq project above)? Also, I can see you are using BPE to pre-process the corpus, which is typically useful for translation tasks (Fairseq, OpenNMT) but also for classification (as a tokenization approach). Did you try any (simpler) alternative?
Right, I think one approach for all languages is the right way, and the tech to do it is already here. Actually I think the approach varies more by domain than by language: transliteration styles and conventions about what to leave in the original alphabet or what to translate are subtly different across domains.
Parallel corpora would be ideal, especially for validation, but requiring them will always be a great burden.
What differences do you see between your implementation and a TensorFlow T2T seq2seq approach (like in the g2p-seq2seq project above)?
I don't know that approach; I can just say what I know about Fairseq: it uses BPE, does pre-processing for punctuation etc. (but probably shouldn't), has attention, is GPU-only, and does not reverse the input (but could, and probably should). It also supports checkpointing and restarting, and just trains until the loss stops dropping significantly, rather than for a fixed number of epochs. Then there is our mod from word-level to char-level and, necessitated by that, the mod to increase the maximum row width (maximum sequence length).
For teaching purposes we tried it with only 1M rows, but we have 30M or so, and a process to make much more, for many languages. In my opinion the approach will not have been properly tried until we 1) stop pre-processing, 2) reverse the input, 3) increase the dataset size and train with production params and 4) do more realistic generation with a bit more randomness.
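By 4) I mean something like sampling from the softmax with a temperature instead of always taking the argmax or the top beam; a minimal PyTorch-style sketch (the decoder call here is a stand-in, not Fairseq's actual API):

```python
import torch

def sample_next_char(logits: torch.Tensor, temperature: float = 0.7) -> int:
    """Sample the next output symbol instead of taking the argmax.

    `logits` is the decoder's unnormalized score vector over the vocabulary
    for one step; temperature < 1.0 stays close to greedy, > 1.0 adds noise.
    """
    if temperature <= 0:
        return int(torch.argmax(logits))          # fall back to greedy decoding
    probs = torch.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

# Usage inside a (hypothetical) decoding loop:
# logits = decoder_step(prev_char, state)   # stand-in for the real decoder call
# next_char = sample_next_char(logits, temperature=0.8)
```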
Did you try any (simpler) alternative?
Yes, we tried https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html; the results were not good. And there is TranslitRNN.
By the way, as far as I can tell, that Japanese mapping is for generating a single formal Romanisation from proper Japanese.
But the mapping here in TranslitRNN and in our Fairseq impl is for generating diverse informal Romanisations.
https://en.wikipedia.org/wiki/Informal_romanizations_of_Cyrillic#Translit, https://en.wikipedia.org/wiki/Greeklish, https://en.wikipedia.org/wiki/Arabic_chat_alphabet, https://en.wikipedia.org/wiki/Romanization_of_Bulgarian#Informal_writing, https://en.wikipedia.org/wiki/Romanization_of_Persian#ASCII_Internet_romanizations.
They are lossy AND inconsistent, a sort of "one-way function", so the problem is much harder than for a specific known scientific transcription.
@bittlingmayer yes. For Japanese, the best solution I have found that is production-ready at this time is the good old kakasi. The Japanese dictionaries were derived from the SKK dictionaries available here. For Indic languages I'm getting very good results with indic-trans, which was presented at Google Summer of Code 2016 (see here) and which has a paper behind it, the IIIT-H System Submission for the FIRE 2014 Shared Task on Transliterated Search, here.
I'm now using the pre-trained models, but the training sets for the Indic languages they support should be available as well.
Maybe a good starting point would be to collect all transliteration resources (dictionaries, parallel corpora, rules, current statistical and NN models, papers, etc.) somewhere. I'm not aware of anything like this.
+ @osoblanco
You are welcome to do it under http://nlpguide.github.io/transliteration. I made the repo, invited you as admin and enabled GitHub Pages.
I also recently took over the orphaned r/machinetranslation. Anything related to the transliteration task is fine there; it is sort of a subset of the translation task, and the two are very entangled.
These weeks I'm a bit busy, but I have a lot of thoughts about this, about new approaches and where it can go, especially around training "languageless" / "universal" models that handle more than one pair or even all pairs.
@bittlingmayer great idea, and a lot of useful things in the boilerplate 👍
That's a pretty interesting project. I'm currently using the C Kakasi for Kanji-to-Romaji Japanese transliteration, so I wrote a Node.js wrapper for it, Kakasi.js, and mirrored the sources here for a possible enhancement. For sure a model would perform better for out-of-vocabulary words. I'm also working on a grapheme-to-phoneme encoder/decoder model, g2p-seq2seq, using TensorFlow Tensor2Tensor Transformers, derived from the CMU G2P-Seq2Seq project.
I think that this very complex model (based on a bi-LSTM) could work in this case as well. I would strongly suggest moving to TensorFlow + Tensor2Tensor for a more modern architectural design of the neural network model, with encoder/decoder plus attention.
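Whatever the framework, the encoder/decoder-plus-attention shape being discussed is roughly the following; a minimal PyTorch sketch (since the rest of the thread is PyTorch-based), with all sizes and module names purely illustrative and not taken from any of the projects above:

```python
import torch
import torch.nn as nn

# Minimal bi-LSTM encoder + attention decoder, roughly the architecture discussed
# above. Dimensions and module names are illustrative, not any project's real code.

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hid_dim, batch_first=True, bidirectional=True)

    def forward(self, src):                        # src: (batch, src_len)
        outputs, _ = self.rnn(self.embed(src))     # (batch, src_len, 2*hid_dim)
        return outputs

class AttnDecoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTMCell(emb_dim + 2 * hid_dim, hid_dim)
        self.attn = nn.Linear(hid_dim, 2 * hid_dim)  # project state to score encoder outputs
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, enc_out, tgt):               # tgt: (batch, tgt_len)
        batch = tgt.size(0)
        h = enc_out.new_zeros(batch, self.rnn.hidden_size)
        c = enc_out.new_zeros(batch, self.rnn.hidden_size)
        logits = []
        for t in range(tgt.size(1)):
            # Dot-product attention over encoder outputs using the current state.
            scores = torch.bmm(enc_out, self.attn(h).unsqueeze(2)).squeeze(2)
            weights = torch.softmax(scores, dim=1)                   # (batch, src_len)
            context = torch.bmm(weights.unsqueeze(1), enc_out).squeeze(1)
            h, c = self.rnn(torch.cat([self.embed(tgt[:, t]), context], dim=1), (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)           # (batch, tgt_len, vocab_size)

# Toy usage with random character ids:
enc, dec = Encoder(40), AttnDecoder(40)
src = torch.randint(0, 40, (2, 11))
tgt = torch.randint(0, 40, (2, 9))
print(dec(enc(src), tgt).shape)                     # torch.Size([2, 9, 40])
```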