DeepTrans is a character-level language model for transliterating English text into Hindi. It is based on the attention mechanism presented in [1] and is inspired by the translation model in TensorFlow's sequence-to-sequence tutorial. The project comes with a pretrained model for Hindi (2 layers with 256 units each), which you can use as-is, fine-tune, or retrain from scratch. Note that the pretrained models are trained on lowercase words. If you train your own model, feel free to experiment however you like; I would be glad if you shared your results and models with me, and I hope to see interesting results.
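Since the model works at the character level, each word is consumed and produced as a sequence of characters. Here is a minimal sketch of that idea; it is illustrative only, since the actual vocabulary handling lives in transliterate.py and may use different tokens and IDs:

```python
# Hedged sketch: character-level encoding for seq2seq transliteration.
# Illustrative only; the real vocabulary files shipped with DeepTrans
# are built by transliterate.py and may differ.

def build_char_vocab(words):
    """Map each character seen in the corpus to an integer ID."""
    vocab = {"<pad>": 0, "<go>": 1, "<eos>": 2, "<unk>": 3}
    for word in words:
        for ch in word:
            vocab.setdefault(ch, len(vocab))
    return vocab

def encode(word, vocab):
    """Turn a lowercase word into a list of character IDs plus <eos>."""
    return [vocab.get(ch, vocab["<unk>"]) for ch in word.lower()] + [vocab["<eos>"]]

vocab = build_char_vocab(["hello", "world"])
print(encode("hello", vocab))  # [4, 5, 6, 6, 7, 2]
```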
I have tested it on Ubuntu 15.04 with an NVIDIA GeForce GT 740M graphics card, with TensorFlow running in a virtual environment. It should run smoothly on any other system with TensorFlow installed.
```
git clone https://github.com/dashayushman/deep-trans.git
```
```
python transliterate.py --self_test
```
This generates a dummy model (2 layers, 32 units per layer) with synthetic data and trains it for 5 steps. If the code runs without errors, proceed to the next step.
```
trained_model
  |_version_1.0
    |_model_12_09_2016.zip
    |_model_12_09_2016.tar
  |_version_0.1
    |_model_9_08_2016.zip
    |_model_9_08_2016.tar

vocabulary
  |_version_1.0
    |_vocab_12_09_2016.zip
    |_vocab_12_09_2016.tar
  |_version_0.1
    |_vocab_9_08_2016.zip
    |_vocab_9_08_2016.tar
```
The pretrained models and vocabularies are versioned, with a date attached to the name of each compressed file. Downloading the latest version is recommended. The download link contains both .tar and .zip files; both contain the same model, so download either one. Make sure the date and version of your model and vocabulary match.
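For example, a matching pair can be unpacked with Python's standard library. This is a minimal sketch; the archive names are just the ones shown in the listing above, and the target directories are whatever you later pass as --train_dir and --data_dir:

```python
# Hedged sketch: unpack a matching model/vocabulary pair with the
# Python standard library. File names below are placeholders.
import zipfile

with zipfile.ZipFile("model_12_09_2016.zip") as zf:
    zf.extractall("trained_model/")  # this becomes --train_dir

with zipfile.ZipFile("vocab_12_09_2016.zip") as zf:
    zf.extractall("vocabulary/")     # this becomes --data_dir
```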
Execute the following command from your command line to load the pretrained model and enter an interactive mode, where you can type English strings on standard input and see the results immediately.
```
python transliterate.py --data_dir <path_to_vocabulary_directory> --train_dir <path_to_models_directory> --decode
```
Your command line should then show a '>' prompt. Type your English word after the '>' and hit Enter to see the result.
Execute the following command from your command line to load the pretrained model and transliterate an entire file. Make sure your file contains one English word per line and is named 'test.en'.
```
python transliterate.py --data_dir <path_to_vocabulary_directory> --train_dir <path_to_models_directory> --transliterate_file --transliterate_file_dir <path_to_directory_that_contains_test.en>
```
If you get a 'done generating the output file!!!' message on your command line, then you are good to go. You will find a 'results.txt' file in your 'transliterate_file_dir'.
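If you prefer to prepare 'test.en' programmatically, here is a hedged sketch. Note that a strict line-by-line correspondence between 'test.en' and 'results.txt' is an assumption about the output format, not something this README guarantees:

```python
# Hedged sketch: write a lowercase, one-word-per-line test.en and read
# back results.txt. A line-by-line pairing between input and output is
# an assumption, not a documented guarantee.
words = ["namaste", "dhanyavad", "dost"]

with open("test.en", "w", encoding="utf-8") as f:
    f.write("\n".join(w.lower() for w in words) + "\n")

# ... run transliterate.py with --transliterate_file here ...

with open("results.txt", encoding="utf-8") as f:
    for word, hindi in zip(words, f):
        print(word, "->", hindi.strip())
```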
Once you have your training and development files in a directory, execute the following command to start training your own model.
```
python transliterate.py --data_dir <path_to_directory_with_training_and_development_files> --train_dir <path_to_a_directory_to_save_checkpoints> --size=<number_units_per_layer> --num_layers=<number_of_layers> --steps_per_checkpoint=<number_of_steps_to_save_a_checkpoint>
```
The following is a real example of the above:
```
python transliterate.py --data_dir /home/ayushman/projects/transliterate/train_test_data/ --train_dir /home/ayushman/projects/transliterate/chkpnts/ --size=1024 --num_layers=5 --steps_per_checkpoint=1000
```
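In case you are assembling the parallel training and development files yourself, here is a hedged sketch; the file names 'train.en' and 'train.hn' are placeholders for illustration, not names confirmed by this project:

```python
# Hedged sketch: writing parallel source/target word lists. The names
# 'train.en' and 'train.hn' are placeholders for illustration only.
pairs = [("namaste", "नमस्ते"), ("dost", "दोस्त"), ("kitab", "किताब")]

with open("train.en", "w", encoding="utf-8") as en, \
     open("train.hn", "w", encoding="utf-8") as hn:
    for english, hindi in pairs:
        en.write(english.lower() + "\n")  # pretrained models expect lowercase
        hn.write(hindi + "\n")
```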
The following is a list of available flags that you can set for changing the model parameters.
FLAG | VALUE TYPE | DEFAULT VALUE | DESCRIPTION |
---|---|---|---|
learning_rate | Float | 0.001 | Learning rate for backpropagation through time. |
learning_rate_decay_factor | Float | 0.99 | Learning rate decays by this much. |
max_gradient_norm | Float | 5.0 | Clip gradients to this norm. |
batch_size | Integer | 10 | Batch size to use during training. |
size | Integer | 256 | Size of each model layer. |
num_layers | Integer | 2 | Number of layers in the model. |
en_vocab_size | Integer | 40000 | English vocabulary size. |
hn_vocab_size | Integer | 40000 | Hindi vocabulary size. |
data_dir | String(path) | /tmp | Directory that contains the vocabulary/data files. |
transliterate_file_dir | String(path) | /tmp | Directory that contains 'test.en'; 'results.txt' is written here. |
train_dir | String(path) | /tmp | Training directory (to save checkpoints or models). |
max_train_data_size | Integer | 0 | Limit on the size of training data (0: no limit). |
steps_per_checkpoint | Integer | 200 | How many training steps to do per checkpoint. |
decode | Boolean | False | Set to True for interactive decoding. |
transliterate_file | Boolean | False | Set to True for transliterating a file. |
self_test | Boolean | False | Run a self-test if this is set to True. |
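As a small worked example of how learning_rate and learning_rate_decay_factor interact: the decay is applied multiplicatively, and exactly when the training loop triggers a decay is internal to the script, so the counts below are illustrative.

```python
# Hedged sketch: how learning_rate_decay_factor compounds over
# successive decay events (using the default flag values above).
learning_rate = 0.001
decay_factor = 0.99

for num_decays in range(5):
    effective = learning_rate * decay_factor ** num_decays
    print(f"after {num_decays} decays: {effective:.6f}")
# after 0 decays: 0.001000
# ...
# after 4 decays: 0.000961
```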