cdli-gh / Semi-Supervised-NMT-for-Sumerian-English

Exploring the Limits of Low-Resource Neural Machine Translation
MIT License

How to use model for training on custom dataset #16

Closed kr-sundaram closed 4 years ago

kr-sundaram commented 4 years ago

Thanks for making the repo public!

I want to use your repository to develop a machine translation model for both EN to DE and DE to EN.

But I am not sure how to get started, as you have added many features to the repository, which is really appreciable.

Could you please let me know how to use your repository for preprocessing, training, and evaluation?

RachitBansal commented 4 years ago

Hey @kr-sundaram,

You can choose whichever method (transformer, backtranslation-onmt, XLM) you want to run by going to the translation folder. You may have to make a couple of changes to adapt the scripts to EN<-->DE.

Please refer to the README files in the subfolders to see how you can use the method for end-to-end NMT.

Additionally, I would also recommend having a look at the fairseq repository.

kr-sundaram commented 4 years ago

Thank you very much!

I was going through the code and was thinking about using backtranslation-onmt for my purpose, as I have around 1M parallel En-De sentences and around 6M monolingual De sentences. So I thought it would be better to go with back-translation.

I just want to make sure I have understood the workflow for getting a robust NMT model from these repositories; please correct me if I am wrong:

  1. Tokenize the cleaned data with either BBPE or BPE (SentencePiece).

  2. Train the model on the parallel dataset, i.e. En-De. (I am not sure how you did this for your En-Sum dataset, since back-translation requires some initial trained model, and in the backtranslation-onmt directory you simply fetch the trained model from amazonaws. Please clarify how you trained and evaluated your initial model.)

  3. Divide the monolingual dataset into shards and run 'btPipeline.sh': translate each shard with the trained model, stack the translations with that shard, continue training the model, and then move on to the next shard, repeating until all shards have been translated and the model has been trained on them (roughly the loop sketched below).
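
Roughly, I picture that loop like this (just a Python sketch to check my understanding; translate_with_model and continue_training are placeholder names, not the repository's actual functions or scripts):

from pathlib import Path

# Placeholder for whatever btPipeline.sh actually calls to translate a shard
# (e.g. an OpenNMT translate step); here it only echoes the input so the
# sketch runs standalone.
def translate_with_model(checkpoint, lines):
    return ["<translation of> " + line for line in lines]

# Placeholder for continuing training from the current checkpoint on the
# augmented parallel corpus; a real pipeline would write and return a new checkpoint.
def continue_training(src_lines, tgt_lines, init_from):
    return init_from

def back_translate(mono_file, parallel_src, parallel_tgt, checkpoint, n_shards=10):
    mono = Path(mono_file).read_text(encoding="utf-8").splitlines()
    shard_size = -(-len(mono) // n_shards)  # ceiling division

    for i in range(n_shards):
        shard = mono[i * shard_size:(i + 1) * shard_size]
        if not shard:
            break
        # 1. Translate the monolingual shard with the current model.
        synthetic = translate_with_model(checkpoint, shard)
        # 2. Stack the synthetic pairs on top of the existing parallel data.
        parallel_src += synthetic
        parallel_tgt += shard
        # 3. Continue training from the current checkpoint, then take the next shard.
        checkpoint = continue_training(parallel_src, parallel_tgt, init_from=checkpoint)

    return checkpoint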

Kindly let me know how you trained your initial model, and which tokenizer do you think will work best in my case?

One more thing: could you please tell me which translation technique (vanilla transformer, back-translation, XLM, or MASS) gave you the best results for your case?

RachitBansal commented 4 years ago
  1. The current pipeline for BT and the vanilla Transformer uses SentencePiece, as that is what gave us the best results with the vanilla transformer (see the short SentencePiece snippet after this list).

  2. Those are not Amazon AWS models; they are just the trained weights/checkpoints of the models we have trained (stored in AWS S3 buckets). In the case of back-translation, the initial model is currently the vanilla transformer, which was obtained by using the 'transformer' folder to train on our data. You can use any En-De model for your use case, or train one using 'transformer'; you will just have to store the initial weights as mentioned and specify the path correctly while running the btPipeline.
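
For reference, the usual SentencePiece calls look roughly like this (generic usage of recent versions of the sentencepiece Python package; file names, vocab size, and model type are illustrative, not the exact settings of our preprocessing scripts):

import sentencepiece as spm

# Train a subword model on the cleaned training text.
spm.SentencePieceTrainer.train(
    input="train.de",        # your cleaned training text (illustrative file name)
    model_prefix="spm_de",
    vocab_size=8000,
    model_type="bpe",
)

# Load the trained model and tokenize / detokenize a sentence.
sp = spm.SentencePieceProcessor(model_file="spm_de.model")
pieces = sp.encode("Ein kleines Beispiel.", out_type=str)
print(pieces)             # subword pieces
print(sp.decode(pieces))  # original sentence restored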

Back-translation has given us the best results so far, but this project is still in very active development and we expect to achieve even better results in the coming days (especially using semi-supervised approaches with XLM and MASS).

Thanks for your interest.

kr-sundaram commented 4 years ago

Hi @RachitBansal

I am sorry about all my silly questions, but as I am new to this I need to understand. I have a few queries:

  1. Are the weights you are talking about the trained vanilla transformer model? So, in the btPipeline.sh file, do we simply provide, in the 'weightsDir' variable, the path to the transformer model trained on the bilingual data?

  2. Has the MASS unsupervised model been fully developed, and did it work well on your dataset? If yes, which gave you better results: back-translation or the MASS unsupervised method?

  3. I am a bit confused, as there are two folders inside the 'translation' folder (https://github.com/cdli-gh/Unsupervised-NMT-for-Sumerian-English/tree/master/translation): one is backtranslation-onmt and the other is backtranslation. 'backtranslation-onmt' uses the OpenNMT repo and 'backtranslation' uses the fairseq repo. Which one should I go with, with respect to faster convergence and accuracy of the trained model?

  4. Below is a sample snippet from the btPipeline.sh file. Could you please explain the significance of the terms 'HEAD' and '9852ff06c444ff221e4577815c7dcac64a41a054', and what they should be for my use case?

<<<<<<< HEAD
...
...
>>>>>>> 9852ff06c444ff221e4577815c7dcac64a41a054

Thank you very much for helping me out!

RachitBansal commented 4 years ago

Hi @kr-sundaram,

  1. Yes, if you have pre-trained transformer model weights, save them inside {weightsDir}/0st with the name _step_10000.pt. So, in the weightsDir variable, you have to put the path to the directory that contains 0st/_step_10000.pt (see the small sketch at the end of this comment).

  2. Back-translation has given better performance so far, but that may be very different for your data.

  3. Go for backtranslation-onmt; the fairseq one is still under development, while the onmt one is ready and easy to use.

  4. Those are leftover Git merge-conflict markers that were committed by mistake; I have updated the file.
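
For point 1, placing a pre-trained checkpoint where the pipeline expects it could look like this (the paths are illustrative; only the 0st/_step_10000.pt naming comes from the pipeline):

import os
import shutil

weights_dir = "bt-weights"                           # whatever you point weightsDir at
pretrained = "transformer-run/model_step_100000.pt"  # your trained vanilla transformer (illustrative path)

# Create {weightsDir}/0st and copy the checkpoint in under the expected name.
os.makedirs(os.path.join(weights_dir, "0st"), exist_ok=True)
shutil.copy(pretrained, os.path.join(weights_dir, "0st", "_step_10000.pt"))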

kr-sundaram commented 4 years ago

Thanks for your kind help!

I found that the backtranslateONMT.py file also contains the same 'HEAD' and '9852ff06c444ff221e4577815c7dcac64a41a054' markers. Kindly remove them if they are not needed.

<<<<<<< HEAD
...
...
>>>>>>> 9852ff06c444ff221e4577815c7dcac64a41a054

Just one question about MASS!

I have one confusion regarding the train, test, and valid files in the mono and para directories for the NMT pre-training and fine-tuning tasks.

I understand that dict.en.txt and dict.sum.txt should be exactly the same in both the mono and para directories, and that the para directory should contain the bilingual data used to fine-tune the model. The confusion I have is basically about the mono directory and the number of examples it should contain for each language in its respective train, test, and valid files.

The number of sentences, and the sentences themselves, can differ between the two languages in the mono directory, right? I mean, it should not matter if one uses, let's say, 100 sentences for en and 200 sentences for sum, as they are just monolingual data.

The only point to note is that both the mono and para directories should share the same dictionary files, right?

RachitBansal commented 4 years ago

Yes, you are absolutely right. The mono files for the two languages need not contain the same number of sentences at all; that may vary according to data availability. You only need to keep two things in mind (a quick sanity check is sketched after this list):

  1. All the parallel data should contain proper line-by-line bitext. The number of lines should, thus, also be equal.
  2. The dictionary files should remain the same for mono and para.
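
A quick sanity check for those two points could look like this (directory and file names follow the dict.en.txt / dict.sum.txt convention discussed above, but are otherwise illustrative):

import filecmp
from pathlib import Path

data = Path("data")

# 1. Parallel data must be line-aligned bitext: equal number of lines on each side.
en_lines = (data / "para" / "train.en").read_text(encoding="utf-8").splitlines()
sum_lines = (data / "para" / "train.sum").read_text(encoding="utf-8").splitlines()
assert len(en_lines) == len(sum_lines), "parallel train files are not line-aligned"

# 2. The dictionary files must be identical between mono/ and para/.
for lang in ("en", "sum"):
    same = filecmp.cmp(data / "mono" / f"dict.{lang}.txt",
                       data / "para" / f"dict.{lang}.txt",
                       shallow=False)
    assert same, f"dict.{lang}.txt differs between mono/ and para/"

print("data layout looks consistent")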

RachitBansal commented 4 years ago

Feel free to re-open if you have any further queries.