facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

Applying Subword Regularization to Training (Changing training inputs every epoch) #1137

Closed minstar closed 5 years ago

minstar commented 5 years ago

Hi, I'm currently doing research on applying "Subword Regularization" to training an NMT model, where a segmentation is sampled from the segmentation candidates at every parameter update. I am trying to apply this method to the "IWSLT17" dataset provided in examples/translation.

I noticed that there is a bash file that generates the "segmented files", and preprocess.py, which creates the dictionary and bin files. During training, the inputs are fixed. However, I want to change the inputs every epoch (that is, change each input's segmentation by sampling).

Is this possible? Or are there any suggestions? In the training file, should I decode the given segmented inputs into raw text and then sample (among the segmentation candidates) from it?

Thank you.
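
To make the "decode, then resample" idea in the question concrete, here is a minimal pure-Python sketch. It is not fairseq code. It assumes BPE-style pieces joined with an `@@` continuation marker, a toy vocabulary, and uniform sampling over the candidate splits, whereas Subword Regularization proper samples according to a unigram language model. The names `desegment`, `candidates`, and `sample_segmentation` are hypothetical helpers made up for illustration.

```python
import random


def desegment(pieces, marker="@@"):
    """Join marker-suffixed pieces ("low@@", "er") back into raw text."""
    return " ".join(pieces).replace(marker + " ", "")


def candidates(word, vocab):
    """Enumerate every split of `word` into pieces from `vocab`.

    Single characters are always allowed (an assumption of this sketch),
    so every word has at least one candidate segmentation.
    """
    if not word:
        return [[]]
    result = []
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if end == 1 or piece in vocab:
            for rest in candidates(word[end:], vocab):
                result.append([piece] + rest)
    return result


def sample_segmentation(text, vocab, rng, marker="@@"):
    """Pick one candidate split per word, uniformly at random."""
    out = []
    for word in text.split():
        pieces = rng.choice(candidates(word, vocab))
        # Every piece except the last in a word carries the marker.
        out.extend(p + marker for p in pieces[:-1])
        out.append(pieces[-1])
    return out


if __name__ == "__main__":
    vocab = {"low", "er", "new", "est", "lower"}  # toy vocabulary
    raw = desegment(["low@@", "er", "new@@", "est"])
    rng = random.Random(0)
    for epoch in range(3):
        print(epoch, " ".join(sample_segmentation(raw, vocab, rng)))
```

Enumerating all splits is exponential in word length, so this only suits short words or a demonstration; a real setup would more likely rely on a segmenter that supports sampling directly.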

huihuifan commented 5 years ago

There is no easy way to do this, but you could modify the code that prepares the batches of data so that it changes the segmentation candidates, or read multiple streams of data (one for each segmentation you want) and choose one of them.
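
The second option (several pre-segmented streams, one chosen per sentence) could look roughly like the sketch below. This is plain Python, not fairseq's dataset API: the class `MultiSegmentationDataset` and its `set_epoch` method are hypothetical names, and wiring such a wrapper into the actual batch preparation code is left open.

```python
import random


class MultiSegmentationDataset:
    """Wraps several segmentations of the same corpus.

    `streams[k][i]` is sentence `i` under segmentation `k`. For each
    sentence, one stream is drawn at random, and the draw is redone
    whenever `set_epoch` is called.
    """

    def __init__(self, streams, seed=0):
        if not streams or any(len(s) != len(streams[0]) for s in streams):
            raise ValueError("need at least one stream, all of equal length")
        self.streams = streams
        self.seed = seed
        self.set_epoch(1)

    def set_epoch(self, epoch):
        # Derive the RNG from (seed, epoch) so each epoch is reproducible.
        rng = random.Random(self.seed * 100003 + epoch)
        self.choice = [rng.randrange(len(self.streams)) for _ in range(len(self))]

    def __len__(self):
        return len(self.streams[0])

    def __getitem__(self, index):
        return self.streams[self.choice[index]][index]


if __name__ == "__main__":
    streams = [
        ["low@@ er", "new@@ est"],      # segmentation 0
        ["l@@ ow@@ er", "ne@@ west"],   # segmentation 1
    ]
    data = MultiSegmentationDataset(streams, seed=42)
    for epoch in (1, 2, 3):
        data.set_epoch(epoch)
        print(epoch, [data[i] for i in range(len(data))])
```

The trade-off is disk space and preprocessing time for each extra stream, in exchange for leaving the binarized data format untouched.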

minstar commented 5 years ago

@huihuifan Thank you! I'll give it a shot.