How to get parallel data to train AE and BT

facebookresearch / TransCoder

Public release of the TransCoder research project https://arxiv.org/pdf/2006.03511.pdf

How to get parallel data to train AE and BT #22

Open keai007 opened 3 years ago

keai007 commented 3 years ago

Hi, I am trying to preprocess another programming language to train a new model, but I cannot figure out how to get the parallel data used when training with AE & BT, e.g. test.python_sa-cpp_sa.pth. I would appreciate it very much if you could help me.

orazheng commented 3 years ago

Hi @keai007, I am trying to preprocess another programming language as well. Did you figure out how to get the parallel data? Thanks!

keai007 commented 3 years ago

Hi, you should take a look at https://github.com/facebookresearch/XLM#1-preparing-the-data-1. TransCoder is based on XLM, and that repo contains much clearer tutorials and meaningful discussions. It helped me a lot, and I hope it helps you too.
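
For readers following that pointer, here is a minimal sketch of the file names an XLM-style data loader might look for with parallel validation and test data. It assumes the naming pattern `<split>.<src>-<tgt>.<lang>.pth` with the language pair written in sorted order, which is my reading of the XLM data preparation section and has not been confirmed in this thread for TransCoder. The helper `parallel_file_names` is hypothetical and not part of either repository.

```python
def parallel_file_names(split, lang1, lang2):
    """Return the two binarized file names for one split of a language pair.

    Assumption (not confirmed in this thread): XLM-style naming,
    "<split>.<src>-<tgt>.<lang>.pth", with the pair in sorted order.
    """
    src, tgt = sorted([lang1, lang2])
    pair = f"{src}-{tgt}"
    # One file per side of the pair; both sides are expected to be
    # line-aligned so that line i in one file matches line i in the other.
    return tuple(f"{split}.{pair}.{lang}.pth" for lang in (src, tgt))


# Hypothetical usage with the language names mentioned in this issue.
for split in ("valid", "test"):
    print(parallel_file_names(split, "python_sa", "cpp_sa"))
```

Comparing these names against what your own preprocessing run actually wrote to disk is a quick way to see which files are missing. Check the data-loading code of the repo you are using before relying on the pattern.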

orazheng commented 3 years ago

@keai007 Thank you very much for sharing!

vthost commented 3 years ago

@keai007 (or any of the authors!) I looked at the get-data-para.sh script in the other repository. After preprocessing, we already have tokenized train/valid/test sets, with BPE applied and binarized. Do we just have to duplicate and rename those test/valid files?

perlconverter commented 3 years ago

Hi, I am trying to preprocess another programming language to train a new model, but I cannot figure out how to get the parallel data used when training with AE & BT, e.g. test.python_sa-cpp_sa.pth. I would appreciate it very much if you could help me.

HELP

perlconverter commented 3 years ago

HELP

perlconverter commented 3 years ago

HELP perlconverter@gmail.com

perlconverter commented 3 years ago

HELP from INDIA

himanshu034 commented 3 years ago

Has anyone found a process for generating the parallel dataset used when training with AE & BT? Any help would be much appreciated.