Parallel datasets - Githubissues

prnk04 commented 3 years ago

Hi, I am trying to create a POC using CodeGen to translate code written in vb to Java and vice-versa. I downloaded the training data for vb and java using Google BigQuery. Also, I have completed the preprocessing step using commands:

python -m codegen_sources.preprocessing.preprocess /content/Facebook_CodeGen/fastTrainingData_monoFunc_updated_v1 --langs vb java --mode=monolingual_functions --local=True --bpe_mode=fast --train_splits=10 --percent_test_valid=10
python -m codegen_sources.preprocessing.preprocess /content/Facebook_CodeGen/fastTrainingData_monoFunc_updated_v1 --langs vb java --mode=monolingual --local=True --bpe_mode=fast --train_splits=10 --percent_test_valid=10

As a result, the following files were created inside the folder XLM-syml:

test.[java_cl|java_monolingual|java_sa|vb_cl|vb_monoligual|vb_sa].pth
train.[java_cl|java_monolingual|java_sa|vb_cl|vb_monoligual|vb_sa [0-9]].pth
valid.[java_cl|java_monolingual|java_sa|vb_cl|vb_monoligual|vb_sa].pth

Post that, I trained the MLM model using the following command: python codegen_sources/model/train.py --exp_name mlm_vb_java_fast_mono_updated_v0 --dump_path '/content/Facebook_CodeGen/dumpPath_fast_mono_updated' --data_path '/content/Facebook_CodeGen/fastTrainingData_monoFunc_updated_v1/XLM-syml' --mlm_steps 'vb_sa,java_sa' --add_eof_to_stream true --word_mask_keep_rand '0.8,0.1,0.1' --word_pred '0.15' --encoder_only true --n_layers 6 --emb_dim 1024 --n_heads 8 --lgs 'vb_sa-java_sa' --max_vocab 64000 --gelu_activation false --roberta_mode false --amp 2 --fp16 true --batch_size 16 --bptt 512 --epoch_size 200 --max_epoch 100000 --split_data_accross_gpu global --optimizer 'adam_inverse_sqrt,warmup_updates=10000,lr=0.0001,weight_decay=0.01' --save_periodic 0 --validation_metrics _valid_mlm_ppl --stopping_criterion '_valid_mlm_ppl,10'

However, when I am trying to train transcoder model using following command, I am getting AssertionError: /content/Facebook_CodeGen/fastTrainingData_monoFunc_updated_v1/XLM-syml/valid.java_sa-vb_sa.java_sa.0.pth error. Command: python codegen_sources/model/train.py --exp_name transcoder_vb_java_updated_v1 --dump_path '/content/drive/MyDrive/dumpPath_updated_transcoder_v0' --data_path '/content/Facebook_CodeGen/fastTrainingData_monoFunc_updated_v1/XLM-syml' --split_data_accross_gpu local --bt_steps 'vb_sa-java_sa-vb_sa,java_sa-vb_sa-java_sa' --ae_steps 'vb_sa,java_sa' --lambda_ae '0:1,30000:0.1,100000:0' --word_shuffle 3 --word_dropout '0.1' --word_blank '0.3' --encoder_only False --n_layers 0 --n_layers_encoder 6 --n_layers_decoder 6 --emb_dim 1024 --n_heads 8 --lgs 'java_sa-vb_sa' --max_vocab 64000 --gelu_activation false --roberta_mode false --reload_model '/content/Facebook_CodeGen/dumpPath_fast_mono_updated/mlm_vb_java_fast_mono_updated_v1/fkmc1busqw/checkpoint.pth,/content/Facebook_CodeGen/dumpPath_fast_mono_updated/mlm_vb_java_fast_mono_updated_v1/fkmc1busqw/checkpoint.pth' --reload_encoder_for_decoder true --amp 2 --fp16 true --tokens_per_batch 3000 --group_by_size true --max_batch_size 128 --epoch_size 100 --max_epoch 10000000 --split_data_accross_gpu global --optimizer 'adam_inverse_sqrt,warmup_updates=10000,lr=0.0001,weight_decay=0.01' --eval_bleu true --eval_computation true --has_sentences_ids true --generate_hypothesis true --save_periodic 1 --validation_metrics 'valid_vb_-java_mt_comp_acc' --lgs_mapping 'vb_sa:vb,java_sa:java'

Could you please help me as to how do I get these parallel datasets? Also, is there something/some step that I am missing or doing incorrectly?

baptisteroziere commented 3 years ago

Hi. There is no easy way to get a parallel dataset. For TransCoder, we created one by extracting parallel functions from GeeksForGeeks and for vb <-> Java you'd need to find parallel functions somewhere or to create a parallel dataset yourself by translating some functions by hand. Parallel datasets are only needed for evaluating the model automatically, if you want to train a model without running the evaluations, you can remove the pairs in bt steps from required_para here https://github.com/facebookresearch/CodeGen/blob/002710c985f0a691a1d01f141dca34c3e24f2dc1/codegen_sources/model/src/data/loader.py#L500 and stop evaluating on these pairs by removing this line in the evaluator: https://github.com/facebookresearch/CodeGen/blob/002710c985f0a691a1d01f141dca34c3e24f2dc1/codegen_sources/model/src/evaluation/evaluator.py#L283

You should be able to get a working model this way but you won't be able to select the best checkpoint easily and you won't get a translation score automatically.

prnk04 commented 3 years ago

Oh. Okay. Understood. Thank you so much for the help!

prnk04 commented 3 years ago

And just to be clear, the steps that I am doing currently, are fine right? (except for the evaluation one, for which a parallel dataset is required)

baptisteroziere commented 3 years ago

It's mostly fine assuming you are training on 10 GPUs (because --train_splits=10).

There are just a few things you may want to change:

We trained the MLM model on the full monolingual data instead of only the _sa data. It could work as well but if you train this way you only need to create the monolingual_functions dataset. Also you shouldn't set --lgs_mapping 'vb_sa:vb,java_sa:java' since the MLM model was trained on vb_sa,java_sa.
10% for validation and testing seems like a lot if you have millions of training examples. Note that you can modify the n_sentences_eval parameter which limits the number of examples used for validation (1500 by default) to avoid the evaluation taking too long. In practice the variance should be low enough with a few thousands of examples.

prnk04 commented 3 years ago

Thank you for the suggestions. Just for clarification:

Then name of the path '/content/Facebook_CodeGen/fastTrainingData_monoFunc_updated_v1/XLM-syml' is a bit misleading. It actually contains data processed in both monolingual as well as monolingual_functions mode.
Since I am using small training data for POC purposes (around 1500 in both the json files), that is why I used 10% for validation and testing.

I hope these things are fine.

baptisteroziere commented 3 years ago

your mlm steps being 'vb_sa,java_sa', you will train the MLM on standalone functions only
I expect you to be able to learn to translate from 1500 monolingual examples in each language. I'm not sure what your goal is exactly but I would try to get bigger monolingual datasets (in the hundreds of thousands of files) to train an unsupervised model.

prnk04 commented 3 years ago

I have changed the MLM step from vb_sa,java_sa to vb_monolingual,java_monolingual, so that should make my MLM model train on a monolingual dataset
I am using smaller files for now, just for POC purposes. For actual training, I will use a bigger dataset.

prnk04 commented 3 years ago

By the way, thank you for your help!

prnk04 commented 3 years ago

Hi, I created a parallel dataset for vb and java and completed the preprocessing. However, the following files were created as symlinks and not cross-language files like 'valid.java_sa-vb_sa.java_sa.0.pth'.

test.java_cl-java_sa.java_cl.pth, train.vb_cl-vb_sa.vb_cl.0.pth, test.java_cl-java_sa.java_sa.pth, train.vb_cl-vb_sa.vb_sa.0.pth, test.vb_cl-vb_sa.vb_cl.pth, valid.java_cl-java_sa.java_cl.pth, test.vb_cl-vb_sa.vb_sa.pth, valid.java_cl-java_sa.java_sa.pth, train.java_cl-java_sa.java_cl.0.pth, valid.vb_cl-vb_sa.vb_cl.pth, train.java_cl-java_sa.java_sa.0.pth, valid.vb_cl-vb_sa.vb_sa.pth

Could you please help me figure out where in the code exactly does the symlinks for crosslingual files are created? I tried to figure it out, but on datasetmode.py, I found the following line of code(https://github.com/facebookresearch/CodeGen/blob/48df64838709e92ed5766dab464386479227f0db/codegen_sources/preprocessing/dataset_modes/dataset_mode.py#L709), which as per my understanding creates monolingual files. create_symlink( f"../{lang}.{split}.{suffix}{self.bpe.ext}.pth", XLM_folder.joinpath( f"{split}.{lang}_{suffix1}-{lang}_{suffix2}.{lang}_{suffix}.pth" ), ) Thanks in advance

baptisteroziere commented 3 years ago

Yes we didn't add an option to add proper symlinks for multilingual datasets in the pipeline. The simplest solution for you would be to run the monolingual pipeline and to add the right symlinks yourself after double checking that the files are really parallel (same number of lines and aligned). If you use the monolingual_functions pipeline, it might be harder to ensure that your files will be parallel (the pipeline won't ensure it). For instance you could call this for lang in [lang1, lang2] (you need lang1 < lang2 so here lang1= java and lang2 = vb). create_symlink( f"../{lang}.{split}.{suffix}{self.bpe.ext}.pth", XLM_folder.joinpath( f"{split}.{lang1}_{suffix1}-{lang2}_{suffix2}.{lang}_{suffix}.pth" ), )

You could also create the symlinks directly with the ln -s command in bash.

prnk04 commented 3 years ago

Okay. So, basically, after ensuring that the files are really parallel, I need to manually create the symlinks. And just to be sure, it'll be like the following: For java:

[train|test|valid].java_sa-vb_cl.java_sa.pth will be a symlink to java.[train|test|valid].sa.bpe.pth
[train|test|valid].java_sa-vb_cl.java_cl.pth will be a symlink to java.[train|test|valid].cl.bpe.pth

For vb:

[train|test|valid].java_sa-vb_cl.vb_sa.pth will be a symlink to vb.[train|test|valid].sa.bpe.pth
[train|test|valid].java_sa-vb_cl.vb_cl.pth will be a symlink to vb.[train|test|valid].cl.bpe.pth

Am I right?

baptisteroziere commented 3 years ago

Well _sa stands for standalone and _cl for class methods so you probably won't translate between java_sa and cb_cl.
I would expect something like this:

[train|test|valid].java_sa-vb_sa.java_sa.pth will be a symlink to java.[train|test|valid].sa.bpe.pth
[train|test|valid].java_cl-vb_cl.java_cl.pth will be a symlink to java.[train|test|valid].cl.bpe.pth
[train|test|valid].java_sa-vb_sa.vb_sa.pth will be a symlink to vb.[train|test|valid].sa.bpe.pth
[train|test|valid].java_cl-vb_cl.vb_cl.pth will be a symlink to vb.[train|test|valid].cl.bpe.pth

By the way, if you have enough parallel data to have a parallel train set you should also add mt_steps for java_sa-vb_sa,vb_sa-java_sa and same thing for class methods. If you only use back-translation and denoising auto-encoding the training steps will ignore all your train.java_sa-vb_sa.lang.pth files.

prnk04 commented 3 years ago

Oh okay. Understood. Thank you so much for the guidance!

prnk04 commented 2 years ago

Hi! The vocab for monolingual (standalone)functions and the parallel dataset (standalone functions) should be the same. Right? Also, should the data from the parallel dataset be included in the preprocessing of monolingual functions?

prnk04 commented 2 years ago

Hi! Another question. Why do we have the following condition? https://github.com/facebookresearch/CodeGen/blob/48df64838709e92ed5766dab464386479227f0db/codegen_sources/preprocessing/dataset_modes/dataset_mode.py#L250

Aren't there such cases where an entire program might have all standalone or all class functions, and not both? In such cases, the number of errors will increase significantly, thus reducing the size of our data.

baptisteroziere commented 2 years ago

Hi Yes the vocab and BPE codes should be the same for all the languages/types of inputs you are training on. You can include the data from the parallel dataset in the preprocessing of the monolingual functions if you want to, but it's probably fine if you learn the BPE codes on the monolingual dataset and just give the right BPE codes and vocab path to the preprocessing pipeline (those learned for the monolingual dataset) when preprocessing your parallel data. It's probably fine because there shouldn't be many tokens that are often present in the parallel dataset and not in the monolingual dataset.

The monolingual_func dataset is not parallel and this line won't be executed. It should be executed when the dataset is parallel, for instance for DOBF where we create a parallel obfuscated code // corresponding dictionary dataset. In that case, the parallel sample is not valid if any of the elements is not present. Does that make sense?

prnk04 commented 2 years ago

So, will it be fine if I perform the following steps?
- Preprocess the data in monolingual mode(for MLM)
- Preprocess the data in the monolingual function mode
- Preprocess the parallel codes in the monolingual function mode and use the vocab and codes generated in step 2 as arguments for preprocessing
Yeah. It sort of does. So, for the parallel dataset, all I need to do is, preprocess them using the pipeline and create symlinks after making sure that the resultant files are truly parallel. Right?

baptisteroziere commented 2 years ago

1) use the vocab and codes of step 1 (preprocessing in monolingual mode) for step 2 and 3. 2) yes

prnk04 commented 2 years ago

Oh Okay. Thank you so much:)

prnk04 commented 2 years ago

Hi! I preprocessed the data as mentioned above. I first preprocessed the data in monolingual mode and then used the vocab and codes as generated by this step to preprocess the same data in monolingual_functions mode and later to preprocess the parallel codes. Also, I made sure that the vb.valid.sa.bpe and java.valid.sa.bpe have parallel data. Also, vb.test.sa.bpe and java.test.sa.bpe are parallel files. Post this, I created the symlinks for parallel data as mentioned above. Post-training the MLM model, I tried to train the transcoder model but I got an error while loading Evaluator at line https://github.com/facebookresearch/CodeGen/blob/9720a8bdba18552cc499975e1f1fb6a7eca74612/codegen_sources/model/src/data/dataset.py#L208

errorLog

As per my understanding, here we are trying to create batches of sentences using the index of '|' from the loaded binarized data(which the the bpe file?). And since the code couldn't find the location of this index in the sentences, hence this error occurred. I checked the bpe files and they don't contain the data in the form of |. Rather it contains only the tokenized functions. So, what is the need to splitting sentences based on the position of '|'? Or is it trying to create batches from the tok files? So, could you please help me figuring out the root of this error?

baptisteroziere commented 2 years ago

If you are evaluating with unit tests organized like ours, you need to add a function ID at the beginning of the test and validation sequences (we use it in the evaluation code to find the evaluation script corresponding to the sample from the valid/test set). The format is ID | code.

If you are not testing with unit tests, you can just set --has_sentence_id false and --eval_computation false and you won't need to have sentence ids in your eval/test sets.

prnk04 commented 2 years ago

Okay. Thank you!

dineshkh commented 2 years ago

Hi @prnk04 @brozi ,

Just to be sure is it correct that test.java_cl-java_sa.java_cl.pth file is nothing but a symbolic link to java.test.cl.bpe.pth file ?

facebookresearch / CodeGen

Parallel datasets #40