cindyxinyiwang / multiDDS

Code for the paper "Balancing Training for Multilingual Neural Machine Translation" (ACL 2020)
MIT License

Some code fix and questions about the modifications on fairseq source code #1

Closed steventan0110 closed 4 years ago

steventan0110 commented 4 years ago

Hello, I'm Steven, from Johns Hopkins University. I'm currently working on a research project studying different methods to denoise the training data for low-resource languages. I came across your papers (DDS, TCS, and multiDDS) and I'm very interested in your implementation. I started checking this code repo very carefully and found some issues (I sort of "fixed" them in my own way in a forked repo; if you think it's useful to incorporate them into your repo, I can submit a pull request for you to review my changes). Here are the issues:

fairseq beam search is out of date:

the code in fairseq/search.py uses a deprecated form of torch.div, so I updated it using the most recent fairseq beam-search code.
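For reference, the deprecation in question is torch's implicit integer division in beam search's index arithmetic; a minimal sketch of the fix with toy tensors (variable names echo fairseq's style but the values are made up):

```python
import torch

cand_indices = torch.tensor([[5, 12, 27]])  # flattened (beam * vocab) candidate ids
vocab_size = 10

# Deprecated: torch.div on integer tensors used to truncate silently.
# The current API makes the rounding explicit:
beams_buf = torch.div(cand_indices, vocab_size, rounding_mode='floor')  # beam index
indices_buf = cand_indices.fmod(vocab_size)                             # token index
```

With the values above, `beams_buf` is `[[0, 1, 2]]` and `indices_buf` is `[[5, 2, 7]]`, recovering which beam and which vocabulary token each candidate came from.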

undefined variables in trainer.py/update_language_sampler()

I think this is the most important part of the code, since you calculate the gradient similarity between the training set and the dev set to get the similarity score that updates the language distribution. There are some undefined or unused variables, such as self.optimizers and all_sim_list. I changed the code so it only uses one vector, sim_list, although theoretically there should be an N*N sim_list (N is the number of language pairs), which is presumably why you need all_sim_list to collect the individual sim_list vectors, right? My change only supports my own use case, since I'm using just one language pair instead of a multilingual setting, but it shouldn't be hard to fix properly; you might just have left those variables there by accident.
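For anyone following along, the core quantity being computed there is the cosine similarity between training and dev gradients; here is a toy sketch (the function name is mine, not the repo's):

```python
import torch

def grad_cosine_sim(train_grads, dev_grads):
    # Flatten all per-parameter gradients into single vectors and compare direction
    t = torch.cat([g.flatten() for g in train_grads])
    d = torch.cat([g.flatten() for g in dev_grads])
    return torch.nn.functional.cosine_similarity(t, d, dim=0).item()

# Identical gradients -> similarity 1.0; with N language pairs you would fill
# an N x N matrix of such scores (hence the all_sim_list bookkeeping).
sim = grad_cosine_sim([torch.ones(2, 2)], [torch.ones(2, 2)])
```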

The generator is not reporting the score properly

It seems that if I use --sacrebleu to generate, the result is not a string but | Generate test with beam=5: <sacrebleu.metrics.bleu.BLEUScore object at 0x7fec308a75b0>. I'm not sure what causes the object to be printed instead of the score.

The code does not work with the ave type data_actor

Since I'm more interested in a one-pair setting than in multilingual input, I want the scorer to work directly on src_tokens and trg_tokens, which is the method you proposed in the DDS paper. If I interpret your code correctly, this block should never be run, right?

    # data selection: reset epoch iter to filter out unselected data
    if epoch_itr.epoch == args.select_by_dds_epoch and args.select_by_dds_epoch > 0:
        epoch_itr, _ = trainer.get_filtered_train_iterator(epoch_itr.epoch, filtered_maxpos_indices=filtered_maxpos_indices)

Since I want to work with data filtering, and I realized the base data actor only sees language IDs instead of real tokens, I have to use the ave type. To make it work, I changed your initialization steps (basically, I added elif self.args.data_actor == 'ave': and an Adam optimizer for it in your trainer.py). I'm not sure if this modification is correct, but select_by_dds_epoch works after this change. So I just want some confirmation/help from you that this is indeed the correct way to implement data filtering with the ave data actor.
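To make the question concrete, here is roughly the kind of setup I mean, as a self-contained toy (all class and variable names here are illustrative, not the repo's actual code): an embedding-averaging scorer paired with its own Adam optimizer.

```python
import torch

class AveDataActor(torch.nn.Module):
    """Toy scorer: averages source/target token embeddings into one score."""
    def __init__(self, vocab_size, dim=8):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, dim)
        self.out = torch.nn.Linear(dim, 1)

    def forward(self, src_tokens, trg_tokens):
        # Mean-pool embeddings over sentences and tokens, then squash to (0, 1)
        feats = (self.embed(src_tokens).mean(dim=(0, 1))
                 + self.embed(trg_tokens).mean(dim=(0, 1)))
        return torch.sigmoid(self.out(feats))

actor = AveDataActor(vocab_size=100)
# The change described above: a dedicated Adam optimizer for the data actor
actor_optimizer = torch.optim.Adam(actor.parameters(), lr=1e-4)
score = actor(torch.randint(0, 100, (2, 5)), torch.randint(0, 100, (2, 5)))
```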

Last question

I'm just curious what the --utility-type arg is used for. I didn't find where it's triggered when I debugged through my console. Also, could you share the training script/hyperparameters you used for DDS (Optimizing Data Usage via Differentiable Rewards)? I want to train directly on one language pair and replicate your results.

I'm really impressed by how well you modified the fairseq toolkit and incorporated the reinforcement optimization to change the data loading. If I have any misunderstanding about your methods or code implementation, please let me know. Also, please let me know if you want me to submit a pull request so you can better review my changes. Thank you in advance for your help!

cindyxinyiwang commented 4 years ago

Hey,

Thanks so much for your fix! That's very helpful.

  1. unused variables: I think these are variables I used to print out some analysis statistics.

  2. data filtering: You are right, this section of code is never called. It was implemented a while back to test some ideas we had, but I didn't finish implementing it. I'm not sure if it's what you want, but I didn't use this code for the experiments in the paper.

  3. utility-type: It selects whether the lowest-performing or the highest-performing languages are used as the dev set for multiDDS.

  4. train on one language pair: If you want to train on one language pair while still using multilingual training data, an easy thing to do is to set --eval-lang-pairs to just the one language pair you care about (that's the setting in the DDS paper). I haven't tried using a single training language pair and scoring each training example. What you are describing sounds more like the image classification experiments in the DDS paper, and we are working on releasing that code soon.

It would be great if you submitted your changes! I will include the fix in the repo.

steventan0110 commented 4 years ago

Hi,

Thank you for your clarification; I just submitted a pull request.

Just to clarify what I was trying to test: I think my idea is most similar to your script ces_eng.sh, where the training data is only one language pair, not multilingual. The script is shown below:

CUDA_VISIBLE_DEVICES=$1 python train.py data-bin/ted_ces/ \
    --task multilingual_translation \
    --arch multilingual_transformer_iwslt_de_en \
    --max-epoch 50 \
    --dataset-type "multi" \
    --lang-pairs "ces-eng" \
    --no-epoch-checkpoints \
    --distributed-world-size 1 \
    --share-all-embeddings --share-decoders --share-encoders \
    --dropout 0.3 --attention-dropout 0.3 --relu-dropout 0.3 --weight-decay 0.0 \
    --left-pad-source 'True' --left-pad-target 'False' \
    --optimizer 'adam' --adam-betas '(0.9, 0.98)' --lr-scheduler 'inverse_sqrt_decay' \
    --warmup-init-lr 1e-7 --warmup-updates 4000 --lr 2e-4 --lr-shrink 0.8 \
    --criterion 'label_smoothed_cross_entropy' --label-smoothing 0.1 \
    --max-tokens 9600 \
    --seed 2 \
    --max-source-positions 150 --max-target-positions 150 \
    --save-dir $MODEL_DIR \
    --encoder-normalize-before --decoder-normalize-before \
    --scale-norm \
    --log-interval 100 >> $MODEL_DIR/train.log 2>&1
    #--utility-type 'ave' \
    #--data-actor 'ave_emb' \
    #--data-actor-multilin \
    #--update-language-sampling 2 \
    #--data-actor-model-embed 1 \
    #--data-actor-embed-grad 0 \
    #--out-score-type 'sigmoid' \
    #--log-interval 1 \
    #--sample-instance \
I don't know if you ever ran this script, since you said "I haven't used a single training language pair and score each training example", but this script would be close to what I want to try. If I understand correctly, the update-language-sampling function updates the distribution over training languages. So if we have 8 language pairs, the data actor predicts a distribution vector of size 8, and each new training batch draws data from each language according to this distribution.
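If my reading is right, the sampling step could be sketched like this (the distribution values are made up for illustration):

```python
import torch

# Assumed semantics: the data actor outputs a probability over the N = 8
# language pairs, and each new training batch draws its language pair from it.
lang_probs = torch.tensor([0.40, 0.15, 0.10, 0.10, 0.10, 0.05, 0.05, 0.05])
lang_idx = torch.multinomial(lang_probs, num_samples=1).item()  # an index in 0..7
```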

What I was trying to do is feed src_tokens and trg_tokens for one batch (or a certain number of sentences, like 128) into the data actor. The data actor would rank them and use that ranking to filter out 20% of the training data. I think this is what you were trying to do with trainer.get_filtered_train_iterator? Can you let me know which part of it is not finished? My guess is that you didn't implement the data-actor update for this function? (I found that the data_actor_optimizer update step is only in the update-language-sampling function and the pretrain-data-actor function.)
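Concretely, the filtering step I have in mind would look roughly like this (the function name and keep ratio are my assumptions, not the repo's):

```python
import torch

def filter_by_actor_score(scores, keep_ratio=0.8):
    # Rank examples by data-actor score and keep the top keep_ratio fraction
    n_keep = max(1, int(len(scores) * keep_ratio))
    ranked = torch.argsort(scores, descending=True)
    return sorted(ranked[:n_keep].tolist())

scores = torch.tensor([0.9, 0.1, 0.5, 0.7, 0.3])  # one score per example
kept = filter_by_actor_score(scores)  # drops the lowest-scoring 20%
```

Here `kept` is `[0, 2, 3, 4]`: example 1 (score 0.1) is the bottom 20% and gets dropped; the surviving indices would then back the filtered epoch iterator.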

cindyxinyiwang commented 4 years ago

Hi,

1. The script you copied does not run with DDS or MultiDDS; it's a script for training on parallel data from one language pair. To use multiple sets of training data, you need to change --lang-pairs to the multiple language pairs you want to use.
2. That sounds like what I was trying to do with the function you pointed out. But as I said before, I didn't end up pursuing this direction, so I didn't finish implementing the data-actor update. I didn't test it extensively, so I'm not super sure the implementation is correct. Hope it helps!

steventan0110 commented 4 years ago

Hi,

Thank you for your quick reply,

  1. I know it's not using DDS or multiDDS, because the data-actor flags are commented out, so it's just plain transformer MT. I'm trying to bring a data actor (the ave_emb, lstm, or transformer actor) into this setting (only one language pair).

I understand that you didn't pursue the data-filtering approach; I might implement it based on your current code and see how it works. Thank you so much for the clarification!

cindyxinyiwang commented 4 years ago

Thanks for helping with fixing the bug in the repo! I will try to incorporate the fix. Good luck and let me know if you have more questions!