Sachin19 / seq2seq-con

Implementation of "Von Mises-Fisher Loss for Training Sequence to Sequence Models with Continuous Outputs"

Got strange results while training translation from zh to en. #12

Open EuphoriaYan opened 3 years ago

EuphoriaYan commented 3 years ago

Hi,

I'm trying to use your "langvar" branch to translate from Chinese to English, but I'm getting strange statistics and results.

Statistics:

[2021-09-01 14:16:18,792 INFO] Step 50/150000; acc:  83.52; oth_acc:   0.00; ppl:  0.00; xent: -427.01; lr: 0.00015; 2888/4087 tok/s;     90 sec
[2021-09-01 14:16:39,127 INFO] Step 100/150000; acc:  58.23; oth_acc:   0.00; ppl:  0.00; xent: -427.74; lr: 0.00030; 12860/18451 tok/s;    110 sec
[2021-09-01 14:16:59,991 INFO] Step 150/150000; acc:  55.61; oth_acc:   0.00; ppl:  0.00; xent: -427.82; lr: 0.00030; 12801/17890 tok/s;    131 sec
[2021-09-01 14:17:21,238 INFO] Step 200/150000; acc:  55.08; oth_acc:   0.00; ppl:  0.00; xent: -427.84; lr: 0.00030; 12558/17354 tok/s;    152 sec
[2021-09-01 14:17:42,579 INFO] Step 250/150000; acc:  54.64; oth_acc:   0.00; ppl:  0.00; xent: -427.86; lr: 0.00030; 12691/17512 tok/s;    174 sec

As you can see, the acc is decreasing and the perplexity is always zero.

When I use the trained model to translate, it translates every Chinese token into "the".


Below is my training process. First, I use the Moses scripts for tokenization and truecasing.

/path/to/moses/scripts/tokenizer/tokenizer.perl -l zh -a -no-escape -threads 20 < train.zh > train.tok.zh
/path/to/moses/scripts/tokenizer/tokenizer.perl -l en -a -no-escape -threads 20 < train.en > train.tok.en
#repeat similar steps for tokenizing val and test sets

/path/to/moses/scripts/recaser/train-truecaser.perl --model truecaser.model.zh --corpus train.tok.zh
/path/to/moses/scripts/recaser/train-truecaser.perl --model truecaser.model.en --corpus train.tok.en

/path/to/moses/scripts/recaser/truecase.perl --model truecaser.model.zh < train.tok.zh > train.tok.true.zh
/path/to/moses/scripts/recaser/truecase.perl --model truecaser.model.en < train.tok.en > train.tok.true.en
#repeat similar steps for truecasing val and test sets (using the same truecasing model learnt from train)

Second, I use fastBPE to learn and apply BPE.

# learning BPE codes
/path/to/fastBPE/fast learnbpe 24000 train.tok.true.zh > zh.bpecodes
/path/to/fastBPE/fast learnbpe 24000 train.tok.true.en > en.bpecodes

# applying BPE codes
/path/to/fastBPE/fast applybpe train.zh.bpetok train.tok.true.zh zh.bpecodes 
/path/to/fastBPE/fast applybpe train.en.bpetok train.tok.true.en en.bpecodes 

# get vocab
/path/to/fastBPE/fast getvocab train.zh.bpetok > vocab.zh
/path/to/fastBPE/fast getvocab train.en.bpetok > vocab.en

#repeat similar steps for the target monolingual corpus, validation set, and test set, applying the BPE codes with the corresponding vocab.lang
/path/to/fastBPE/fast applybpe valid.zh.bpetok valid.tok.true.zh zh.bpecodes vocab.zh
/path/to/fastBPE/fast applybpe valid.en.bpetok valid.tok.true.en en.bpecodes vocab.en

Third, I train fastText using the hyperparameters mentioned in #11 .

./fasttext skipgram -input valid.en.bpetok -output emb/en -dim 300 -thread 8

Fourth, I use preprocess.py to binarize the data; the only difference from the README command is that I also pass -src_vocab and -tgt_vocab.

python preprocess.py \
    -train_src train.zh.bpetok \
    -train_tgt train.en.bpetok \
    -valid_src valid.zh.bpetok \
    -valid_tgt valid.en.bpetok \
    -save_data bin \
    -tgt_emb emb/en.vec \
    -src_vocab vocab.zh \
    -tgt_vocab vocab.en \
    -src_seq_length 175 \
    -tgt_seq_length 175 \
    -num_threads 32 \
    -overwrite

Finally, I use train.py to train, with the same hyperparameters as in the README.

python train.py \
    -accum_count 2 \
    -adam_beta2 0.9995 \
    -batch_size 4000 \
    -batch_type tokens \
    -decay_method linear \
    -decoder_type transformer \
    -dropout 0.1 \
    -encoder_type transformer \
    -generator_function continuous-linear \
    -generator_layer_norm \
    -heads 4 \
    -label_smoothing 0.1 \
    -lambda_vmf 0.2 \
    -layers 6 \
    -learning_rate 1 \
    -loss nllvmf \
    -max_generator_batches 2 \
    -max_grad_norm 5 \
    -min_lr 1e-09 \
    -normalization tokens \
    -optim radam \
    -param_init 0 \
    -param_init_glorot \
    -position_encoding \
    -rnn_size 512 \
    -save_checkpoint_steps 5000 \
    -share_decoder_embeddings \
    -train_steps 150000 \
    -transformer_ff 1024 \
    -valid_batch_size 4 \
    -valid_steps 5000 \
    -warmup_end_lr 0.0007 \
    -warmup_init_lr 1e-08 \
    -warmup_steps 1 \
    -weight_decay 1e-05 \
    -word_vec_size 512 \
    -world_size 1 \
    -data bin \
    -save_model ckpts/ \
    -gpu_ranks 0

I want to know if there are any mistakes in my training process. Your response would be appreciated!

Thank you!

EuphoriaYan commented 3 years ago

Well, I found that during training, -logcmk(kappa) is always ~ -420 and never changes, while torch.log(1 + kappa) * (self.lambda_vmf - (output_emb_unitnorm * target_emb_unitnorm).sum(dim=-1)) decreases from ~ 0.5. Is this abnormal?

EuphoriaYan commented 3 years ago

I tried using -approximate_vmf in the args and found that logcmkappox(kappa, emb_size) is always ~ -690 and never changes.

Sachin19 commented 2 years ago

Hi EuphoriaYan,

Apologies for such a long delay in my reply.

As you can see, the acc is decreasing and the perplexity is always zero.

Sorry, the statistics are not named correctly; they follow the naming used for softmax-based models. "acc" here actually means cosine distance, and "xent" means the vMF loss. Perplexity is computed on top of the reported vMF loss and comes out as 0 because the vMF loss values are highly negative, so it's essentially meaningless here. The only two quantities worth monitoring are "acc" and "xent", and by the trend they look fine, since both should be decreasing. Also, if you could let me know your final validation loss on this training set, I can judge whether the model trained well or not. With good token embeddings, a cosine ("acc") value of less than around 0.25 usually results in decent MT performance (for English).
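To make the relationship concrete, here is a minimal sketch (not the repository's exact code) of how the reported numbers fit together. It treats logcmk, the vMF log-normalizer, as a given value rather than re-implementing it, and reuses the regularizer expression quoted in the comment above; the other names are illustrative.

import math
import torch

# Toy illustration of why the log prints "ppl: 0.00": the reported "xent" is
# really the NLLvMF loss, which is strongly negative, so exponentiating it
# gives a number that displays as 0.00.
xent = -427.84              # value taken from the training log above
print(math.exp(xent))       # ~1.5e-186, shown as 0.00 in the log

# Structure of the quantity being monitored, following the expression quoted
# in the earlier comment. logcmk_kappa (the vMF log-normalizer at the current
# kappa) is passed in as a tensor here instead of being computed.
def nllvmf(logcmk_kappa, kappa, cos_sim, lambda_vmf=0.2):
    # -logcmk(kappa) is large and roughly constant (~ -420 in this run);
    # the cosine-dependent second term carries the trainable signal.
    return -logcmk_kappa + torch.log(1 + kappa) * (lambda_vmf - cos_sim)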

./fasttext skipgram -input valid.en.bpetok -output emb/en -dim 300 -thread 8

You should train the embeddings on the training set (which is much larger), not the validation set; this method needs good-quality embeddings to work. If you switch the input to train.en.bpetok, you should get better results. The English token embeddings (without BPE) that I used are provided here
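If it helps, the same fix expressed via the official fastText Python bindings looks roughly like the sketch below (purely an illustrative alternative to the command line; the key change is only the input file, and the .vec writing step assumes preprocess.py expects the standard text format):

import fasttext

# Train skipgram embeddings on the BPE-tokenized *training* corpus, mirroring:
#   ./fasttext skipgram -input train.en.bpetok -output emb/en -dim 300 -thread 8
model = fasttext.train_unsupervised(
    "train.en.bpetok", model="skipgram", dim=300, thread=8
)

# Write the text-format .vec file passed to preprocess.py via -tgt_emb.
words = model.get_words()
with open("emb/en.vec", "w", encoding="utf-8") as f:
    f.write(f"{len(words)} {model.get_dimension()}\n")
    for w in words:
        vec = " ".join(f"{x:.6f}" for x in model.get_word_vector(w))
        f.write(f"{w} {vec}\n")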

/path/to/moses/scripts/tokenizer/tokenizer.perl -l zh -a -no-escape -threads 20 < train.zh > train.tok.zh

I'm not 100% sure that Moses supports Chinese tokenization; its tokenizer mostly splits on whitespace and punctuation, so raw Chinese text would not be segmented into words. This could be an issue.
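If Moses tokenization turns out to be the problem, one option (not part of this repo, purely illustrative) is to pre-segment the Chinese side with a word segmenter such as jieba before applying truecasing and BPE, e.g.:

import jieba  # external Chinese word segmenter, not part of seq2seq-con

# Hypothetical pre-segmentation step for the Chinese side: insert spaces
# between words so that downstream tools (fastBPE etc.) see word boundaries.
with open("train.zh", encoding="utf-8") as fin, \
        open("train.seg.zh", "w", encoding="utf-8") as fout:
    for line in fin:
        tokens = jieba.cut(line.strip())  # generator of word tokens
        fout.write(" ".join(tokens) + "\n")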

Hope these suggestions resolve your issues :)

Sachin