If accuracy does not go beyond 50, the likelihood is that something is wrong with your data (e.g. a misaligned corpus). Do some checks at various points in your dataset, for instance along the lines of the sketch below.
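As an illustration (not from the original thread), a minimal sanity check could verify that the source and target files are line-aligned and spot-check a few pairs for empty segments or extreme length ratios. The paths reuse the tokenized files from the config further down; the check assumes they are plain-text, whitespace-tokenized, and line-aligned:

```python
# Minimal parallel-corpus sanity check; paths taken from the config below.
src_path = "corpus/corpus_1/en-zh.en.bpe"
tgt_path = "corpus/corpus_1/en-zh.zh.jieba.bpe"

with open(src_path, encoding="utf-8") as f:
    src_lines = f.read().splitlines()
with open(tgt_path, encoding="utf-8") as f:
    tgt_lines = f.read().splitlines()

# 1. Line counts must match exactly, otherwise the corpus is misaligned.
assert len(src_lines) == len(tgt_lines), \
    f"line count mismatch: {len(src_lines)} vs {len(tgt_lines)}"

# 2. Spot-check pairs: empty segments or extreme length ratios are red flags.
for i in range(0, len(src_lines), 100000):
    s, t = src_lines[i].split(), tgt_lines[i].split()
    if not s or not t:
        print(f"line {i}: empty segment")
    elif max(len(s), len(t)) > 9 * max(1, min(len(s), len(t))):
        print(f"line {i}: suspicious length ratio ({len(s)} vs {len(t)} tokens)")
```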
It's very likely to do with this:
src_vocab: corpus/spm/en-zh.vocab.en
tgt_vocab: corpus/spm/en-zh.vocab.zh
If the vocab is not built with the same tokenization method as your training files, this will cause problems.
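A rough way to spot such a mismatch (added here as an illustration, not part of the original reply) is to measure what fraction of the tokens in the tokenized training file actually appear in the vocab file. This assumes the vocab file has one token per line, optionally followed by a count or score column:

```python
# Rough vocab/tokenization consistency check.
from collections import Counter

def load_vocab(path):
    """Load a vocab file with one token per line (extra columns ignored)."""
    vocab = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line:
                vocab.add(line.split("\t")[0].split(" ")[0])
    return vocab

vocab = load_vocab("corpus/spm/en-zh.vocab.en")

counts = Counter()
with open("corpus/corpus_1/en-zh.en.bpe", encoding="utf-8") as f:
    for line in f:
        counts.update(line.split())

covered = sum(c for tok, c in counts.items() if tok in vocab)
total = sum(counts.values())
print(f"token coverage: {covered / total:.2%}")
# Coverage far below ~100% (beyond rare OOVs) suggests the vocab was built
# with a different tokenization than the training files.
```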
I reprocessed the train/valid/test files with spaCy tokenization and retrained with OpenNMT-tf; now the results look normal. The only remaining problem is that the perplexity on the validation set only goes down to about 10 after 85,000 steps. As far as I know, the perplexity should be 1~2. The training set is about 2,000,000 lines, downloaded from WMT. I don't know what to do next.
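For reference (added as background, not a claim about what the "right" value is here), perplexity is just the exponential of the average per-token cross-entropy, so a validation perplexity of 10 corresponds to roughly 2.3 nats (about 3.3 bits) per target token:

```python
import math

# Perplexity = exp(average cross-entropy per token), so ppl 10 means:
ppl = 10.0
print(math.log(ppl))   # ~2.30 nats per token
print(math.log2(ppl))  # ~3.32 bits per token
```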
I trained on 30,000,000 parallel English and Chinese sentences; after 100,000 training steps, the prediction result:
content of config.yaml:
```yaml
# wmt14_en_de.yaml
save_data: data/wmt/run/example

# Corpus opts:
data:
    corpus_1:
        path_src: corpus/corpus_1/en-zh.en.bpe
        path_tgt: corpus/corpus_1/en-zh.zh.jieba.bpe
        transforms: [filtertoolong]
    corpus_2:
        path_src: corpus/corpus_2/en-zh.en.bpe
        path_tgt: corpus/corpus_2/en-zh.zh.jieba.bpe
        transforms: [filtertoolong]
    valid:
        path_src: corpus/valid/en-zh-dev.en.bpe
        path_tgt: corpus/valid/en-zh-dev.zh.jieba.bpe
        transforms: [sentencepiece]

subword_nbest: 1
subword_alpha: 0.0

# Filter
src_seq_length: 150
tgt_seq_length: 150

# silently ignore empty lines in the data
skip_empty_level: silent

# Vocab opts
src_vocab: corpus/spm/en-zh.vocab.en
tgt_vocab: corpus/spm/en-zh.vocab.zh
src_vocab_size: 32000
tgt_vocab_size: 32000
vocab_size_multiple: 8
src_words_min_frequency: 1
tgt_words_min_frequency: 1
share_vocab: False

# Model training parameters
# General opts
save_model: corpus/model
keep_checkpoint: 50
save_checkpoint_steps: 5000
average_decay: 0.0005
seed: 1234
report_every: 100
train_steps: 200000
valid_steps: 5000
train_from: corpus/model_step_100000.pt

# Batching
queue_size: 1024
bucket_size: 32768
pool_factor: 8192
world_size: 1
gpu_ranks: [0]
batch_type: "tokens"
batch_size: 4096
valid_batch_size: 16
batch_size_multiple: 1
max_generator_batches: 0
accum_count: [3]
accum_steps: [0]

# Optimization
model_dtype: "fp32"
optim: "adam"
learning_rate: 2
warmup_steps: 6000
decay_method: "noam"
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"

# Model
encoder_type: transformer
decoder_type: transformer
enc_layers: 6
dec_layers: 6
heads: 8
rnn_size: 512
word_vec_size: 512
transformer_ff: 2048
dropout_steps: [0]
dropout: [0.1]
attention_dropout: [0.1]
share_decoder_embeddings: true
share_embeddings: false
position_encoding: true
```
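As a side note (a small sketch, not part of the original post), it can help to load the YAML and print the vocab- and subword-related options next to the per-corpus transforms, since the mismatch discussed above lives entirely in these fields:

```python
import yaml  # pip install pyyaml

with open("config.yaml", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

# Options that must agree with how the *.bpe files were produced.
for key in ("src_vocab", "tgt_vocab", "src_vocab_size", "tgt_vocab_size", "share_vocab"):
    print(key, "=", cfg.get(key))

# Per-corpus paths and transforms.
for name, corpus in cfg.get("data", {}).items():
    print(name, corpus.get("transforms"), corpus.get("path_src"))
```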
Note: when the 100,000 training steps finished, the accuracy was only 50. The following is the training file content: