If accuracy does not go beyond 50, the likelihood is that something is wrong with your data (e.g. a misaligned corpus). Do some checks at various points in your dataset, for instance along the lines of the sketch below.
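As an illustration (not from the original thread), a minimal sanity check could verify that the source and target files are line-aligned and spot-check a few pairs for empty segments or extreme length ratios. The paths reuse the tokenized files from the config further down; the check assumes they are plain-text, whitespace-tokenized, and line-aligned:

```python
# Minimal parallel-corpus sanity check; paths taken from the config below.
src_path = "corpus/corpus_1/en-zh.en.bpe"
tgt_path = "corpus/corpus_1/en-zh.zh.jieba.bpe"

with open(src_path, encoding="utf-8") as f:
    src_lines = f.read().splitlines()
with open(tgt_path, encoding="utf-8") as f:
    tgt_lines = f.read().splitlines()

# 1. Line counts must match exactly, otherwise the corpus is misaligned.
assert len(src_lines) == len(tgt_lines), \
    f"line count mismatch: {len(src_lines)} vs {len(tgt_lines)}"

# 2. Spot-check pairs: empty segments or extreme length ratios are red flags.
for i in range(0, len(src_lines), 100000):
    s, t = src_lines[i].split(), tgt_lines[i].split()
    if not s or not t:
        print(f"line {i}: empty segment")
    elif max(len(s), len(t)) > 9 * max(1, min(len(s), len(t))):
        print(f"line {i}: suspicious length ratio ({len(s)} vs {len(t)} tokens)")
```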
It's very likely to do with this:
src_vocab: corpus/spm/en-zh.vocab.en
tgt_vocab: corpus/spm/en-zh.vocab.zh
If the vocab is not built with the same tokenization method as your training files, this will cause problems.
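A rough way to spot such a mismatch (added here as an illustration, not part of the original reply) is to measure what fraction of the tokens in the tokenized training file actually appear in the vocab file. This assumes the vocab file has one token per line, optionally followed by a count or score column:

```python
# Rough vocab/tokenization consistency check.
from collections import Counter

def load_vocab(path):
    """Load a vocab file with one token per line (extra columns ignored)."""
    vocab = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line:
                vocab.add(line.split("\t")[0].split(" ")[0])
    return vocab

vocab = load_vocab("corpus/spm/en-zh.vocab.en")

counts = Counter()
with open("corpus/corpus_1/en-zh.en.bpe", encoding="utf-8") as f:
    for line in f:
        counts.update(line.split())

covered = sum(c for tok, c in counts.items() if tok in vocab)
total = sum(counts.values())
print(f"token coverage: {covered / total:.2%}")
# Coverage far below ~100% (beyond rare OOVs) suggests the vocab was built
# with a different tokenization than the training files.
```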
I reprocessed the train/valid/test files with spaCy tokenization and retrained with OpenNMT-tf; now the results look normal. The only remaining problem is that the perplexity on the validation set only goes down to about 10 after 85,000 steps. As far as I know, the perplexity should be 1~2. The training set is about 2,000,000 lines, downloaded from WMT. I don't know what to do next.
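For reference (added as background, not a claim about what the "right" value is here), perplexity is just the exponential of the average per-token cross-entropy, so a validation perplexity of 10 corresponds to roughly 2.3 nats (about 3.3 bits) per target token:

```python
import math

# Perplexity = exp(average cross-entropy per token), so ppl 10 means:
ppl = 10.0
print(math.log(ppl))   # ~2.30 nats per token
print(math.log2(ppl))  # ~3.32 bits per token
```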
I trained on 30,000,000 parallel English and Chinese sentences; after 100,000 training steps, the prediction result:
content of config.yaml:
```yaml
# wmt14_en_de.yaml
save_data: data/wmt/run/example

# Corpus opts:
data:
    corpus_1:
        path_src: corpus/corpus_1/en-zh.en.bpe
        path_tgt: corpus/corpus_1/en-zh.zh.jieba.bpe
        transforms: [filtertoolong]
    corpus_2:
        path_src: corpus/corpus_2/en-zh.en.bpe
        path_tgt: corpus/corpus_2/en-zh.zh.jieba.bpe
        transforms: [filtertoolong]
    valid:
        path_src: corpus/valid/en-zh-dev.en.bpe
        path_tgt: corpus/valid/en-zh-dev.zh.jieba.bpe
        transforms: [sentencepiece]

subword_nbest: 1
subword_alpha: 0.0

# Filter
src_seq_length: 150
tgt_seq_length: 150

# silently ignore empty lines in the data
skip_empty_level: silent

# Vocab opts
src_vocab: corpus/spm/en-zh.vocab.en
tgt_vocab: corpus/spm/en-zh.vocab.zh
src_vocab_size: 32000
tgt_vocab_size: 32000
vocab_size_multiple: 8
src_words_min_frequency: 1
tgt_words_min_frequency: 1
share_vocab: False

# Model training parameters
# General opts
save_model: corpus/model
keep_checkpoint: 50
save_checkpoint_steps: 5000
average_decay: 0.0005
seed: 1234
report_every: 100
train_steps: 200000
valid_steps: 5000
train_from: corpus/model_step_100000.pt

# Batching
queue_size: 1024
bucket_size: 32768
pool_factor: 8192
world_size: 1
gpu_ranks: [0]
batch_type: "tokens"
batch_size: 4096
valid_batch_size: 16
batch_size_multiple: 1
max_generator_batches: 0
accum_count: [3]
accum_steps: [0]

# Optimization
model_dtype: "fp32"
optim: "adam"
learning_rate: 2
warmup_steps: 6000
decay_method: "noam"
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"

# Model
encoder_type: transformer
decoder_type: transformer
enc_layers: 6
dec_layers: 6
heads: 8
rnn_size: 512
word_vec_size: 512
transformer_ff: 2048
dropout_steps: [0]
dropout: [0.1]
attention_dropout: [0.1]
share_decoder_embeddings: true
share_embeddings: false
position_encoding: true
```
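As a side note (a small sketch, not part of the original post), it can help to load the YAML and print the vocab- and subword-related options next to the per-corpus transforms, since the mismatch discussed above lives entirely in these fields:

```python
import yaml  # pip install pyyaml

with open("config.yaml", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

# Options that must agree with how the *.bpe files were produced.
for key in ("src_vocab", "tgt_vocab", "src_vocab_size", "tgt_vocab_size", "share_vocab"):
    print(key, "=", cfg.get(key))

# Per-corpus paths and transforms.
for name, corpus in cfg.get("data", {}).items():
    print(name, corpus.get("transforms"), corpus.get("path_src"))
```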
Note: when the 100,000 training steps finished, the accuracy was only 50. The following is the training file content: