OpenNMT / OpenNMT-py

Open Source Neural Machine Translation and (Large) Language Models in PyTorch
https://opennmt.net/
MIT License

Preprocess:tgt_vocab_size is always four #1961

Closed chens72 closed 3 years ago

chens72 commented 3 years ago

When preprocessing, tgt_vocab_size is always 4, and src_vocab_size is always 2. I don't know what caused this, because it was not like this before.

The preprocessing commands are as follows:

python preprocess.py \
    -train_src data/QGprocess/train-src-ans-features-pos.txt \
    -train_tgt data/QGprocess/train-target.txt \
    -valid_src data/QGprocess/dev-src-ans-features-pos.txt \
    -valid_tgt data/QGprocess/dev-target.txt \
    -save_data data/QGprocess-features-pos \
    -dynamic_dict \
    -lower \
    -overwrite \
    -src_seq_length 5000 \
    -tgt_seq_length 200

The output is as follows:

[2020-12-10 01:52:57,466 INFO] Extracting features...
[2020-12-10 01:52:58,125 INFO] number of source features: 2.
[2020-12-10 01:52:58,125 INFO] number of target features: 0.
[2020-12-10 01:52:58,125 INFO] Building Fields object...
[2020-12-10 01:52:58,126 INFO] Building & saving training data...
[2020-12-10 01:52:58,128 WARNING] Shards for corpus train already exist, will be overwritten because -overwrite option is set.
[2020-12-10 01:52:58,135 WARNING] Overwrite shards for corpus None
[2020-12-10 01:53:00,268 INFO] Building shard 0.
[2020-12-10 01:53:06,547 INFO] saving 0th train data shard to data/QGprocess-features-pos.train.0.pt.
[2020-12-10 01:53:11,137 INFO] tgt vocab size: 4.
[2020-12-10 01:53:11,137 INFO] src vocab size: 2.
[2020-12-10 01:53:11,138 INFO] src_feat_0 vocab size: 2.
[2020-12-10 01:53:11,138 INFO] src_feat_1 vocab size: 2.
[2020-12-10 01:53:11,332 INFO] Building & saving validation data...
[2020-12-10 01:53:11,333 WARNING] Shards for corpus valid already exist, will be overwritten because -overwrite option is set.
[2020-12-10 01:53:11,339 WARNING] Overwrite shards for corpus None
[2020-12-10 01:53:12,281 INFO] Building shard 0.
[2020-12-10 01:53:13,040 INFO] saving 0th valid data shard to data/QGprocess-features-pos.valid.0.pt.

It seems that the tokens in the corpus were not counted when the vocabulary was built.
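One way to check that the corpus itself contains far more distinct tokens than the reported vocabulary sizes is to count them independently of the preprocessing script. The following is only a minimal sketch, not part of OpenNMT-py: the helper name count_tokens is hypothetical, it assumes one sentence per line with whitespace-separated tokens, and it mimics -lower by lowercasing each line.

```python
from collections import Counter


def count_tokens(lines, lower=True):
    """Count whitespace-separated tokens over an iterable of text lines.

    Hypothetical helper for a sanity check; it does not reproduce
    OpenNMT-py's own vocabulary-building logic.
    """
    counter = Counter()
    for line in lines:
        if lower:
            line = line.lower()
        counter.update(line.split())
    return counter


# Usage (the path is the one from the command above; adjust as needed):
# with open("data/QGprocess/train-target.txt", encoding="utf-8") as f:
#     counts = count_tokens(f)
# print(len(counts), counts.most_common(10))
```

If the number of distinct tokens printed this way is large while preprocessing still reports a target vocabulary of 4, the data file is probably fine and the problem lies in how the vocabulary is built.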

francoishernandez commented 3 years ago

This is strange.

chens72 commented 3 years ago

train-target.txt:

what are the only teo islands belonging to the british isles
ireland became an old reock in 12 , 000 bc but was n't inhabited until when ?
the pictish tribe of southern ireland inhabited the islands when ?
what islands were descovered in 150 ad ?
john doe cites the earliest known use of brytish iles as occurring in what year ?
what formed before the craton baltica and avalonia collision ?
at 9 , 000 feet tall , what is the highest point on the islands ?
the beginning of the last ice age occurred when ?
during the briton empire , tribes spoke which dialect ?
the northern third of ireland was inhabited by picts , while the southern two thirds by who ?

chens72 commented 3 years ago

When I used the following command for preprocessing, the problem was solved.

onmt_preprocess \
    -train_src data/Features/train-src-ans-features.txt \
    -train_tgt data/Features/train-target.txt \
    -valid_src data/Features/dev-src-ans-features.txt \
    -valid_tgt data/Features/dev-target.txt \
    -save_data data/Features/QGprocess-features \
    -dynamic_dict \
    -share_vocab \
    -lower \
    -overwrite \
    -src_seq_length 3500 \
    -tgt_seq_length 200