Closed WorldWarII closed 3 years ago
I noticed that in Preprocessing step 2, Fairseq Processing, we use this command:
python preprocess.py --only-source --trainpref /path/to/train.txt --validpref /path/to/valid.txt --srcdict /path/to/dict.txt --destdir /path/to/destination_dir --padding-factor 1 --workers 48
Excuse me, but I don't understand what goes into --trainpref and --validpref. What is the relation between the output of bpe_tokenize.py and the input of preprocess.py? Do I need to split the output of bpe_tokenize.py into two parts (train and valid)?
Right. Alternatively, you could split the original corpus into two parts first and then run bpe_tokenize.py over each of them.
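For what it's worth, a minimal sketch of the first approach (split the BPE output, then feed the two files to preprocess.py). The file names here are hypothetical, and the toy corpus stands in for the real output of bpe_tokenize.py; the 1000-line validation size is an arbitrary choice:

```shell
# Simulate a BPE-tokenized corpus (one sentence per line) -- in practice
# this file is the output of bpe_tokenize.py.
seq 1 10000 | sed 's/^/token line /' > corpus.bpe.txt

# Hold out the last 1000 lines as the validation set; the rest is training.
head -n -1000 corpus.bpe.txt > train.txt
tail -n  1000 corpus.bpe.txt > valid.txt
```

train.txt and valid.txt are then what --trainpref and --validpref point at. (Note: `head -n -NUM` requires GNU coreutils; on BSD/macOS you would split differently, e.g. with `split` or `awk`.)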