facebookresearch / SpanBERT

Code for using and evaluating SpanBERT.

What's in the train.txt and valid.txt files used during preprocessing? #70

Closed WorldWarII closed 3 years ago

WorldWarII commented 3 years ago

I noticed that in step 2 of preprocessing (Fairseq Processing), we use this command:

python preprocess.py --only-source --trainpref /path/to/train.txt --validpref path/to/valid.txt --srcdict /path/to/dict.txt --destdir /path/to/destination_dir --padding-factor 1 --workers 48

Excuse me, but I don't understand what --trainpref and --validpref should point to. What is the relation between the output of bpe_tokenize.py and the input of preprocess.py? Do I need to split the output of bpe_tokenize.py into two parts (train and valid)?

mandarjoshi90 commented 3 years ago

Right. You could also split the original corpus into two parts and then run bpe_tokenize over each of them.
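
For anyone landing here later, the split described above could be sketched roughly as follows. This is only an illustration, not code from the SpanBERT repo: the helper names `split_corpus` and `split_file`, the 5% held-out fraction, and the choice to hold out the tail of the file are all assumptions. It also assumes a line-level split is acceptable; if your corpus groups several lines into one document, you would probably want to cut at a document boundary instead.

```python
def split_corpus(lines, valid_fraction=0.05):
    """Split a list of lines into (train, valid), holding out the tail.

    Hypothetical helper: the 5% default is an arbitrary assumption,
    not a value taken from the SpanBERT instructions.
    """
    if len(lines) < 2:
        raise ValueError("need at least two lines to make a split")
    # Hold out at least one line, but always leave at least one for training.
    n_valid = max(1, int(len(lines) * valid_fraction))
    n_valid = min(n_valid, len(lines) - 1)
    return lines[:-n_valid], lines[-n_valid:]


def split_file(src_path, train_path, valid_path, valid_fraction=0.05):
    """Read src_path and write the two parts to train_path and valid_path."""
    with open(src_path, encoding="utf-8") as f:
        lines = f.readlines()
    train, valid = split_corpus(lines, valid_fraction)
    with open(train_path, "w", encoding="utf-8") as f:
        f.writelines(train)
    with open(valid_path, "w", encoding="utf-8") as f:
        f.writelines(valid)
```

Calling `split_file` on either the raw corpus or the bpe_tokenize.py output would give two files; the tokenized versions are what --trainpref and --validpref are then pointed at. Holding out a contiguous tail rather than shuffling keeps neighbouring lines together, which seemed the safer default here.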