Hi @shizhediao,
You can run `process_data.py` in parallel, one instance on Wikipedia and another on BookCorpus. You don't need to combine them; when running the sharding script (`shard_data.py`), you provide the path to a directory containing all the corpus text files, and the sharding process creates shards with samples from each dataset.
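As a rough sketch, the two-step flow could look like the following. The paths are placeholders, and the flags reflect my reading of the repo's README, so treat them as assumptions and check them against your copy:

```bash
# Preprocess each corpus independently; the two runs can go in parallel.
# The -f/-o/--type arguments are assumed from the README, not verified here.
python process_data.py -f path/to/enwiki-latest-pages-articles.xml -o data/ --type wiki &
python process_data.py -f path/to/bookcorpus/ -o data/ --type bookcorpus &
wait

# Shard once, pointing at the directory that now holds both processed text files.
python shard_data.py --dir data/ -o shards/ \
    --num_train_shards 256 --num_test_shards 128 --frac_test 0.1
```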
Thanks!
I found that running `process_data.py` on the wiki dump produces output like the attached figure. Is the `enwiki.processed/wiki` directory useful for anything?
Is it OK if I just put `bookcorpus_one_article_per_line.txt` and `wiki_one_article_per_line.txt` into the same directory and then run `shard_data.py`?
You only need to provide the `*_one_article_per_line.txt` files in a single directory for `shard_data.py`.
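Concretely, that could be as simple as the following (the directory names are arbitrary; the file names come from this thread):

```bash
# Gather the two processed corpus files into a single directory.
mkdir -p corpus_txt
mv wiki_one_article_per_line.txt bookcorpus_one_article_per_line.txt corpus_txt/

# shard_data.py then reads every text file in that directory;
# the remaining flags are omitted here and assumed per the repo README.
python shard_data.py --dir corpus_txt -o shards/
```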
OK, thanks!
I found this in the dataset description: "Use process_data.py for pre-processing wikipedia/bookcorpus datasets into a single text file." What if I want to process both datasets at the same time? At which step should I combine them? Thanks!