IntelLabs / academic-budget-bert

Repository containing the code for the paper "How to Train BERT with an Academic Budget"
Apache License 2.0

How to combine wiki and bookcorpus into one file? #20

Closed · shizhediao closed 2 years ago

shizhediao commented 2 years ago

I found in the dataset description that we can use process_data.py for pre-processing the Wikipedia/BookCorpus datasets into a single text file.

What if I want to process these two datasets at the same time? At which step should I combine them? Thanks!

peteriz commented 2 years ago

Hi @shizhediao You can run process_data.py in parallel: one run on wiki and another on bookcorpus. You don't need to combine them; when running the sharding script (shard_data.py), you provide a path to a directory containing all corpus text files. The sharding process then creates shards with samples from each dataset.
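
For concreteness, here is a minimal sketch of the two parallel runs, assuming the raw wiki dump and BookCorpus files live under /data. The -f/-o/--type flags are assumptions based on the repo README, so verify the exact interface with python process_data.py --help:

```bash
# Pre-process the two corpora independently; the runs can happen in parallel.
# The -f/-o/--type flags are assumptions from the README; confirm with:
#   python process_data.py --help
python process_data.py -f /data/enwiki-latest-pages-articles.xml -o /data/enwiki.processed --type wiki &
python process_data.py -f /data/bookcorpus -o /data/bookcorpus.processed --type bookcorpus &
wait
```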

shizhediao commented 2 years ago

Thanks! I found that the result of running process_data.py for wiki looks like the figure below. I was wondering: is the enwiki.processed/wiki directory still needed? Is it OK if I just put bookcorpus_one_article_per_line.txt and wiki_one_article_per_line.txt into the same directory and then run shard_data.py?

[screenshot: listing of the wiki output directory produced by process_data.py]

peteriz commented 2 years ago

You only need to provide the *_one_article_per_line.txt files in one directory for shard_data.py.
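
As a sketch of that step, assuming the output paths from the runs above (all paths, flag names, and shard counts here are illustrative and taken from the README; verify against python shard_data.py --help):

```bash
# Collect the per-corpus text files into one directory (paths are illustrative).
mkdir -p /data/corpus
cp /data/enwiki.processed/wiki_one_article_per_line.txt /data/corpus/
cp /data/bookcorpus.processed/bookcorpus_one_article_per_line.txt /data/corpus/

# Shard the combined directory; each shard will mix samples from both corpora.
# Flag names and shard counts are assumptions; adjust to your setup.
python shard_data.py \
    --dir /data/corpus \
    -o /data/shards \
    --num_train_shards 256 \
    --num_test_shards 128 \
    --frac_test 0.1
```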

shizhediao commented 2 years ago

OK, thanks!