NVIDIA / DeepLearningExamples

State-of-the-Art Deep Learning scripts organized by models - easy to train and deploy with reproducible accuracy and performance on enterprise-grade infrastructure.

[Bert/Pytorch] Difference between data_download.sh and create_dataset_from scratch.sh #1326

Open wormyu opened 12 months ago

wormyu commented 12 months ago

Related to Bert/Pytorch

Describe the bug This is not a bug but a question: what is the difference between data_download.sh and create_dataset_from scratch.sh? README.md recommends create_dataset_from scratch.sh for downloading and preprocessing the data, and it does not mention how to use data_download.sh.

As I understand it, in addition to downloading Wikipedia, data_download.sh also downloads BookCorpus for pre-training. So why is data_download.sh not the recommended way to prepare data for pre-training?