NVIDIA / DeepLearningExamples

State-of-the-Art Deep Learning scripts organized by models - easy to train and deploy with reproducible accuracy and performance on enterprise-grade infrastructure.

[PyTorch/LanguageModeling/BERT] BookCorpus Data Download - HTTPError: HTTP Error 403: Forbidden #536

Closed paulhendricks closed 4 years ago

paulhendricks commented 4 years ago

Related to Model/Framework(s)

PyTorch/LanguageModeling/BERT

Describe the bug

BookCorpus is no longer available from Smashwords.

To Reproduce

The following works perfectly.

git clone https://github.com/NVIDIA/DeepLearningExamples.git
cd DeepLearningExamples
cd PyTorch/LanguageModeling/BERT
bash scripts/docker/build.sh
bash scripts/docker/launch.sh

However, errors start here:

bash data/create_datasets_from_start.sh
root@dgxstation:/workspace/bert# bash data/create_datasets_from_start.sh
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
Working Directory: /workspace/bert/data
Action: download
Dataset Name: bookscorpus

Directory Structure:
{ 'download': '/workspace/bert/data/download',
  'extracted': '/workspace/bert/data/extracted',
  'formatted': '/workspace/bert/data/formatted_one_article_per_line',
  'hdf5': '/workspace/bert/data/hdf5_lower_case_1_seq_len_512_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5',
  'sharded': '/workspace/bert/data/sharded_training_shards_256_test_shards_256_fraction_0.2',
  'tfrecord': '/workspace/bert/data/tfrecord_lower_case_1_seq_len_512_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5'}

0 files had already been saved in /workspace/bert/data/download/bookscorpus.
Failed to open https://www.smashwords.com/books/download/246580/6/latest/0/0/silence.txt
HTTPError: HTTP Error 403: Forbidden
Failed to open https://www.smashwords.com/books/download/246580/6/latest/0/0/silence.txt
HTTPError: HTTP Error 403: Forbidden
Failed to open https://www.smashwords.com/books/download/246580/6/latest/0/0/silence.txt
HTTPError: HTTP Error 403: Forbidden
 Gave up to open https://www.smashwords.com/books/download/246580/6/latest/0/0/silence.txt
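
For reference, the block seems reproducible outside the repo's downloader. A plain HTTP request to the same URL (using curl here, which is not part of the original report) should show the same status while Smashwords is refusing requests:

# Hypothetical check: print only the HTTP status code for one of the failing book URLs
curl -sS -o /dev/null -w "%{http_code}\n" \
  https://www.smashwords.com/books/download/246580/6/latest/0/0/silence.txt
# 403 expected while the crawler block is in place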

Expected behavior

BookCorpus should download. This looks similar to:

Looks like https://www.smashwords.com/ has stepped up its anti-crawling measures. In fact, after attempting the download my IP address is now blocked from their website. Users should be warned of this before being asked to download the BookCorpus dataset, lest they get banned without realizing the consequences.

(Screenshot attached)

Environment

Please provide at least:

swethmandava commented 4 years ago

You can just ignore the bookscorpus files that are missing. They don't exist anymore on the web.

#247 #262
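
If the prep script aborts rather than skipping, one way to act on this suggestion (a sketch only; the exact lines in data/create_datasets_from_start.sh vary between releases, and the regex below is an assumption about their shape) is to comment out the bookscorpus steps before rerunning:

# Hypothetical edit: comment out any bertPrep.py invocations that target bookscorpus
sed -i 's/^\(python3 .*bertPrep\.py.*--dataset bookscorpus.*\)$/# \1/' \
  data/create_datasets_from_start.sh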

vilmara commented 4 years ago

Hi @swethmandava, the script run_pretraining_lamb.sh throws a bunch of errors because it still references these datasets. Is there another script to run BERT pre-training with the datasets that are still available?

swethmandava commented 4 years ago

Could you open another bug with details of your errors? @vilmara

vilmara commented 4 years ago

I have found how to work with only the English Wikipedia dataset, which is still available, ignoring the BookCorpus dataset.
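
For anyone hitting the same problem, a sketch of what a Wikipedia-only prep run might look like (assuming bertPrep.py in your release accepts wikicorpus_en as a standalone dataset name and these flags; data/create_datasets_from_start.sh remains the authoritative reference):

# Download and reformat only the English Wikipedia dump
python3 /workspace/bert/data/bertPrep.py --action download --dataset wikicorpus_en
python3 /workspace/bert/data/bertPrep.py --action text_formatting --dataset wikicorpus_en

# Shard and build HDF5 files from wikicorpus_en instead of books_wiki_en_corpus
python3 /workspace/bert/data/bertPrep.py --action sharding --dataset wikicorpus_en
python3 /workspace/bert/data/bertPrep.py --action create_hdf5_files --dataset wikicorpus_en \
    --max_seq_length 128 --max_predictions_per_seq 20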