NVIDIA / DeepLearningExamples

State-of-the-Art Deep Learning scripts organized by models - easy to train and deploy with reproducible accuracy and performance on enterprise-grade infrastructure.

[PyTorch/LanguageModeling/BERT] BookCorpus Data Download - HTTPError: HTTP Error 403: Forbidden #536

Closed paulhendricks closed 4 years ago

paulhendricks commented 4 years ago

Related to Model/Framework(s)

PyTorch/LanguageModeling/BERT

Describe the bug

BookCorpus is no longer available from Smashwords.

To Reproduce

The following works perfectly.

git clone https://github.com/NVIDIA/DeepLearningExamples.git
cd DeepLearningExamples
cd PyTorch/LanguageModeling/BERT
bash scripts/docker/build.sh
bash scripts/docker/launch.sh

However, errors start here:

bash data/create_datasets_from_start.sh
root@dgxstation:/workspace/bert# bash data/create_datasets_from_start.sh
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
Working Directory: /workspace/bert/data
Action: download
Dataset Name: bookscorpus

Directory Structure:
{ 'download': '/workspace/bert/data/download',
  'extracted': '/workspace/bert/data/extracted',
  'formatted': '/workspace/bert/data/formatted_one_article_per_line',
  'hdf5': '/workspace/bert/data/hdf5_lower_case_1_seq_len_512_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5',
  'sharded': '/workspace/bert/data/sharded_training_shards_256_test_shards_256_fraction_0.2',
  'tfrecord': '/workspace/bert/data/tfrecord_lower_case_1_seq_len_512_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5'}

0 files had already been saved in /workspace/bert/data/download/bookscorpus.
Failed to open https://www.smashwords.com/books/download/246580/6/latest/0/0/silence.txt
HTTPError: HTTP Error 403: Forbidden
Failed to open https://www.smashwords.com/books/download/246580/6/latest/0/0/silence.txt
HTTPError: HTTP Error 403: Forbidden
Failed to open https://www.smashwords.com/books/download/246580/6/latest/0/0/silence.txt
HTTPError: HTTP Error 403: Forbidden
 Gave up to open https://www.smashwords.com/books/download/246580/6/latest/0/0/silence.txt
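
For reference, the block seems reproducible outside the repo's downloader. A plain HTTP request to the same URL (using curl here, which is not part of the original report) should show the same status while Smashwords is refusing requests:

# Hypothetical check: print only the HTTP status code for one of the failing book URLs
curl -sS -o /dev/null -w "%{http_code}\n" \
  https://www.smashwords.com/books/download/246580/6/latest/0/0/silence.txt
# 403 expected while the crawler block is in place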

Expected behavior

BookCorpus should download. This looks similar to:

Looks like https://www.smashwords.com/ has stepped up its anti-crawling measures. In fact, after attempting the download my IP address is now blocked from their website. Users should be warned of this before being asked to download the BookCorpus dataset, lest they get banned without realizing the consequences.

(Screenshot attached)

Environment

Please provide at least:

swethmandava commented 4 years ago

You can just ignore the bookscorpus files that are missing. They don't exist anymore on the web.

#247 #262
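
If the prep script aborts rather than skipping, one way to act on this suggestion (a sketch only; the exact lines in data/create_datasets_from_start.sh vary between releases, and the regex below is an assumption about their shape) is to comment out the bookscorpus steps before rerunning:

# Hypothetical edit: comment out any bertPrep.py invocations that target bookscorpus
sed -i 's/^\(python3 .*bertPrep\.py.*--dataset bookscorpus.*\)$/# \1/' \
  data/create_datasets_from_start.sh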

vilmara commented 4 years ago

Hi @swethmandava, the script run_pretraining_lamb.sh throws a bunch of errors because it still references these datasets. Is there another script to run BERT pre-training with the datasets that are still available?

swethmandava commented 4 years ago

Could you open another bug with details of your errors? @vilmara

vilmara commented 4 years ago

I have found how to work with only the English Wikipedia dataset, which is still available, ignoring the BookCorpus dataset.
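
For anyone hitting the same problem, a sketch of what a Wikipedia-only prep run might look like (assuming bertPrep.py in your release accepts wikicorpus_en as a standalone dataset name and these flags; data/create_datasets_from_start.sh remains the authoritative reference):

# Download and reformat only the English Wikipedia dump
python3 /workspace/bert/data/bertPrep.py --action download --dataset wikicorpus_en
python3 /workspace/bert/data/bertPrep.py --action text_formatting --dataset wikicorpus_en

# Shard and build HDF5 files from wikicorpus_en instead of books_wiki_en_corpus
python3 /workspace/bert/data/bertPrep.py --action sharding --dataset wikicorpus_en
python3 /workspace/bert/data/bertPrep.py --action create_hdf5_files --dataset wikicorpus_en \
    --max_seq_length 128 --max_predictions_per_seq 20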