NVIDIA / DeepLearningExamples


[BERT/TensorFlow] how to verify downloaded datasets #580

Closed vilmara closed 4 years ago

vilmara commented 4 years ago

Related to Model/Framework(s) [BERT/TensorFlow]

Describe the bug
I am trying to download the datasets but am getting the errors below. How can I verify that the datasets were downloaded completely and correctly? And what about the Wikipedia and BookCorpus datasets needed to train the BERT-large model from scratch?

Dataset Name: bookscorpus
Gave up to open ...
local variable 'response' referenced before assignment
Failed to open https://www.smashwords.com/books/download/
HTTPError: HTTP Error 403: Forbidden
.
.
.
ValueError: /workspace/bert/data/sharded/books_wiki_en_corpus/test/books_wiki_en_corpus_test_1471.txt is not a valid path
Traceback (most recent call last):
  File "/workspace/bert/utils/create_pretraining_data.py", line 505, in <module>
    main()
  File "/workspace/bert/utils/create_pretraining_data.py", line 486, in main
    raise ValueError("{} is not a valid path".format(args.input_file))
ValueError: /workspace/bert/data/sharded/books_wiki_en_corpus/test/books_wiki_en_corpus_test_1469.txt is not a valid path

And when running run_pretraining_lamb.sh, the sharded directory is empty:

$ bash scripts/run_pretraining_lamb.sh 64 8 8 7.5e-4 5e-4 fp16 true 4 2000 200 7820 100 128 512 large

  File "/workspace/bert/run_pretraining.py", line 569, in main
    raise ValueError("Input Files must be sharded")
ValueError: Input Files must be sharded
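A quick way to confirm whether the sharded inputs exist before launching pretraining is a check like the sketch below (only the .../sharded/books_wiki_en_corpus/test path is taken from the traceback above; the rest of the layout is assumed):

```bash
# Sanity check (sketch): count the shard files the pretraining script expects.
# Only the .../sharded/books_wiki_en_corpus/test path comes from the traceback;
# other directory names are assumptions about the layout.
SHARD_DIR=/workspace/bert/data/sharded/books_wiki_en_corpus
find "$SHARD_DIR" -name '*.txt' | wc -l   # should be non-zero before pretraining
ls "$SHARD_DIR"/test | head               # e.g. books_wiki_en_corpus_test_*.txt
```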

To Reproduce
bash scripts/data_download.sh

Expected behavior
Ignoring the HTTP 403 errors, are these the expected datasets and sizes after downloading the data?

89G     ./download/wikicorpus_en
34M     ./download/squad/v1.1
45M     ./download/squad/v2.0
78M     ./download/squad
480K    ./download/CoLA/original/raw
492K    ./download/CoLA/original/tokenized
976K    ./download/CoLA/original
1.5M    ./download/CoLA
906M    ./download/MNLI/original
1.4G    ./download/MNLI
4.0K    ./download/bookscorpus
2.9M    ./download/MRPC
417M    ./download/google_pretrained_weights/cased_L-12_H-768_A-12
394M    ./download/google_pretrained_weights/chinese_L-12_H-768_A-12
1.3G    ./download/google_pretrained_weights/uncased_L-24_H-1024_A-16
422M    ./download/google_pretrained_weights/uncased_L-12_H-768_A-12
1.3G    ./download/google_pretrained_weights/cased_L-24_H-1024_A-16
684M    ./download/google_pretrained_weights/multi_cased_L-12_H-768_A-12
643M    ./download/google_pretrained_weights/multilingual_L-12_H-768_A-12
9.7G    ./download/google_pretrained_weights
101G    ./download
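For reference, the listing above can be reproduced with something like the sketch below (the working directory is assumed to be /workspace/bert/data, based on the paths in the logs):

```bash
# Reproduce the per-directory size listing above (sketch; working directory assumed).
cd /workspace/bert/data
du -h ./download          # sizes of every subdirectory under ./download
du -sh ./download         # total size (~101G reported above)
```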


DavidLangworthy commented 4 years ago

I am running into similar issues. Is there some trick to getting the bookscorpus to download?

swethmandava commented 4 years ago

You can just ignore the bookscorpus files that throw errors when downloading. They don't exist anymore on the web.

#247 #262

ValueError: /workspace/bert/data/sharded/books_wiki_en_corpus/test/books_wiki_en_corpus_test_1471.txt is not a valid path

This seems to indicate that data_download exited with errors before it was able to create the tfrecords/hdf5 files. Can you post the full error log? Can you also post the permissions of the files under /workspace/bert/data?
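A sketch of commands that would capture the requested information (paths assume the /workspace/bert layout from the tracebacks; the log file name is only an example):

```bash
# Collect the permissions and a full error log (sketch).
ls -laR /workspace/bert/data/download | head -n 100
ls -laR /workspace/bert/data/sharded 2>/dev/null | head -n 100
# Rerun the download from /workspace/bert and keep the complete log:
bash scripts/data_download.sh 2>&1 | tee data_download.log
```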

vilmara commented 4 years ago

Hi @swethmandava / @nvcforster, all files throw HTTPError: HTTP Error 403: Forbidden. The script can't create the tfrecords because the data path is empty. See the attached error_log.txt.


swethmandava commented 4 years ago

You can ignore bookscorpus by commenting this out.
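(For reference, a minimal sketch of the kind of change meant here, assuming the bookscorpus step in data/create_datasets_from_start.sh is a bertPrep.py call like the one below; the exact line and flags may differ in your checkout:)

```bash
# In data/create_datasets_from_start.sh (sketch; exact invocation may differ):
# python3 /workspace/bert/data/bertPrep.py --action download --dataset bookscorpus
```

From the attached error_log.txt, the wikicorpus step is also failing: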

Downloading: https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
** Download file already exists, skipping download
Unzipping: wikicorpus_en.xml.bz2

bzip2: Compressed file ends unexpectedly;
        perhaps it is corrupted?  *Possible* reason follows.
bzip2: No such file or directory
        Input file = /workspace/bert/data/download/wikicorpus_en/wikicorpus_en.xml.bz2, output file = /workspace/bert/data/download/wikicorpus_en/wikicorpus_en.xml

It is possible that the compressed file(s) have become corrupted.
You can use the -tvv option to test integrity of such files.

You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.

bzip2: Deleting output file /workspace/bert/data/download/wikicorpus_en/wikicorpus_en.xml, if it exists.
Traceback (most recent call last):
  File "/workspace/bert/data/bertPrep.py", line 387, in <module>
    main(args)
  File "/workspace/bert/data/bertPrep.py", line 61, in main
    downloader.download()
  File "/workspace/bert/data/Downloader.py", line 33, in download
    self.download_wikicorpus('en')
  File "/workspace/bert/data/Downloader.py", line 95, in download_wikicorpus
    downloader.download()
  File "/workspace/bert/data/WikiDownloader.py", line 54, in download
    subprocess.run('bzip2 -dk ' + self.save_path + '/' + filename, shell=True, check=True)
  File "/usr/lib/python3.6/subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'bzip2 -dk /workspace/bert/data/download/wikicorpus_en/wikicorpus_en.xml.bz2' returned non-zero exit status 2.

The error you are seeing seems to be a result of the wikicorpus download failing as well - can you check the permissions/downloads to verify that you see /workspace/bert/data/download/wikicorpus_en/wikicorpus_en.xml.bz2?
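One way to check that archive and, if the integrity test fails, refetch it (a sketch; the URL is taken from the log above, the wget flags are an assumption):

```bash
# Verify the Wikipedia dump is intact; resume the download if it is corrupted (sketch).
BZ2=/workspace/bert/data/download/wikicorpus_en/wikicorpus_en.xml.bz2
ls -la "$BZ2"
bzip2 -tvv "$BZ2" || wget -c -O "$BZ2" \
    https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2  # -c resumes a partial download
```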

vilmara commented 4 years ago

I have modified the create_datasets_from_start.sh file to download, format, shard, and create tfrecords only for the English Wikipedia dataset, which is still available (using the wikicorpus_en argument), and to ignore the bookscorpus dataset completely throughout the script (it was grouping wiki+books under the books_wiki_en_corpus argument). This may not be necessary with the latest repo, since NVIDIA recently updated the file so that it no longer downloads and preprocesses the BookCorpus dataset "due to recent issues with the host server".
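For anyone following along, a Wikipedia-only variant might look roughly like the sketch below, assuming bertPrep.py's --action/--dataset interface used by create_datasets_from_start.sh (flag names and supported dataset values may differ between repo versions):

```bash
# Wikipedia-only preprocessing (sketch): run each stage for wikicorpus_en instead
# of books_wiki_en_corpus. Flags are assumptions and may differ by repo version.
DATASET=wikicorpus_en
python3 /workspace/bert/data/bertPrep.py --action download               --dataset $DATASET
python3 /workspace/bert/data/bertPrep.py --action text_formatting        --dataset $DATASET
python3 /workspace/bert/data/bertPrep.py --action sharding               --dataset $DATASET
python3 /workspace/bert/data/bertPrep.py --action create_tfrecord_files  --dataset $DATASET
```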