David-Levinthal closed 5 years ago
bookcorpus/download_files.py was cloned from this repository https://github.com/soskek/bookcorpus
Yes, I found that, and I had more firewall/HTTP issues yesterday. I tried again just now, and perhaps it now works intermittently, but I get messages like:
Failed to open https://www.smashwords.com/books/download/490185/8/latest/0/0/existence.epub
HTTPError: HTTP Error 503: Service Temporarily Unavailable
Succeeded in opening https://www.smashwords.com/books/download/490185/8/latest/0/0/existence.epub
Trying just the bookcorpus download distributed here results in:
~/DeepLearningExamples/TensorFlow/LanguageModeling/BERT$ sudo bash scripts/data_download4.sh
NVIDIA Release 19.03 (build 5809531) TensorFlow Version 1.13.1
Container image Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. Copyright 2017-2018 The TensorFlow Authors. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION. All rights reserved. NVIDIA modifications are covered by the license terms that apply to the underlying project or file.
ERROR: Detected MOFED driver 3.0-1, but this container has version 4.4-1.0.0. Unable to automatically upgrade this container. Use of RDMA for multi-node communication will be unreliable.
NOTE: MOFED driver was detected, but nv_peer_mem driver was not detected. Multi-node communication performance may be reduced.
0 files had already been saved in /workspace/bert/data/bookcorpus/download.
Failed to open https://www.smashwords.com/books/download/246580/6/latest/0/0/silence.txt
URLError: <urlopen error [Errno -3] Temporary failure in name resolution>
Failed to open https://www.smashwords.com/books/download/246580/6/latest/0/0/silence.txt
URLError: <urlopen error [Errno -3] Temporary failure in name resolution>
Gave up to open https://www.smashwords.com/books/download/246580/6/latest/0/0/silence.txt
local variable 'response' referenced before assignment
Failed to open https://www.smashwords.com/books/download/88690/6/latest/0/0/how-to-be-free.txt
URLError: <urlopen error [Errno -3] Temporary failure in name resolution>
Failed to open https://www.smashwords.com/books/download/88690/6/latest/0/0/how-to-be-free.txt
URLError: <urlopen error [Errno -3] Temporary failure in name resolution>
Gave up to open https://www.smashwords.com/books/download/88690/6/latest/0/0/how-to-be-free.txt
local variable 'response' referenced before assignment
Hi David,
The 503 error is likely due to server overload or maintenance and is unfortunately outside of our control. In my experience, retrying a couple of hours later tends to work. On the upside, the downloader script for BookCorpus is smart enough to skip already downloaded items, so repeated attempts to get all of the books only fetch what is still missing.
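The retry-and-skip behaviour described above can be sketched as follows. This is a minimal illustration, not the code of the actual downloader script: the function name `download_with_retry`, its parameters, and the injectable `opener` argument are all hypothetical. It only reads the response inside the `try` block, which is one way to avoid an error like the "local variable 'response' referenced before assignment" message in the log above.

```python
import os
import time
import urllib.error
import urllib.request


def download_with_retry(url, out_path, retries=3, wait_seconds=5.0,
                        opener=urllib.request.urlopen):
    """Fetch url into out_path, skipping files that already exist.

    Hypothetical helper; returns "skipped", "saved", or "failed".
    """
    if os.path.exists(out_path):
        return "skipped"  # already downloaded in an earlier run

    for attempt in range(1, retries + 1):
        try:
            # The response is only used inside the try block, so a failed
            # attempt never leaves it undefined.
            with opener(url, timeout=30) as response:
                data = response.read()
        except (urllib.error.URLError, OSError) as exc:
            # HTTPError (for example a 503) is a subclass of URLError.
            print(f"Failed to open {url} (attempt {attempt}/{retries}): {exc}")
            if attempt < retries:
                time.sleep(wait_seconds * attempt)  # wait longer each time
            continue
        with open(out_path, "wb") as out_file:
            out_file.write(data)
        return "saved"
    return "failed"
```

Calling it again for the same `out_path` after a successful run returns "skipped" without touching the network, which is why re-running a download a few hours later is cheap.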
Docker can be configured to use the HTTP_PROXY and HTTPS_PROXY environment variables. These can be passed in manually by modifying the script here: add '-e HTTP_PROXY=your.httpproxyserver.com:optionalport' to the docker run command. The same step can be repeated for HTTPS_PROXY.
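To confirm that the proxy variables actually reached the process inside the container, a quick check like the one below may help. It is a sketch under one assumption: the `URLError: <urlopen error ...>` messages in the log suggest the downloader goes through Python's `urllib`, which picks up proxies from the environment. The function name `proxies_seen_by_urllib` is made up for this example.

```python
import urllib.request


def proxies_seen_by_urllib():
    """Return the scheme-to-proxy mapping that urllib detects."""
    # getproxies() reads variables such as http_proxy / HTTP_PROXY and
    # https_proxy / HTTPS_PROXY from the environment.
    return urllib.request.getproxies()


if __name__ == "__main__":
    detected = proxies_seen_by_urllib()
    for scheme in ("http", "https"):
        print(scheme, "->", detected.get(scheme, "<not set>"))
```

If either scheme shows as not set when run inside the container, the '-e' options did not take effect on the docker run command.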
If you prefer to do this step outside of the container, you can clone the BookCorpus downloader repo on your host machine, then copy and modify this line of code. The resulting download directory can be mounted via the docker run command referenced above. Hopefully this helps. Please let us know if you continue to experience problems.
It seems I cannot get Docker to correctly access DNS and resolve host names, so I have had to run the data downloads manually. However, I cannot find the download_files.py script needed for the following line in bookcorpus/run_preprocessing.sh
python3 /workspace/bookcorpus/download_files.py --list /workspace/bookcorpus/url_list.jsonl --out ${WORKING_DIR}/download --trash-bad-count
which makes the bookcorpus a bit difficult to get
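For the name-resolution failure mentioned above ("[Errno -3] Temporary failure in name resolution"), a small check run both on the host and inside the container can show whether DNS is the part that differs. This is only a diagnostic sketch; the helper name `can_resolve` is hypothetical, and the host name is taken from the URLs in the log.

```python
import socket


def can_resolve(host, port=443):
    """Return True if the host name resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(host, port)) > 0
    except socket.gaierror as exc:
        # This is the same class of error that urllib wraps in URLError.
        print(f"Cannot resolve {host}: {exc}")
        return False


if __name__ == "__main__":
    print(can_resolve("www.smashwords.com"))
```

If this succeeds on the host but fails in the container, the problem lies in the container's DNS configuration rather than in the download script.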