State-of-the-Art Deep Learning scripts organized by models - easy to train and deploy with reproducible accuracy and performance on enterprise-grade infrastructure.
[ELECTRA/TF2] Creation Of Datasets Should Check for Existence Of unzip'd File To Avoid Error Messages #1320
Is your feature request related to a problem? Please describe.
I have previously run the README command to download the wiki data:
/workspace/electra/data/create_datasets_from_start.sh wiki_only
It spent a long time downloading the bzip2 file, and then a long time unzipping it to the roughly 90 GB file:
-rw-r--r-- 1 nobody nogroup 94,992,294,413 Jun 28 16:29 wikicorpus_en.xml
I wanted to rerun the script, to re-create the datasets.
The script correctly spots that the bz2 file already exists, and doesn't attempt to re-download it:
Downloading: https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
** Download file already exists, skipping download
However, it does not seem to spot that the file has previously been unzip'd, and tries to re-unzip it:
Unzipping: wikicorpus_en.xml.bz2
bzip2: Can't create output file /workspace/electra/data/download/wikicorpus_en/wikicorpus_en.xml: File exists.
Traceback (most recent call last):
File "/workspace/electra/data/dataPrep.py", line 312, in <module>
main(args)
File "/workspace/electra/data/dataPrep.py", line 59, in main
downloader.download()
File "/workspace/electra/data/Downloader.py", line 33, in download
self.download_wikicorpus('en')
File "/workspace/electra/data/Downloader.py", line 71, in download_wikicorpus
downloader.download()
File "/workspace/electra/data/WikiDownloader.py", line 54, in download
subprocess.run('bzip2 -dk ' + self.save_path + '/' + filename, shell=True, check=True)
File "/usr/lib/python3.6/subprocess.py", line 438, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'bzip2 -dk /workspace/electra/data/download/wikicorpus_en/wikicorpus_en.xml.bz2' returned non-zero exit status 1.
Describe the solution you'd like
The end result is fine, in that the script continues, but it would be good to check whether the unzip'd file already exists before running bzip2, so as to avoid unnecessary error messages and Python traceback output.
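As a rough illustration of the kind of check meant here, a minimal sketch follows. The function name decompress_if_needed, its runner parameter, and the wording of the skip message are hypothetical and not taken from the repository; the actual code in WikiDownloader.py may be structured differently. It assumes the unzip'd file sits next to the archive with the .bz2 suffix removed, which is what bzip2 -dk produces.

```python
import os
import subprocess


def decompress_if_needed(bz2_path, runner=subprocess.run):
    """Run `bzip2 -dk` on bz2_path unless the decompressed file already exists.

    Returns True if bzip2 was invoked, False if the step was skipped.
    `runner` is a hypothetical hook so the behaviour can be exercised
    without actually calling bzip2.
    """
    if not bz2_path.endswith(".bz2"):
        raise ValueError("expected a .bz2 file: " + bz2_path)

    # bzip2 -d writes its output to the input path minus the ".bz2" suffix.
    out_path = bz2_path[: -len(".bz2")]

    if os.path.exists(out_path):
        # Hypothetical message, modelled on the existing download message.
        print("** Unzipped file already exists, skipping unzip")
        return False

    print("Unzipping:", os.path.basename(bz2_path))
    # Argument list instead of shell=True avoids quoting problems in paths;
    # -k keeps the .bz2 archive, as the current command does.
    runner(["bzip2", "-dk", bz2_path], check=True)
    return True
```

In this sketch the existing file is never touched or removed, so a previous unzip is not repeated. Note that a plain existence check cannot tell a complete file from a partial one left behind by an interrupted run.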
Describe alternatives you've considered
It could be documented in the README.
The script should definitely not remove any unzip'd file, as it takes a substantial amount of time to unzip.
Related to ELECTRA/TF2
Additional context
None