[ELECTRA/TF2] Creation Of Datasets Should Check for Existence Of unzip'd File To Avoid Error Messages

Related to ELECTRA/TF2

Is your feature request related to a problem? Please describe.

I previously ran the README command to download the wiki data:

    /workspace/electra/data/create_datasets_from_start.sh wiki_only

It spent a long time downloading the bzip2 file, and then a long time decompressing it into a file of roughly 90 GB:

    -rw-r--r-- 1 nobody nogroup 94,992,294,413 Jun 28 16:29 wikicorpus_en.xml

I wanted to rerun the script, to re-create the datasets. The script correctly spots that the bz2 file already exists, and doesn't attempt to re-download it:

    Downloading: https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
    ** Download file already exists, skipping download

However, it does not seem to spot that the file has previously been unzip'd, and tries to re-unzip it:

    Unzipping: wikicorpus_en.xml.bz2
    bzip2: Can't create output file /workspace/electra/data/download/wikicorpus_en/wikicorpus_en.xml: File exists.
    Traceback (most recent call last):
      File "/workspace/electra/data/dataPrep.py", line 312, in <module>
        main(args)
      File "/workspace/electra/data/dataPrep.py", line 59, in main
        downloader.download()
      File "/workspace/electra/data/Downloader.py", line 33, in download
        self.download_wikicorpus('en')
      File "/workspace/electra/data/Downloader.py", line 71, in download_wikicorpus
        downloader.download()
      File "/workspace/electra/data/WikiDownloader.py", line 54, in download
        subprocess.run('bzip2 -dk ' + self.save_path + '/' + filename, shell=True, check=True)
      File "/usr/lib/python3.6/subprocess.py", line 438, in run
        output=stdout, stderr=stderr)
    subprocess.CalledProcessError: Command 'bzip2 -dk /workspace/electra/data/download/wikicorpus_en/wikicorpus_en.xml.bz2' returned non-zero exit status 1.

Describe the solution you'd like

The end result is fine, in that the script continues, but it would be good to check whether the unzip'd file already exists before running bzip2, so as to avoid unnecessary error messages and Python traceback output.
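A minimal sketch of the kind of check I have in mind, assuming the decompressed file sits next to the archive with the `.bz2` suffix removed (which is what `bzip2 -dk` produces). The helper name `decompress_if_needed` and the skip message are hypothetical, modelled on the existing download message; this is not the current code in `WikiDownloader.py`.

```python
import os
import subprocess


def decompress_if_needed(bz2_path):
    """Run `bzip2 -dk` on bz2_path unless the decompressed file already exists.

    Hypothetical helper for illustration only.
    Returns True if decompression ran, False if it was skipped.
    """
    if not bz2_path.endswith('.bz2'):
        raise ValueError('expected a .bz2 file: ' + bz2_path)

    # bzip2 -d writes its output to the same path minus the .bz2 suffix.
    output_path = bz2_path[:-len('.bz2')]

    if os.path.isfile(output_path):
        # Mirror the wording of the existing "already exists" download message.
        print('** Unzipped file already exists, skipping unzip')
        return False

    print('Unzipping:', os.path.basename(bz2_path))
    # Argument list instead of shell=True avoids quoting problems in paths;
    # -k keeps the original archive, as the current command does.
    subprocess.run(['bzip2', '-dk', bz2_path], check=True)
    return True
```

One caveat with this approach: an existence check cannot tell a complete file from a partial one left behind by an interrupted run, so a stricter version might need an additional size or marker-file check.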

Describe alternatives you've considered

The behaviour could be documented in the README instead. The script should definitely not remove an existing unzip'd file, as it takes a substantial amount of time to unzip.

Additional context

None

NVIDIA / DeepLearningExamples

[ELECTRA/TF2] Creation Of Datasets Should Check for Existence Of unzip'd File To Avoid Error Messages #1320