bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

dbNSFP - End-of-centdir-64 signature not where expected #3440

djb17 closed this issue 3 years ago

djb17 commented 3 years ago

Version info

To Reproduce

Exact bcbio command you have used:

bcbio_nextgen.py upgrade -u skip --datatarget dbnsfp --genomes hg38

Observed behavior

Error message or bcbio output:

2021-02-22 01:53:15 (2.81 MB/s) - ‘dbNSFP4.1a.zip’ saved [30335259650]

error: End-of-centdir-64 signature not where expected (prepended bytes?)
  (attempting to process anyway)
warning [dbNSFP4.1a.zip]:  26040288077 extra bytes at beginning or within zipfile
  (attempting to process anyway)
   skipping: dbNSFP4.1a_variant.chrM.gz  need PK compat. v4.5 (can do v2.1)
Traceback (most recent call last):
  File "/home/local/bcbio/anaconda/bin/bcbio_nextgen.py", line 228, in <module>
    install.upgrade_bcbio(kwargs["args"])
  File "/home/local/bcbio/anaconda/lib/python3.6/site-packages/bcbio/install.py", line 107, in upgrade_bcbio
    upgrade_bcbio_data(args, REMOTES)
  File "/home/local/bcbio/anaconda/lib/python3.6/site-packages/bcbio/install.py", line 359, in upgrade_bcbio_data
    args.cores, ["ggd", "s3", "raw"])
  File "/home/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/genomes.py", line 354, in install_data_local
    _prep_genomes(env, genomes, genome_indexes, ready_approaches, data_filedir)
  File "/home/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/genomes.py", line 480, in _prep_genomes
    retrieve_fn(env, manager, gid, idx)
  File "/home/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/genomes.py", line 875, in _install_with_ggd
    ggd.install_recipe(os.getcwd(), env.system_install, recipe_file, gid)
  File "/home/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/ggd.py", line 30, in install_recipe
    recipe["recipe"]["full"]["recipe_type"], system_install)
  File "/home/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/ggd.py", line 62, in _run_recipe
    subprocess.check_output(["bash", run_file])
  File "/home/local/bcbio/anaconda/lib/python3.6/subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "/home/local/bcbio/anaconda/lib/python3.6/subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['bash', '/home/local/bcbio/genomes/Hsapiens/hg38/txtmp/ggd-run.sh']' returned non-zero exit status 81.

Additional context

On my 1st attempt I assumed the download had been interrupted and the file was corrupt, but the same error occurred on my 2nd and 3rd attempts.

I think this is related to the previous issue https://github.com/bcbio/bcbio-nextgen/issues/913#issue-92009357, since ggd-run.sh still seems to use unzip instead of p7zip to extract dbNSFP4.1a.zip.
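The "need PK compat. v4.5 (can do v2.1)" line suggests the archive uses Zip64 extensions that this unzip build cannot read. A minimal sketch for checking that, assuming an Info-ZIP unzip whose `unzip -v` output lists `ZIP64_SUPPORT` among its compile options when Zip64 is available (the helper name `unzip_has_zip64` is hypothetical, not something bcbio or cloudbiolinux provides):

```shell
# Hypothetical helper, not part of bcbio or the ggd recipes.
# Succeeds (exit status 0) when the unzip found on PATH reports Zip64 support
# in its `unzip -v` banner; fails otherwise, including when unzip is missing.
unzip_has_zip64() {
  unzip -v 2>/dev/null | grep -q 'ZIP64_SUPPORT'
}
```

For example, `unzip_has_zip64 || echo "unzip lacks Zip64 support"` would flag a build that is likely to fail on a 29 GiB archive like this one.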

EDIT#1: added relevant snippet of code from ggd-run.sh

if [ ! -f dbNSFP.txt.gz ]; then
  UNPACK_DIR=`pwd`/tmpunpack
  mkdir -p $UNPACK_DIR
  unzip dbNSFP*.zip "dbNSFP*_variant.chrM.gz" # Potentially problematic line?
  gunzip dbNSFP*_variant.chrM.gz
  head -n1 dbNSFP*_variant.chrM > $UNPACK_DIR/header.txt
  rm dbNSFP*_variant.chrM
  # unzip only files with chromosomal info, eg. skip genes and readme.
  cat $UNPACK_DIR/header.txt > dbNSFP.txt
  unzip -p dbNSFP*.zip "dbNSFP*_variant.chr*.gz" | gunzip -c | grep -v '^#chr' | sort -T $UNPACK_DIR -k1,1 -k2,2n >> dbNSFP.txt
  bgzip dbNSFP.txt
  #extract readme file, used by VEP plugin to add vcf header info
  unzip -p dbNSFP*.zip "*readme.txt" > dbNSFP.readme.txt
fi
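A possible workaround, sketched under the assumption that p7zip's `7z` is on the PATH: stream the matching archive members with `7z e -so` where the recipe calls `unzip -p`, falling back to unzip when `7z` is absent. The function name `zip_cat` is hypothetical and this is not the actual change made to the ggd recipe.

```shell
# Hypothetical helper, not taken from the ggd recipe.
# Streams every archive member matching a pattern to stdout, preferring 7z
# and falling back to unzip when 7z is not available.
zip_cat() {
  archive=$1
  pattern=$2
  if command -v 7z >/dev/null 2>&1; then
    # e = extract without directory paths, -so = write extracted data to stdout
    7z e -so "$archive" "$pattern"
  else
    # -p = pipe extracted data to stdout
    unzip -p "$archive" "$pattern"
  fi
}
```

With that in place, the chromosome step could read `zip_cat dbNSFP4.1a.zip "dbNSFP*_variant.chr*.gz" | gunzip -c | grep -v '^#chr' | sort -T $UNPACK_DIR -k1,1 -k2,2n >> dbNSFP.txt`, leaving the rest of the pipeline unchanged.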

EDIT#2: added result from file integrity check using 7z

7z t dbNSFP4.1a.zip

7-Zip [64] 15.09 beta : Copyright (c) 1999-2015 Igor Pavlov : 2015-10-16
p7zip Version 15.09 beta (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,24 CPUs x64)

Scanning the drive for archives:
1 file, 30335259650 bytes (29 GiB)

Testing archive: dbNSFP4.1a.zip
--         
Path = dbNSFP4.1a.zip
Type = zip
Physical Size = 30335259650
64-bit = +

Everything is Ok                     

Files: 36
Size:       30471606961
Compressed: 30335259650
roryk commented 3 years ago

Wow, sorry you had to figure this out yourself. Thank you so much for tracking down the problem; I agree it's probably caused by the use of unzip. The ggd recipes are here: https://github.com/chapmanb/cloudbiolinux/tree/master/ggd-recipes. Do you want to open a PR there that swaps those recipes over to p7zip?

djb17 commented 3 years ago

Seems to be working now. Thanks!

roryk commented 3 years ago

Thanks for the fix!