Frustration downloading gnomad genomes

kokyriakidis commented 5 years ago

Although gnomad.genomes.r2.1.sites.grch38.chr1_noVEP.vcf.gz and its *.tbi index finished downloading, both files then went missing and gnomad.genomes.r2.1.sites.grch38.chr1_noVEP.vcf.gz started downloading again. Any thoughts? @roryk

Saving to: ‘gnomad.genomes.r2.1.sites.grch38.chr1_noVEP.vcf.gz’

gnomad.genomes.r2.1 100%[===================>]  27,10G   713KB/s    in 12h 51m 

2019-09-27 22:32:26 (614 KB/s) - ‘gnomad.genomes.r2.1.sites.grch38.chr1_noVEP.vcf.gz’ saved [29101164808/29101164808]

--2019-09-27 22:32:26--  http://ftp.ensemblorg.ebi.ac.uk/pub/data_files/homo_sapiens/GRCh38/variation_genotype/gnomad/r2.1/genomes/gnomad.genomes.r2.1.sites.grch38.chr1_noVEP.vcf.gz.tbi
Resolving ftp.ensemblorg.ebi.ac.uk (ftp.ensemblorg.ebi.ac.uk)... 193.62.193.8
Connecting to ftp.ensemblorg.ebi.ac.uk (ftp.ensemblorg.ebi.ac.uk)|193.62.193.8|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 223851 (219K) [application/x-gzip]
Saving to: ‘gnomad.genomes.r2.1.sites.grch38.chr1_noVEP.vcf.gz.tbi’

gnomad.genomes.r2.1 100%[===================>] 218,60K   441KB/s    in 0,5s    

2019-09-27 22:32:27 (441 KB/s) - ‘gnomad.genomes.r2.1.sites.grch38.chr1_noVEP.vcf.gz.tbi’ saved [223851/223851]

--2019-09-27 22:33:14--  http://ftp.ensemblorg.ebi.ac.uk/pub/data_files/homo_sapiens/GRCh38/variation_genotype/gnomad/r2.1/genomes/gnomad.genomes.r2.1.sites.grch38.chr1_noVEP.vcf.gz
Resolving ftp.ensemblorg.ebi.ac.uk (ftp.ensemblorg.ebi.ac.uk)... 193.62.193.8
Connecting to ftp.ensemblorg.ebi.ac.uk (ftp.ensemblorg.ebi.ac.uk)|193.62.193.8|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 29101164808 (27G) [application/x-gzip]
Saving to: ‘gnomad.genomes.r2.1.sites.grch38.chr1_noVEP.vcf.gz’

gnomad.genomes.r2.1 6%[>                      ]  1,69G   670KB/s    eta 13h 47m
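One way to avoid re-fetching a file that is already complete is to compare its local size with the `Length` the server reports before starting wget again. The following is a minimal sketch in Python, not what the bcbio recipe does: `needs_download` is a hypothetical helper, and the expected size is copied from the log above. It does not explain why the files went missing in the first place.

```python
import os


def needs_download(path, expected_size):
    """Return True when the local file is missing or its size differs from expected_size."""
    if not os.path.exists(path):
        return True
    return os.path.getsize(path) != expected_size


# Size in bytes that the server reported for the chr1 file in the log above.
CHR1_SIZE = 29101164808
CHR1_NAME = "gnomad.genomes.r2.1.sites.grch38.chr1_noVEP.vcf.gz"

if needs_download(CHR1_NAME, CHR1_SIZE):
    print("chr1 file is missing or incomplete, download it")
else:
    print("chr1 file is complete, skip the download")
```

A size check only catches missing or truncated files. It says nothing about corrupted content, for which a checksum would be needed.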

naumenko-sa commented 5 years ago

Hi Konstantinos @kokyriakidis!

Thanks for reporting the issue and apologies that gnomad genome installation did not work for you. I've fixed the recipe for gnomad genome and have tested it: https://github.com/chapmanb/cloudbiolinux/pull/323

Usually Brad is super-fast at merging, but if you want to try it sooner, you can substitute the recipe directly in your bcbio installation:

the recipe: https://github.com/naumenko-sa/cloudbiolinux/blob/master/ggd-recipes/hg38/gnomad.yaml
how to apply the recipe to the interrupted installation: https://github.com/chapmanb/cloudbiolinux/blob/master/doc/hacking.md#testing-a-ggd-recipe

I have finished a small test, and I started a full-size one on the machine I have.

I am very curious how long the recipe takes to run in your environment:

how long it takes to download the 460G gnomad vcf;
where you are located;
how long the filtering takes;
how large the resulting file is.

It would be a great contribution to bcbio if you provided those estimates for your system.

We had a big debate on whether we should switch to gnomad2.1.1 at all, because of its huge size compared to the previous 2.0.1 release.

Sergey

kokyriakidis commented 5 years ago

Hi @naumenko-sa

Thanks for the quick fix! It seems to work now. I will update my post with the information you requested :)

For now I can only say that the change you made had a great impact! I can now download the files at my maximum speed! This is a HUGE plus. This was the right move, and thanks for implementing it!

The ETA is 19h 32m at my maximum download speed (10-11MB/s). With the previous files I had a 500kbps download speed and an ETA of 16h for the chr1 file alone. The size is 743GB. The location is a RED 6TB drive attached to an Ubuntu PC.
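These figures can be sanity-checked by dividing the size by the transfer rate. Here is a small sketch, assuming binary units (1 KB = 1024 bytes, as wget reports them) and a constant rate; `eta_hours` is an illustrative helper name.

```python
def eta_hours(size_bytes, rate_bytes_per_s):
    """Estimated transfer time in hours at a constant rate."""
    return size_bytes / rate_bytes_per_s / 3600


KB, MB, GB = 1024, 1024 ** 2, 1024 ** 3

# chr1 from the old source: 29101164808 bytes at the 614 KB/s average wget reported.
print(f"chr1 at 614 KB/s: {eta_hours(29101164808, 614 * KB):.1f} h")

# Full download from the new source: 743 GB at 10 and 11 MB/s.
for rate_mb in (10, 11):
    print(f"743 GB at {rate_mb} MB/s: {eta_hours(743 * GB, rate_mb * MB):.1f} h")
```

The chr1 estimate comes out at roughly 12.9 hours, in line with the 12h 51m wget logged for that file. The reported 19h 32m ETA for the full set falls within the range bounded by the 10 and 11 MB/s rates.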

kokyriakidis commented 5 years ago

@naumenko-sa @roryk

I got these warnings when the gnomad download finished and the filtering step started:

Warning: The tag "AC_nfe_bgr" not defined in the header
Warning: The tag "AN_nfe_bgr" not defined in the header
Warning: The tag "AF_nfe_bgr" not defined in the header
Warning: The tag "nhomalt_nfe_bgr" not defined in the header
Warning: The tag "AC_sas_male" not defined in the header
Warning: The tag "AN_sas_male" not defined in the header
Warning: The tag "AF_sas_male" not defined in the header
Warning: The tag "nhomalt_sas_male" not defined in the header
Warning: The tag "AC_sas" not defined in the header
Warning: The tag "AN_sas" not defined in the header
Warning: The tag "AF_sas" not defined in the header
Warning: The tag "nhomalt_sas" not defined in the header
Warning: The tag "AC_nfe_swe" not defined in the header
Warning: The tag "AN_nfe_swe" not defined in the header
Warning: The tag "AF_nfe_swe" not defined in the header
Warning: The tag "nhomalt_nfe_swe" not defined in the header
Warning: The tag "AC_eas_jpn" not defined in the header
Warning: The tag "AN_eas_jpn" not defined in the header
Warning: The tag "AF_eas_jpn" not defined in the header
Warning: The tag "nhomalt_eas_jpn" not defined in the header
Warning: The tag "AC_eas_kor" not defined in the header
Warning: The tag "AN_eas_kor" not defined in the header
Warning: The tag "AF_eas_kor" not defined in the header
Warning: The tag "nhomalt_eas_kor" not defined in the header
Warning: The tag "AC_eas_oea" not defined in the header
Warning: The tag "AN_eas_oea" not defined in the header
Warning: The tag "AF_eas_oea" not defined in the header
Warning: The tag "nhomalt_eas_oea" not defined in the header
Warning: The tag "AC_sas_female" not defined in the header
Warning: The tag "AN_sas_female" not defined in the header
Warning: The tag "AF_sas_female" not defined in the header
Warning: The tag "nhomalt_sas_female" not defined in the header

Is this something to worry about?

Filtering took 13h on a 7900x processor. The final size is 55GB.

kokyriakidis commented 5 years ago

I also got the following error after some time, but I reran my command and everything works now.

Running GGD recipe: hg38 dream-syn4-crossmap 2014-08-09
--2019-09-30 01:42:37--  https://s3.amazonaws.com/bcbio_nextgen/dream/synthetic_challenge_set4_tumour_25pctmasked_truth-crossmap-hg38.tar.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... failed: Name or service not known.
wget: unable to resolve host address ‘s3.amazonaws.com’
Traceback (most recent call last):
  File "/RED/RESOURCES/bcbio/anaconda/bin/bcbio_nextgen.py", line 228, in <module>
    install.upgrade_bcbio(kwargs["args"])
  File "/RED/RESOURCES/bcbio/anaconda/lib/python3.6/site-packages/bcbio/install.py", line 106, in upgrade_bcbio
    upgrade_bcbio_data(args, REMOTES)
  File "/RED/RESOURCES/bcbio/anaconda/lib/python3.6/site-packages/bcbio/install.py", line 348, in upgrade_bcbio_data
    args.cores, ["ggd", "s3", "raw"])
  File "/RED/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/genomes.py", line 354, in install_data_local
    _prep_genomes(env, genomes, genome_indexes, ready_approaches, data_filedir)
  File "/RED/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/genomes.py", line 480, in _prep_genomes
    retrieve_fn(env, manager, gid, idx)
  File "/RED/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/genomes.py", line 868, in _install_with_ggd
    ggd.install_recipe(os.getcwd(), env.system_install, recipe_file, gid)
  File "/RED/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/ggd.py", line 30, in install_recipe
    recipe["recipe"]["full"]["recipe_type"], system_install)
  File "/RED/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/ggd.py", line 62, in _run_recipe
    subprocess.check_output(["bash", run_file])
  File "/RED/RESOURCES/bcbio/anaconda/lib/python3.6/subprocess.py", line 336, in check_output
    **kwargs).stdout
  File "/RED/RESOURCES/bcbio/anaconda/lib/python3.6/subprocess.py", line 418, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['bash', '/RED/RESOURCES/bcbio/genomes/Hsapiens/hg38/txtmp/ggd-run.sh']' returned non-zero exit status 4.
Traceback (most recent call last):
  File "bcbio_nextgen_install.py", line 288, in <module>
    main(parser.parse_args(), sys.argv[1:])
  File "bcbio_nextgen_install.py", line 45, in main
    subprocess.check_call([bcbio, "upgrade"] + _clean_args(sys_argv, args))
  File "/usr/lib/python2.7/subprocess.py", line 190, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/RED/RESOURCES/bcbio/anaconda/bin/bcbio_nextgen.py', 'upgrade', '--tooldir=/RED/TOOLS', '--genomes', 'hg38', '--distribution', 'ubuntu', '-u', 'development', '--aligners', 'bowtie', '--aligners', 'bowtie2', '--aligners', 'bwa', '--aligners', 'minimap2', '--aligners', 'star', '--aligners', 'hisat2', '--datatarget', 'variation', '--datatarget', 'rnaseq', '--datatarget', 'smallrna', '--datatarget', 'gemini', '--datatarget', 'cadd', '--datatarget', 'vep', '--datatarget', 'dbnsfp', '--datatarget', 'battenberg', '--datatarget', 'kraken', '--datatarget', 'ericscript', '--datatarget', 'gnomad', '--cores', '20', '--toolplus', 'gatk=/home/kokyriakidis/gatk-4.1.2.0/gatk-package-4.1.2.0-local.jar', '--data']' returned non-zero exit status 1
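The failure above starts with a DNS lookup error ("Name or service not known"), and wget documents exit status 4 as a network failure, so this looks like a transient problem that a retry can get past. Below is a minimal sketch of wrapping a command in retries. `run_with_retries` is a hypothetical helper, not part of bcbio or cloudbiolinux, and the command shown is a stand-in for the real upgrade command.

```python
import subprocess
import sys
import time


def run_with_retries(cmd, attempts=3, delay_s=60):
    """Run cmd and retry on a non-zero exit; re-raise the error after the last attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return subprocess.check_call(cmd)
        except subprocess.CalledProcessError:
            if attempt == attempts:
                raise
            # Wait before the next attempt so a short network outage can clear.
            time.sleep(delay_s)


# Stand-in command; in practice this would be the bcbio upgrade command line.
run_with_retries([sys.executable, "-c", "print('ok')"], attempts=2, delay_s=0)
```

Retrying blindly only suits transient errors like this one. A recipe that fails for the same reason every time would simply fail again after the last attempt.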

naumenko-sa commented 5 years ago

Hi Konstantinos @kokyriakidis!

Thanks for testing the recipe!

I think it is safe to ignore these warnings: we use the same list of vcf tags for both the gnomad genome and exome files, and it looks like the genome file does not contain certain sub-populations, e.g. bgr (Bulgarian). Probably there are just no genomes from these sub-populations to include, although exomes were sequenced for them.
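To see which requested tags a given file lacks, one can compare the tag list against the INFO IDs declared in the VCF header. This is a minimal sketch with a toy header; the header lines and the `undefined_tags` helper are illustrative and not taken from the real gnomad files or from the recipe.

```python
import re

# Matches header lines such as: ##INFO=<ID=AF,Number=A,Type=Float,Description="...">
INFO_ID = re.compile(r"^##INFO=<ID=([^,>]+)")


def undefined_tags(header_lines, requested):
    """Return the requested INFO tags that the header does not define, in request order."""
    defined = set()
    for line in header_lines:
        match = INFO_ID.match(line)
        if match:
            defined.add(match.group(1))
    return [tag for tag in requested if tag not in defined]


# Toy header for illustration only.
toy_header = [
    "##fileformat=VCFv4.2",
    '##INFO=<ID=AC,Number=A,Type=Integer,Description="toy">',
    '##INFO=<ID=AF,Number=A,Type=Float,Description="toy">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
]

print(undefined_tags(toy_header, ["AC", "AF", "AC_sas", "AF_nfe_bgr"]))
```

With the toy header, the two sub-population tags are reported as undefined, which mirrors the situation behind the warnings.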

Thanks for benchmarking as well! So it was 19h for downloading + 13h for filtering = 32h in total, and the file size was reduced from 743G to 55G.

I think it is a reasonable price to pay for having gnomad 2.1.1 for variant prioritization, so we don't need to switch back to 2.0.1.

Sergey

kokyriakidis commented 5 years ago

Yes! Just stick to 2.1.1, which has better download times! Thanks for your help! :-)

bcbio / bcbio-nextgen

Frustration downloading gnomad genomes #2956