Closed by kokyriakidis 5 years ago
Hi Konstantinos @kokyriakidis !
Thanks for reporting the issue and apologies that gnomad genome installation did not work for you. I've fixed the recipe for gnomad genome and have tested it: https://github.com/chapmanb/cloudbiolinux/pull/323
Usually Brad is super-fast at merging, but if you want to try it even sooner, you could substitute the recipe directly in your bcbio installation:
I have finished a small test and have started a full-size one on my machine.
I am very curious what the running time of the recipe will be in your environment.
It would be a great contribution to bcbio if you provided those estimates for your system.
We had a big debate on whether we should switch to gnomad2.1.1 at all, because of its huge size compared to the previous 2.0.1 release.
Sergey
Hi @naumenko-sa
Thanks for the quick fix! It seems to work now. I will update my post with the information you requested :)
For now I can only say that the change you made had a great impact! I can now download the files at my maximum speed. This is a HUGE plus. It was the right move, and thanks for implementing it!
The ETA is 19h 32m at my maximum download speed (10-11MB/s). With the previous files I had a 500kbps download speed and an ETA of 16h for the chr1 file alone. The total size is 743GB. The download location is a RED 6TB drive attached to an Ubuntu PC.
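As a sanity check on the reported ETA, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the helper name `eta_hours` is hypothetical, and it assumes a constant transfer rate and 1 GB = 1000 MB. At 10-11 MB/s, 743GB works out to roughly 18.8 to 20.6 hours, which brackets the 19h 32m estimate above.

```python
def eta_hours(size_gb, speed_mb_per_s):
    """Hours needed to transfer size_gb at a sustained rate (assumes 1 GB = 1000 MB)."""
    seconds = size_gb * 1000 / speed_mb_per_s
    return seconds / 3600


# Size and speeds are the figures quoted in the post above.
for speed in (10, 11):
    print(f"{speed} MB/s -> {eta_hours(743, speed):.1f} h")
```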
@naumenko-sa @roryk
I got these warnings when the gnomad download finished and the filtering step started:
Warning: The tag "AC_nfe_bgr" not defined in the header
Warning: The tag "AN_nfe_bgr" not defined in the header
Warning: The tag "AF_nfe_bgr" not defined in the header
Warning: The tag "nhomalt_nfe_bgr" not defined in the header
Warning: The tag "AC_sas_male" not defined in the header
Warning: The tag "AN_sas_male" not defined in the header
Warning: The tag "AF_sas_male" not defined in the header
Warning: The tag "nhomalt_sas_male" not defined in the header
Warning: The tag "AC_sas" not defined in the header
Warning: The tag "AN_sas" not defined in the header
Warning: The tag "AF_sas" not defined in the header
Warning: The tag "nhomalt_sas" not defined in the header
Warning: The tag "AC_nfe_swe" not defined in the header
Warning: The tag "AN_nfe_swe" not defined in the header
Warning: The tag "AF_nfe_swe" not defined in the header
Warning: The tag "nhomalt_nfe_swe" not defined in the header
Warning: The tag "AC_eas_jpn" not defined in the header
Warning: The tag "AN_eas_jpn" not defined in the header
Warning: The tag "AF_eas_jpn" not defined in the header
Warning: The tag "nhomalt_eas_jpn" not defined in the header
Warning: The tag "AC_eas_kor" not defined in the header
Warning: The tag "AN_eas_kor" not defined in the header
Warning: The tag "AF_eas_kor" not defined in the header
Warning: The tag "nhomalt_eas_kor" not defined in the header
Warning: The tag "AC_eas_oea" not defined in the header
Warning: The tag "AN_eas_oea" not defined in the header
Warning: The tag "AF_eas_oea" not defined in the header
Warning: The tag "nhomalt_eas_oea" not defined in the header
Warning: The tag "AC_sas_female" not defined in the header
Warning: The tag "AN_sas_female" not defined in the header
Warning: The tag "AF_sas_female" not defined in the header
Warning: The tag "nhomalt_sas_female" not defined in the header
Is this something to worry about?
Filtering took 13h on a 7900x processor. The final size is 55GB.
I also got the following error after some time, but I reran my command and everything works now.
Running GGD recipe: hg38 dream-syn4-crossmap 2014-08-09
--2019-09-30 01:42:37-- https://s3.amazonaws.com/bcbio_nextgen/dream/synthetic_challenge_set4_tumour_25pctmasked_truth-crossmap-hg38.tar.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... failed: Name or service not known.
wget: unable to resolve host address ‘s3.amazonaws.com’
Traceback (most recent call last):
File "/RED/RESOURCES/bcbio/anaconda/bin/bcbio_nextgen.py", line 228, in <module>
install.upgrade_bcbio(kwargs["args"])
File "/RED/RESOURCES/bcbio/anaconda/lib/python3.6/site-packages/bcbio/install.py", line 106, in upgrade_bcbio
upgrade_bcbio_data(args, REMOTES)
File "/RED/RESOURCES/bcbio/anaconda/lib/python3.6/site-packages/bcbio/install.py", line 348, in upgrade_bcbio_data
args.cores, ["ggd", "s3", "raw"])
File "/RED/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/genomes.py", line 354, in install_data_local
_prep_genomes(env, genomes, genome_indexes, ready_approaches, data_filedir)
File "/RED/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/genomes.py", line 480, in _prep_genomes
retrieve_fn(env, manager, gid, idx)
File "/RED/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/genomes.py", line 868, in _install_with_ggd
ggd.install_recipe(os.getcwd(), env.system_install, recipe_file, gid)
File "/RED/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/ggd.py", line 30, in install_recipe
recipe["recipe"]["full"]["recipe_type"], system_install)
File "/RED/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/ggd.py", line 62, in _run_recipe
subprocess.check_output(["bash", run_file])
File "/RED/RESOURCES/bcbio/anaconda/lib/python3.6/subprocess.py", line 336, in check_output
**kwargs).stdout
File "/RED/RESOURCES/bcbio/anaconda/lib/python3.6/subprocess.py", line 418, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['bash', '/RED/RESOURCES/bcbio/genomes/Hsapiens/hg38/txtmp/ggd-run.sh']' returned non-zero exit status 4.
Traceback (most recent call last):
File "bcbio_nextgen_install.py", line 288, in <module>
main(parser.parse_args(), sys.argv[1:])
File "bcbio_nextgen_install.py", line 45, in main
subprocess.check_call([bcbio, "upgrade"] + _clean_args(sys_argv, args))
File "/usr/lib/python2.7/subprocess.py", line 190, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/RED/RESOURCES/bcbio/anaconda/bin/bcbio_nextgen.py', 'upgrade', '--tooldir=/RED/TOOLS', '--genomes', 'hg38', '--distribution', 'ubuntu', '-u', 'development', '--aligners', 'bowtie', '--aligners', 'bowtie2', '--aligners', 'bwa', '--aligners', 'minimap2', '--aligners', 'star', '--aligners', 'hisat2', '--datatarget', 'variation', '--datatarget', 'rnaseq', '--datatarget', 'smallrna', '--datatarget', 'gemini', '--datatarget', 'cadd', '--datatarget', 'vep', '--datatarget', 'dbnsfp', '--datatarget', 'battenberg', '--datatarget', 'kraken', '--datatarget', 'ericscript', '--datatarget', 'gnomad', '--cores', '20', '--toolplus', 'gatk=/home/kokyriakidis/gatk-4.1.2.0/gatk-package-4.1.2.0-local.jar', '--data']' returned non-zero exit status 1
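The root cause in the log above is a transient DNS failure (wget could not resolve s3.amazonaws.com), and simply rerunning the command was enough. A minimal sketch of automating that rerun is shown below. It is not part of bcbio or cloudbiolinux: the helper `run_with_retries` and its parameters are hypothetical, and it assumes the command is safe to run again, as it was in this case.

```python
import subprocess
import time


def run_with_retries(cmd, attempts=3, delay_s=60.0,
                     runner=subprocess.check_call, sleep=time.sleep):
    """Run cmd, retrying on a non-zero exit status with a growing delay.

    Hypothetical helper: assumes rerunning cmd is safe after a failure.
    """
    for attempt in range(1, attempts + 1):
        try:
            return runner(cmd)
        except subprocess.CalledProcessError:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            sleep(delay_s * attempt)  # wait longer after each failure


# Example call (placeholder command, not the full upgrade invocation):
# run_with_retries(["bcbio_nextgen.py", "upgrade", "--data"])
```

The `runner` and `sleep` parameters exist only so the retry logic can be exercised without launching a real process.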
Hi Konstantinos @kokyriakidis!
Thanks for testing the recipe!
I think it is safe to ignore these warnings: we use the same list of VCF tags for both gnomad genome and exome, and it looks like the genome file does not contain certain sub-populations, e.g. bgr (Bulgarian genomes). Probably there are just no genomes from these sub-populations to include, although they had exomes sequenced.
Thanks for benchmarking as well! So it was 19h + 13h for downloading + filtering = 32h, and the file size was reduced from 743G to 55G.
I think it is a reasonable price to pay for having gnomad 2.1.1 for variant prioritization, so we don't need to switch back to 2.0.1.
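To see which requested tags a given VCF actually defines, one could compare the tag list against the `##INFO` lines of the header. The sketch below is an illustration only: the function names, the sample header, and the tag list are made up for the example and are not taken from the gnomad recipe.

```python
import re

# Matches header lines such as: ##INFO=<ID=AF_nfe,Number=A,Type=Float,...>
INFO_ID = re.compile(r"^##INFO=<ID=([^,>]+)")


def defined_info_tags(header_lines):
    """Collect the INFO tag IDs declared in a VCF header."""
    tags = set()
    for line in header_lines:
        match = INFO_ID.match(line)
        if match:
            tags.add(match.group(1))
    return tags


def split_requested_tags(requested, header_lines):
    """Return (present, missing) lists, preserving the requested order."""
    defined = defined_info_tags(header_lines)
    present = [tag for tag in requested if tag in defined]
    missing = [tag for tag in requested if tag not in defined]
    return present, missing


# Illustrative header and tag list (assumptions, not real gnomad content).
header = [
    "##fileformat=VCFv4.2",
    '##INFO=<ID=AC,Number=A,Type=Integer,Description="illustrative">',
    '##INFO=<ID=AF_nfe,Number=A,Type=Float,Description="illustrative">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
]
requested = ["AC", "AF_nfe", "AF_nfe_bgr", "AC_sas"]

present, missing = split_requested_tags(requested, header)
print(missing)  # ['AF_nfe_bgr', 'AC_sas']
```

Tags reported as missing here would correspond to the "not defined in the header" warnings: they are in the shared list but absent from this particular file.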
Sergey
Yes! Just stick to 2.1.1 which has better download times! Thanks for your help! :-)
Although it downloaded gnomad.genomes.r2.1.sites.grch38.chr1_noVEP.vcf.gz and the *.tbi, these files went missing and the file "gnomad.genomes.r2.1.sites.grch38.chr1_noVEP.vcf.gz" started downloading again. Any thoughts? @roryk