bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

gnomAD genomes ggd recipe failing on vt normalize #3327

Closed ameynert closed 3 years ago

ameynert commented 3 years ago

I've traced the error to vt normalize in this line of ggd-run.sh:

bcftools view -f PASS $vcf_file | bcftools annotate -x "^$fields_to_keep" -Ov | vt decompose -s - | vt normalize -r $ref -n - | vt uniq - | bgzip -c > variation/gnomad_genome.vcf.gz

I pulled out the first 10k lines of the downloaded gnomAD genomes VCF for testing:

[ameynert@ultra txtmp]$ ref=../seq/hg38.fa
[ameynert@ultra txtmp]$ fields_to_keep="INFO/"$(cat gnomad_fields_to_keep.txt | paste -s | sed s/"\t"/",INFO\/"/g)
[ameynert@ultra txtmp]$ zcat gnomad_genome.annotated_decomposed.vcf.gz | head -n 10000 > test10k.vcf
[ameynert@ultra txtmp]$ bcftools view -f PASS test10k.vcf | bcftools annotate -x "^$fields_to_keep" -Ov | vt decompose -s - > test10k_pass_ann_decomp.vcf
decompose v0.5

options:     input VCF file        -
         [s] smart decomposition   true (experimental)
         [o] output VCF file       -

<snipped tag-related warnings>

stats: no. variants                 : 6500
       no. biallelic variants       : 6500
       no. multiallelic variants    : 0

       no. additional biallelics    : 0
       total no. of biallelics      : 6500

Time elapsed: 0.26s

[ameynert@ultra txtmp]$ vt normalize -r $ref -n test10k_pass_ann_decomp.vcf > test10k_pass_ann_decomp_norm.vcf
normalize v0.5

options:     input VCF file                                  test10k_pass_ann_decomp.vcf
         [o] output VCF file                                 -
         [w] sorting window size                             10000
         [n] no fail on reference inconsistency for non SNPs true
         [q] quiet                                           false
         [d] debug                                           false
         [r] reference FASTA file                            ../seq/hg38.fa

Floating point exception (core dumped)

Full bcbio install log from relevant point:

2020-08-19 16:10:19 (8.78 MB/s) - ‘gnomad.genomes.r3.0.sites.vcf.bgz.tbi’ saved [2933246/2933246]

decompose vuniq vnormalize v0.50.57
0.5

options:     input VCF file        options:     input VCF file        
--options:     input VCF file                                  

-         [o] output VCF file                [s] smart decomposition   
-true         [o] output VCF file                                 
 (experimental)
-         [o] output VCF file       
-         [w] sorting window size                             

10000
         [n] no fail on reference inconsistency for non SNPs true
         [q] quiet                                           false
         [d] debug                                           false
         [r] reference FASTA file                            ../seq/hg38.fa

Warning: The tag "BaseQRankSum" not defined in the header
Warning: The tag "ClippingRankSum" not defined in the header
Warning: The tag "allele_type" not defined in the header
Warning: The tag "AC_nfe_seu" not defined in the header
Warning: The tag "AN_nfe_seu" not defined in the header
Warning: The tag "AF_nfe_seu" not defined in the header
Warning: The tag "nhomalt_nfe_seu" not defined in the header
Warning: The tag "AC_nfe_bgr" not defined in the header
Warning: The tag "AN_nfe_bgr" not defined in the header
Warning: The tag "AF_nfe_bgr" not defined in the header
Warning: The tag "nhomalt_nfe_bgr" not defined in the header
Warning: The tag "AC_nfe_onf" not defined in the header
Warning: The tag "AN_nfe_onf" not defined in the header
Warning: The tag "AF_nfe_onf" not defined in the header
Warning: The tag "nhomalt_nfe_onf" not defined in the header
Warning: The tag "AC_nfe_swe" not defined in the header
Warning: The tag "AN_nfe_swe" not defined in the header
Warning: The tag "AF_nfe_swe" not defined in the header
Warning: The tag "nhomalt_nfe_swe" not defined in the header
Warning: The tag "AC_nfe_nwe" not defined in the header
Warning: The tag "AN_nfe_nwe" not defined in the header
Warning: The tag "AF_nfe_nwe" not defined in the header
Warning: The tag "nhomalt_nfe_nwe" not defined in the header
Warning: The tag "AC_eas_jpn" not defined in the header
Warning: The tag "AN_eas_jpn" not defined in the header
Warning: The tag "AF_eas_jpn" not defined in the header
Warning: The tag "nhomalt_eas_jpn" not defined in the header
Warning: The tag "AC_eas_kor" not defined in the header
Warning: The tag "AN_eas_kor" not defined in the header
Warning: The tag "AF_eas_kor" not defined in the header
Warning: The tag "nhomalt_eas_kor" not defined in the header
Warning: The tag "AC_eas_oea" not defined in the header
Warning: The tag "AN_eas_oea" not defined in the header
Warning: The tag "AF_eas_oea" not defined in the header
Warning: The tag "nhomalt_eas_oea" not defined in the header
Warning: The tag "AC_nfe_est" not defined in the header
Warning: The tag "AN_nfe_est" not defined in the header
Warning: The tag "AF_nfe_est" not defined in the header
Warning: The tag "nhomalt_nfe_est" not defined in the header
Warning: The tag "faf95" not defined in the header
Warning: The tag "faf99" not defined in the header
Warning: The tag "popmax" not defined in the header
Warning: The tag "AC_popmax" not defined in the header
Warning: The tag "AN_popmax" not defined in the header
Warning: The tag "AF_popmax" not defined in the header
Warning: The tag "nhomalt_popmax" not defined in the header

stats: Total number of observed variants   0
       Total number of unique variants     0

Time elapsed: 0.00s

/home/u035/project/software/bcbio/genomes/Hsapiens/hg38/txtmp/ggd-run.sh: line 15: 700340 Broken pipe             bcftools view -f PASS $vcf_file
     700341                       | bcftools annotate -x "^$fields_to_keep" -Ov
     700342                       | vt decompose -s -
     700343 Floating point exception(core dumped) | vt normalize -r $ref -n -
     700344 Done                    | vt uniq -
     700345 Done                    | bgzip -c > variation/gnomad_genome.vcf.gz
Upgrading bcbio
Upgrading third party tools to latest versions
Reading packages from /home/u035/project/software/install/tmpbcbio-install/cloudbiolinux/contrib/flavor/ngs_pipeline_minimal/packages-conda.yaml
Creating conda environment: python3
Creating conda environment: samtools0
Creating conda environment: dv
Creating conda environment: python2
Creating conda environment: r36
Creating conda environment: htslib1.10
Checking for problematic or migrated packages in default environment
Initalling initial set of packages for default environment with mamba
# Installing into conda environment default: age-metasv, arriba, bamtools=2.4.0, bamutil, bbmap, bcbio-prioritize, bcbio-variation, bcbio-variation-recall, bcftools, bedops, bedtools=2.27.1, bio-vcf, biobambam, bowtie, bowtie2, break-point-inspector, bwa, bwakit, cage, cancerit-allelecount, chipseq-greylist, cnvkit, coincbc, cramtools, cufflinks, cyvcf2, deeptools, delly, duphold, ensembl-vep=100.*, express, extract-sv-reads, fastp, fastqc>=0.11.8=1, fgbio, freebayes=1.1.0.46, gatk, gatk4, geneimpacts, genesplicer, gffcompare, goleft, grabix, gridss, gsort, gvcfgenotyper, h5py, hmftools-amber, hmftools-cobalt, hmftools-purple, hmmlearn, hts-nim-tools, htslib, impute2, kallisto>=0.43.1, kraken, ldc>=1.13.0, lofreq, macs2, maxentscan, mbuffer, minimap2, mintmap, mirdeep2=2.0.0.7, mirtop, moreutils, multiqc, multiqc-bcbio, ngs-disambiguate, novoalign, octopus>=0.5.1b, oncofuse, optitype>=1.3.4, parallel, pbgzip, peddy, perl-sanger-cgp-battenberg, picard, pindel, pizzly, pyloh, pysam>=0.14.0, pythonpy, qsignature, qualimap, rapmap, razers3=3.5.0, rtg-tools, sailfish, salmon, sambamba, samblaster, samtools=1.10, scalpel, seq2c<2016, seqbuster, seqcluster, seqtk, sickle-trim, simple_sv_annotation, singlecell-barcodes, snap-aligner=1.0dev.97, snpeff=4.3.1t, solvebio, spades, staden_io_lib, star=2.6.1d, stringtie, subread, survivor, tdrmapper, tophat-recondition, trim-galore, ucsc-bedgraphtobigwig, ucsc-bedtobigbed, ucsc-bigbedinfo, ucsc-bigbedsummary, ucsc-bigbedtobed, ucsc-bigwiginfo, ucsc-bigwigsummary, ucsc-bigwigtobedgraph, ucsc-bigwigtowig, ucsc-fatotwobit, ucsc-gtftogenepred, ucsc-liftover, ucsc-wigtobigwig, umis, vardict, vardict-java, variantbam, varscan, vcfanno, vcflib, verifybamid2, viennarna, vqsr_cnn, vt, wham, anaconda-client, awscli, bzip2, ncurses, nodejs, p7zip, readline, s3gof3r, xz, perl-app-cpanminus, perl-archive-extract, perl-archive-zip, perl-bio-db-sam, perl-cgi, perl-dbi, perl-encode-locale, perl-file-fetch, perl-file-sharedir, 
perl-file-sharedir-install, perl-ipc-system-simple, perl-lwp-protocol-https, perl-lwp-simple, perl-statistics-descriptive, perl-time-hires, perl-vcftools-vcf, bioconductor-annotate, bioconductor-apeglm, bioconductor-biocgenerics, bioconductor-biocinstaller, bioconductor-biocstyle, bioconductor-biostrings, bioconductor-biovizbase, bioconductor-bsgenome.hsapiens.ucsc.hg19, bioconductor-bsgenome.hsapiens.ucsc.hg38, bioconductor-bubbletree, bioconductor-cn.mops, bioconductor-copynumber, bioconductor-degreport, bioconductor-deseq2, bioconductor-dexseq, bioconductor-dnacopy, bioconductor-genomeinfodbdata, bioconductor-genomicranges, bioconductor-iranges, bioconductor-limma, bioconductor-rtracklayer, bioconductor-snpchip, bioconductor-titancna, bioconductor-vsn>=3.50.0, r-base, r-basejump=0.7.2, r-bcbiornaseq>=0.2.7, r-cghflasso, r-chbutils, r-devtools, r-dplyr, r-dt, r-ggdendro, r-ggplot2, r-ggrepel>=0.7, r-gplots, r-gsalib, r-knitr, r-pheatmap, r-plyr, r-pscbs, r-reshape, r-rmarkdown, r-rsqlite, r-sleuth, r-snow, r-stringi, r-viridis>=0.5, r-wasabi, r=3.5.1, xorg-libxt
# Installing into conda environment dv: deepvariant
# Installing into conda environment htslib1.10: mosdepth
# Installing into conda environment python2: bismark, cpat, cutadapt=1.16, dkfz-bias-filter, gemini, gvcf-regions, hap.py, hisat2, htseq=0.9.1, lumpy-sv, manta, metasv, mirge, phylowgs, platypus-variant, sentieon, smcounter2, smoove, strelka, svtools, svtyper, theta2, tophat, vawk, vcf2db
# Installing into conda environment python3: atropos, crossmap
# Installing into conda environment r36: ataqv, bioconductor-purecn>=1.16.0
# Installing into conda environment samtools0: ericscript
Creating manifest of installed packages in /home/u035/project/software/bcbio/manifest
Third party tools upgrade complete.
Upgrading bcbio-nextgen data files
List of genomes to get (from the config file at '{'genomes': [{'dbkey': 'hg38', 'name': 'Human (hg38) full', 'indexes': ['seq', 'twobit', 'bwa', 'hisat2'], 'annotations': ['ccds', 'capture_regions', 'coverage', 'prioritize', 'dbsnp', 'hapmap_snps', '1000g_omni_snps', 'ACMG56_genes', '1000g_snps', 'mills_indels', '1000g_indels', 'clinvar', 'qsignature', 'genesplicer', 'effects_transcripts', 'varpon', 'vcfanno', 'viral', 'gnomad', 'dbnsfp'], 'validation': ['giab-NA12878', 'giab-NA24385', 'giab-NA24631', 'platinum-genome-NA12878', 'giab-NA12878-remap', 'giab-NA12878-crossmap', 'dream-syn4-crossmap', 'dream-syn3-crossmap', 'giab-NA12878-NA24385-somatic', 'giab-NA24143', 'giab-NA24149', 'giab-NA24694', 'giab-NA24695']}], 'genome_indexes': ['bwa', 'rtg'], 'install_liftover': False, 'install_uniref': False}'): Human (hg38) full
Running GGD recipe: hg38 seq 1000g-20150219_1
Running GGD recipe: hg38 bwa 1000g-20150219
Moving on to next genome prep method after trying ggd
GGD recipe not available for hg38 rtg
Downloading genome from s3: hg38 rtg
Moving on to next genome prep method after trying s3
No pre-computed indices for hg38 rtg
Preparing genome hg38 with index rtg
Running GGD recipe: hg38 ccds r20
Running GGD recipe: hg38 capture_regions 20161202
Running GGD recipe: hg38 coverage 2018-10-16
Running GGD recipe: hg38 prioritize 20181227
Running GGD recipe: hg38 dbsnp 153-20180725
Running GGD recipe: hg38 hapmap_snps 20160105
Running GGD recipe: hg38 1000g_omni_snps 20160105
Running GGD recipe: hg38 ACMG56_genes 20160726
Running GGD recipe: hg38 1000g_snps 20160105
Running GGD recipe: hg38 mills_indels 20160105
Running GGD recipe: hg38 1000g_indels 2.8_hg38_20150522
Running GGD recipe: hg38 clinvar 20190513
Running GGD recipe: hg38 qsignature 20160526
Running GGD recipe: hg38 genesplicer 2004.04.03
Running GGD recipe: hg38 effects_transcript 2017-03-16
Running GGD recipe: hg38 varpon 20181105
Running GGD recipe: hg38 vcfanno 20190119
Running GGD recipe: hg38 viral 2017.02.04
Running GGD recipe: hg38 gnomad 3
Traceback (most recent call last):
  File "/home/u035/project/software/bcbio/anaconda/bin/bcbio_nextgen.py", line 228, in <module>
    install.upgrade_bcbio(kwargs["args"])
  File "/home/u035/project/software/bcbio/anaconda/lib/python3.7/site-packages/bcbio/install.py", line 107, in upgrade_bcbio
  File "/home/u035/project/software/bcbio/anaconda/lib/python3.7/site-packages/bcbio/install.py", line 377, in upgrade_bcbio_data
  File "/home/u035/project/software/install/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/genomes.py", line 354, in install_data_local
    _prep_genomes(env, genomes, genome_indexes, ready_approaches, data_filedir)
  File "/home/u035/project/software/install/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/genomes.py", line 480, in _prep_genomes
    retrieve_fn(env, manager, gid, idx)
  File "/home/u035/project/software/install/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/genomes.py", line 875, in _install_with_ggd
    ggd.install_recipe(os.getcwd(), env.system_install, recipe_file, gid)
  File "/home/u035/project/software/install/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/ggd.py", line 30, in install_recipe
    recipe["recipe"]["full"]["recipe_type"], system_install)
  File "/home/u035/project/software/install/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/ggd.py", line 62, in _run_recipe
    subprocess.check_output(["bash", run_file])
  File "/home/u035/project/software/bcbio/anaconda/lib/python3.7/subprocess.py", line 411, in check_output
  File "/home/u035/project/software/bcbio/anaconda/lib/python3.7/subprocess.py", line 512, in run
subprocess.CalledProcessError: Command '['bash', '/home/u035/project/software/bcbio/genomes/Hsapiens/hg38/txtmp/ggd-run.sh']' returned non-zero exit status 136.
Checking required dependencies
Installing isolated base python installation
Installing mamba
Installing conda-build
Installing bcbio-nextgen
Installing data and third party dependencies
Traceback (most recent call last):
  File "bcbio_nextgen_install.py", line 290, in <module>
    main(parser.parse_args(), sys.argv[1:])
  File "bcbio_nextgen_install.py", line 51, in main
    subprocess.check_call([bcbio, "upgrade"] + _clean_args(sys_argv, args))
  File "/usr/lib64/python2.7/subprocess.py", line 542, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/home/u035/project/software/bcbio/anaconda/bin/bcbio_nextgen.py', 'upgrade', '--tooldir', '/home/u035/project/software/bcbio/tools', '--genomes', 'hg38', '--aligners', 'bwa', '--datatarget', 'variation', '--datatarget', 'gnomad', '--datatarget', 'vep', '--datatarget', 'dbnsfp', '--cores', '64', '--data']' returned non-zero exit status 1
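
The exit status 136 reported for ggd-run.sh follows the usual shell convention of 128 plus the signal number. It therefore corresponds to signal 8 (SIGFPE on Linux), which is consistent with the "Floating point exception" printed by vt normalize. A minimal sketch of decoding such a status follows; the function name describe_status is hypothetical and not part of bcbio:

```shell
# Hypothetical helper (not part of bcbio): interpret a shell exit status.
# By shell convention, a status above 128 means the process was killed
# by signal number (status - 128).
describe_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        echo "killed by signal $((status - 128))"
    else
        echo "exited with status $status"
    fi
}

describe_status 136   # signal 8 is SIGFPE on Linux
describe_status 1     # an ordinary non-zero exit, as in the outer traceback
```

This only explains why the two tracebacks report different statuses (136 from the recipe script, 1 from the bcbio upgrade wrapper); it does not say anything about the cause of the crash itself.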

ameynert commented 3 years ago

This appears to be the same issue as #3328. I've swapped in the 1.2.0 vt and vcfstream executables, which I have on another system entirely, and they work fine.

roryk commented 3 years ago

Thank you so much @ameynert, and sorry for being slow in responding. Let me see if I can reproduce this. We can pin to 1.2.0, but I'd rather either get 1.2.3 fixed or otherwise handle it.

ameynert commented 3 years ago

Thanks @roryk. I've managed to run the ggd recipe standalone with the swapped-in version of vt, and copied the output into hg38/variation. Is there anything else that's required for the gnomad datatarget, or can I resume my bcbio installation excluding '--datatarget gnomad'?
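
For anyone wanting to repeat this workaround, it could look roughly like the sketch below. The helper name and every path in the usage comments are hypothetical placeholders, and the sketch assumes the recipe script finds vt through PATH rather than through a hard-coded location:

```shell
# Hypothetical helper: run a recipe script with an alternative tool
# directory placed first on PATH, so that the vt in that directory is
# found before the one shipped with the bcbio install.
run_with_tools() {
    tool_dir=$1
    recipe=$2
    PATH="$tool_dir:$PATH" bash "$recipe"
}

# Example usage (placeholder paths, adjust to your own install):
#   cd /path/to/bcbio/genomes/Hsapiens/hg38/txtmp
#   run_with_tools /path/to/working-vt/bin ggd-run.sh
#   cp variation/gnomad_genome.vcf.gz ../variation/
#   (also copy any index file the recipe produced alongside it)
```

Prepending to PATH for a single command leaves the rest of the bcbio environment untouched, so the swapped-in vt only affects this one recipe run.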

roryk commented 3 years ago

Thanks, that should work, yes. Thanks so much for figuring out what's going on. I'm pretty sure this problem is related to incompatibilities between htslib 1.9 and htslib 1.10. Bioconda is in an in-between state right now and some tools are broken on specific versions. I'm working on it now.

roryk commented 3 years ago

Thanks, I can't seem to reproduce this. Could you let me know what version of samtools is installed?

ameynert commented 3 years ago

Version: 1.10 (using htslib 1.10.2)

It's a clean system with only the bcbio install, no other samtools/htslib installed.
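
For reference, one way to pull the htslib version out of samtools output is sketched below. It assumes the two-line header printed by samtools --version ("samtools 1.10" followed by "Using htslib 1.10.2"); the function name htslib_version is hypothetical:

```shell
# Hypothetical helper: read `samtools --version` output on stdin and
# print the htslib version it reports.
htslib_version() {
    awk '$1 == "Using" && $2 == "htslib" { print $3; exit }'
}

# With samtools installed: samtools --version | htslib_version
# Demonstration on the header format assumed above:
printf 'samtools 1.10\nUsing htslib 1.10.2\n' | htslib_version
```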

roryk commented 3 years ago

Thanks. What does bcbio_conda list vt show?

roryk commented 3 years ago

Ok, I pinned our bcbio install to samtools 1.9, since most of bioconda is built against htslib 1.9 right now. We have a separate htslib1.10 environment for tools/updates that depend on it, so that should fix problems with htslib, but I couldn't make this bug happen with 1.9 or 1.10.

roryk commented 3 years ago

Ok, it was an htslib 1.10 issue. I ended up being able to reproduce it. If you run

bcbio_nextgen.py upgrade -u development --tools

it will pull in samtools 1.9, and this issue should be resolved. Thank you!