bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License
985 stars 353 forks source link

VerifyBamID failing for exomes #3083

Open ameynert opened 4 years ago

ameynert commented 4 years ago

There aren't enough variants in the 1000g.phase3.100k set for VerifyBamID to work properly for quite a few of the exome samples I've been running. I've generated my own panel using common variants in the CCDS regions; this works fine across all the exome kits I've tested so far, but I have to swap the files in the VerifyBamID resource directory when I go between running the pipeline on whole genomes and exomes. Could the panel used for VerifyBamID be specified in the YAML configuration file?

qc/contamination.py:

    background = {"dataset": "1000g.phase3", # this could get pulled from the config file as well
                  "nvars": "100k",
                  "build":"b38" if dd.get_genome_build(data) == "hg38" else "b37"}
roryk commented 4 years ago

Hi @ameynert,

Sorry about that and sorry for the slow delay in responding. Could we do one better and just use your regions when we are running exomes? Could you pass on that file?

ameynert commented 4 years ago

Hi @roryk Sure - these are based on the TWIST kit regions, which are primarily CCDS, so should be applicable to virtually all exome kits. I've bundled the files into dropbox https://www.dropbox.com/s/kud46rv3chx4h2a/verifybamid_exome.tar.gz?dl=0 - let me know if there are problems with the link.

naumenko-sa commented 4 years ago

Hi @ameynert !

Thanks for contributing the file and sorry that we are stalling here, let us clarify this use case and see how can we move forward.

https://github.com/Griffan/VerifyBamID is a tool to detect DNA contamination and it ships its own resources: 1000g.b38/b37, hgdp.b37/b38 - a relatively small dataset of 56M delivered with the package.

You created a bigger resource dataset for hg38 (2.6G) which works better for a variety of exome panels. This is also based on 1000G data. Does is have more markers? How was it generated? Could you please describe some more, as users would want to know, or post the exact generation script? We could add this to the docs.

How would you comment on the memory usage: https://github.com/Griffan/VerifyBamID#need-more-markers. Did you measure memory usage with the new dataset?

Then we have to:

Would you be able to contribute to steps 2-4?

Sergey

ameynert commented 4 years ago

Hi Sergey,

Generation steps below. Hopefully that's enough - just missing my chr prefix map file.

Cheers, Alison

mkdir 1000G_phase3_hg38
cd 1000G_phase3_hg38
for ((i = 1; i <= 22; i = i + 1))
do
  wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20190312_biallelic_SNV_and_INDEL/ALL.chr${i}.shapeit2_integrated_snvindels_v2a_27022019.GR\
Ch38.phased.vcf.gz;
  wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20190312_biallelic_SNV_and_INDEL/ALL.chr${i}.shapeit2_integrated_snvindels_v2a_27022019.GR\
Ch38.phased.vcf.gz.tbi
done
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20190312_biallelic_SNV_and_INDEL/ALL.chrX.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.\
phased.vcf.gz
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20190312_biallelic_SNV_and_INDEL/ALL.chrX.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.\
phased.vcf.gz.tbi
cd ..

# I can't find my chr prefix map file, but it's just no prefix to prefix for autosomes + X

for file in 1000G_phase3_hg38/*vcf.gz
do
  bname=`basename $file`
  bcftools view -R Twist_Exome_Target_hg38.bed -m2 -M2 -v snps -i 'AF >= 0.01' $file | bcftools annotate --rename-chrs chr_prefix_map.txt | bgzip -c > ${bname%.vcf.gz}.biallelic.snps.m\
inAF0.01.vcf.gz
  tabix ${bname%.vcf.gz}.biallelic.snps.minAF0.01.vcf.gz
done

bcftools concat -o ALL.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.chr.biallelic.snps.minAF0.01.vcf.gz -O z \
  ALL.chr[1-9].shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.chr.biallelic.snps.minAF0.01.vcf.gz \
  ALL.chr[12][0-9].shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.chr.biallelic.snps.minAF0.01.vcf.gz \
  ALL.chrX.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.chr.biallelic.snps.minAF0.01.vcf.gz
tabix ALL.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.chr.biallelic.snps.minAF0.01.vcf.gz

verifybamid2-1.0.6-0/VerifyBamID \
  --RefVCF ALL.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.chr.biallelic.snps.minAF0.01.vcf.gz
  --Reference bcbio-1.1.5/genomes/Hsapiens/hg38/seq/hg38.fa

mv ALL.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.chr.biallelic.snps.minAF0.01.vcf.gz.bed 1000g.phase3.100k.b38.vcf.gz.dat.bed
mv ALL.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.chr.biallelic.snps.minAF0.01.vcf.gz.mu 1000g.phase3.100k.b38.vcf.gz.dat.mu
mv ALL.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.chr.biallelic.snps.minAF0.01.vcf.gz.PC 1000g.phase3.100k.b38.vcf.gz.dat.V
mv ALL.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.chr.biallelic.snps.minAF0.01.vcf.gz.UD 1000g.phase3.100k.b38.vcf.gz.dat.UD
naumenko-sa commented 4 years ago

Thanks Alison!

We need to push it as we see some errors like this in some samples with the default reference.

[2020-06-19T13:07Z] VerifyBamID contamination checks
[2020-06-19T13:07Z] VerifyBamID2: A robust tool for DNA contamination estimation from sequence reads using ancestry-agnostic method.
[2020-06-19T13:07Z]  Version:1.0.6
[2020-06-19T13:07Z]  Copyright (c) 2009-2018 by Hyun Min Kang and Fan Zhang
[2020-06-19T13:07Z]  This project is licensed under the terms of the MIT license.
[2020-06-19T13:07Z] The following parameters are available.  Ones with "[]" are in effect:
[2020-06-19T13:07Z] Available Options
... lines  skipped ...
[2020-06-19T13:07Z]                Model Selection Options : --WithinAncestry,
[2020-06-19T13:07Z]                                          --DisableSanityCheck, --NumPC [2],
[2020-06-19T13:07Z]                                          --FixPC [Empty],
[2020-06-19T13:07Z]                                          --FixAlpha [-1.0e+00],
[2020-06-19T13:07Z]                                          --KnownAF [Empty], --NumThread [4],
[2020-06-19T13:07Z]                                          --Seed [12345], --Epsilon [1.0e-08],
[2020-06-19T13:07Z]                                          --OutputPileup, --Verbose
[2020-06-19T13:07Z]    Construction of SVD Auxiliary Files : --RefVCF [Empty]
[2020-06-19T13:07Z]                     Deprecated Options : --UDPath [Empty], --MeanPath [Empty],
[2020-06-19T13:07Z]                                          --BedPath [Empty]
[2020-06-19T13:07Z] Initialize from FullLLKFunc(int dim, ContaminationEstimator* contPtr)
[2020-06-19T13:07Z] NOTICE - Number of marker in Reference Matrix:100000
[2020-06-19T13:07Z] NOTICE - Number of marker shared with input file:7161
[2020-06-19T13:07Z] NOTICE - Mean Depth:1.017316
[2020-06-19T13:07Z] NOTICE - SD Depth:0.404874
[2020-06-19T13:07Z] NOTICE - 7116 SNP markers remained after sanity check.
[2020-06-19T13:07Z] WARNING -
[2020-06-19T13:07Z] Insufficient Available markers, check input bam depth distribution in output pileup file after specifying --OutputPileup
[2020-06-19T13:07Z] Uncaught exception occurred
Traceback (most recent call last):
  File "/bcbio/anaconda/lib/python3.6/site-packages/bcbio/provenance/do.py", line 26, in run
    _do_run(cmd, checks, log_stdout, env=env)
  File "/bcbio/anaconda/lib/python3.6/site-packages/bcbio/provenance/do.py", line 106, in _do_run
    raise subprocess.CalledProcessError(exitcode, error_msg)
subprocess.CalledProcessError: Command 'verifybamid2 1000g.phase3 100k b37 --Reference /bcbio/genomes/Hsapiens/hg19/seq/hg19.fa ...

S

ameynert commented 4 years ago

That's the same type of error I was getting, though minus the uncaught exception. It was just failing silently for me and I was getting MultiQC reports with no entries for contamination estimates for most samples. This was on bcbio 1.1.5.