Griffan / VerifyBamID

VerifyBamID2: A robust tool for DNA contamination estimation from sequence reads using an ancestry-agnostic method.
http://griffan.github.io/VerifyBamID/

ReferencePanel.vcf.gz #40

Closed: yinbinqiu closed this issue 2 years ago

yinbinqiu commented 2 years ago

Hi, I have a batch of data that I need to test for contamination. How do I prepare the ReferencePanel.vcf.gz file?

Griffan commented 2 years ago

@yinbinqiu If you are checking human samples, the reference panel files in the resource directory (in this repo) should be sufficient. It contains reference panel files for both hg19 and hg38 coordinates. Let me know if this doesn't work.

yinbinqiu commented 2 years ago

Thanks for your reply. I understand how the ReferencePanel.vcf.gz file is used and will try it next. Suppose I have 200 samples, each with SNP calls (VCF files). When using VerifyBamID2, do I need to merge these VCF files, and if so, how? Do the BAM files also need to be merged? In short, how should the VCF and BAM inputs be organized for each run?

Griffan commented 2 years ago

@yinbinqiu Are the VCF files obtained from an external source, e.g. chip data, on the same set of samples? If so, I recommend using VB1 directly, which has a known-genotype (external VCF genotype) mode. Otherwise, your VCFs should be generated in a later step, after ruling out potential contamination in your BAM files. VB2 considers one BAM at a time, and merging the BAMs is not recommended. If you want to build reference panel resources from your 200 samples' VCF files, you can follow the 1000g or TOPMed pipeline to produce high-quality VCFs, then come back to the README to run VB2 and construct your customized resource files. Does this answer your question?

yinbinqiu commented 2 years ago

Hi, your explanation made it clearer to me that preparing VCF files is only needed for building my own resource files, so for now I am trying the resource files provided with VB2 (in the VerifyBamID/resource/ folder) first. Sorry, I may not have been clear enough earlier. I have a batch of samples from multiple unrelated individuals, including tumor and normal samples from the same individual. We sequenced them with next-generation sequencing, called SNPs with a common pipeline, and obtained VCF files. We want to use VB or VB2 to detect whether any samples in this batch are mixed or contaminated.

$(VERIFY_BAM_ID_HOME)/bin/VerifyBamID --BamFile /xx/xx/sample001.bam --UDPath /xx/xx/1000g.phase3.10k.b37.vcf.gz.dat.UD --BedPath /xx/xx/1000g.phase3.10k.b37.vcf.gz.dat.bed --MeanPath /xx/xx/1000g.phase3.10k.b37.vcf.gz.dat.mu --Reference /xx/xx/hg19.fa

Is it appropriate to use it as above? (Here sample001.bam is the BAM file for one of the samples after alignment, sorting, de-duplication, etc.)

Griffan commented 2 years ago

Yes, you can use this cmdline. Here is a simplified version of it: "--SVDPrefix /xx/xx/1000g.phase3.10k.b37.vcf.gz.dat" replaces "--UDPath", "--MeanPath" and "--BedPath".
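Since VB2 considers one BAM at a time, a batch can be handled by looping over the BAM files. The following is only a sketch under stated assumptions, not an official wrapper: the function name `run_vb2`, the variables `VB2_BIN`, `SVD_PREFIX`, `REFERENCE` and `OUT_ROOT`, and the per-sample directory layout are all hypothetical, and it assumes the tool writes its output relative to the working directory and that BAM paths are given as absolute paths. Only the flags quoted in this thread are used.

```shell
#!/bin/sh
# Hypothetical settings; adjust to your own installation and paths.
VB2_BIN="${VB2_BIN:-VerifyBamID}"                               # assumed to be on PATH
SVD_PREFIX="${SVD_PREFIX:-/xx/xx/1000g.phase3.10k.b37.vcf.gz.dat}"
REFERENCE="${REFERENCE:-/xx/xx/hg19.fa}"
OUT_ROOT="${OUT_ROOT:-vb2_out}"                                 # hypothetical output root

# Run VB2 on a single BAM (absolute path) inside its own directory,
# so that outputs from different samples cannot overwrite each other.
run_vb2() {
  bam="$1"
  sample="$(basename "$bam" .bam)"
  mkdir -p "$OUT_ROOT/$sample" || return 1
  (
    cd "$OUT_ROOT/$sample" || exit 1
    "$VB2_BIN" --SVDPrefix "$SVD_PREFIX" --Reference "$REFERENCE" --BamFile "$bam"
  )
}

# One independent run per BAM given on the command line; BAMs are never merged.
for bam in "$@"; do
  run_vb2 "$bam" || echo "VerifyBamID failed for $bam" >&2
done
```

Saved as, say, `run_batch.sh` (a hypothetical name), this could be invoked as `sh run_batch.sh /xx/xx/sample001.bam /xx/xx/sample002.bam`. Setting `VB2_BIN=echo` prints the arguments instead of running the tool, which is a cheap way to check the command lines first.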

yinbinqiu commented 2 years ago

Thanks.

yinbinqiu commented 2 years ago

Is this cmdline for a single sample only? How do I know which samples are mixed with each other?

Griffan commented 2 years ago

Yes, this is for a single sample. In theory, when a high-level contamination event occurs, we could output two genotype VCFs, one for the intended sample and one for the contaminating sample (assuming only two samples are involved). You could then match these genotype VCFs against the genotype VCFs of the other samples in your set. In practice, though, the contamination level in most events is not high enough to give accurate genotype calls for the contaminating sample.

yinbinqiu commented 2 years ago

So when the contamination level is relatively low, VB or VB2 cannot identify the source of the contamination (e.g. report that sample A's contamination comes from sample B and sample C's contamination also comes from sample B). How high does the contamination level need to be for the source to be identified? Also, I'm still a bit confused by the README.md on GitHub: how do I build commands to run between two or more samples when the contamination level is unknown (for example, when I suspect a sample is highly contaminated and want to try VB first)? Another question: how do you distinguish between nucleotide polymorphism and contamination?

Griffan commented 2 years ago

We didn't explore that option. VB2 estimates the contamination level in the BAM file of a given sample; it doesn't require a candidate contaminating source to be provided. (For your case, VB1 provides an option to compare the current BAM with candidate source genotypes.) When contamination occurs, the allele fraction drifts away from 0, 0.5 and 1 for Hom_ref, Het and Hom_alt respectively.
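To make the allele-fraction drift concrete, here is a small sketch of the expected alternate-allele fraction in a simple two-sample mixture. It is an illustration only, not VB2's actual model: it ignores sequencing error and reference bias, and the function name `expected_alt_fraction` is hypothetical. With contamination level a, intended genotype g1 and contaminant genotype g2 (each counted as 0, 1 or 2 alternate alleles), the expected fraction is (1 - a) * g1/2 + a * g2/2.

```shell
#!/bin/sh
# Expected alternate-allele fraction in a two-sample mixture (simplified model).
#   $1 = contamination level a, between 0 and 1
#   $2 = alt-allele count of the intended sample    (0 = Hom_ref, 1 = Het, 2 = Hom_alt)
#   $3 = alt-allele count of the contaminating sample
expected_alt_fraction() {
  LC_ALL=C awk -v a="$1" -v g1="$2" -v g2="$3" \
    'BEGIN { printf "%.3f\n", (1 - a) * g1 / 2 + a * g2 / 2 }'
}

# With 10% contamination:
expected_alt_fraction 0.1 0 1   # Hom_ref site, Het contaminant     -> 0.050 instead of 0
expected_alt_fraction 0.1 1 0   # Het site, Hom_ref contaminant     -> 0.450 instead of 0.5
expected_alt_fraction 0.1 2 0   # Hom_alt site, Hom_ref contaminant -> 0.900 instead of 1
```

A true polymorphism in an uncontaminated sample should sit near 0, 0.5 or 1 at each site, whereas contamination shifts the fractions in a consistent way across many sites, which is the signal the reply above refers to.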

yinbinqiu commented 2 years ago

Thanks for your reply.