yinbinqiu closed this issue 2 years ago
@yinbinqiu If you are checking human samples, the reference panel files in the resource directory (in this repo) should be sufficient. It contains reference panel files for both hg19 and hg38 coordinates. Let me know if this doesn't work.
Thanks for your reply. I understand the use of the ReferencePanel.vcf.gz file and will try it next. Suppose I have 200 samples, each with SNP calls in a VCF file. When using VerifyBamID2, do I need to merge these VCF files, and if so, how? Do the BAM files also need to be merged? In short, how should the VCF and BAM files be organized for each run?
@yinbinqiu Were the VCF files obtained from an external source, e.g. chip data, on the same set of samples? If so, it's recommended to use VB1 directly, which has a known-genotype (external VCF genotype) mode. Otherwise, your VCF should be generated in a later step, after ruling out potential contamination events in your BAM files. VB2 considers one BAM at a time, and merging the BAMs is not recommended. If you want to build the reference panel resources from your 200 samples' VCF files, you can refer to the 1000g or TOPMed pipeline to produce high-quality VCFs, and then come back to the README page to run VB2 and construct your customized resource files. Does this answer your question?
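For the known-genotype mode mentioned above, a minimal sketch of a VB1 call might look like the following. It assumes the VB1 binary is on `PATH` as `verifyBamID`, that an external genotype VCF exists (the file names here are hypothetical placeholders), and that the `--best` option is wanted to look for the best-matching individual in the VCF. The script only prints the command so it can be reviewed first; check the flags against the VB1 documentation for your version.

```shell
#!/bin/sh
# Hypothetical inputs: an external genotype VCF (e.g. chip data) and one BAM.
VCF="${VCF:-/xx/xx/external_genotypes.vcf.gz}"
BAM="${BAM:-/xx/xx/sample001.bam}"
OUT="${OUT:-sample001.vb1}"

# Print a VB1 command line for one BAM against the external genotypes.
# Assumption: the VB1 binary is installed as "verifyBamID".
vb1_cmd() {
    echo verifyBamID --vcf "$1" --bam "$2" --out "$3" --verbose --ignoreRG --best
}

vb1_cmd "$VCF" "$BAM" "$OUT"
```

To actually run it, drop the `echo` inside `vb1_cmd`. The function is called once per BAM, in line with the one-BAM-at-a-time advice.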
Hi, your explanation made it clearer to me that building my own resource files is a preparation step that starts from the VCF files, so for now I am trying the resource files provided with VB2 (in the VerifyBamID/resource/ folder) first. Sorry, I may not have been clear enough earlier. I have a batch of samples from multiple individuals (the individuals are unrelated, but there are tumor and normal samples from the same individual). We sequenced them with next-generation sequencing, called SNPs with a common pipeline, and obtained VCF files. For this batch, I want to use VB or VB2 to detect whether any samples are mixed up or contaminated.
$(VERIFY_BAM_ID_HOME)/bin/VerifyBamID --BamFile /xx/xx/sample001.bam --UDPath /xx/xx/1000g.phase3.10k.b37.vcf.gz.dat.UD --BedPath /xx/xx/1000g.phase3.10k.b37.vcf.gz.dat.bed --MeanPath /xx/xx/1000g.phase3.10k.b37.vcf.gz.dat.mu --Reference /xx/xx/hg19.fa
Is it appropriate to use it like the above? (Here sample001.bam is the BAM file of one sample, obtained after alignment, sorting, duplicate removal, etc.)
Yes, you can use this cmdline. Here is a simplified version: "--SVDPrefix /xx/xx/1000g.phase3.10k.b37.vcf.gz.dat" replaces "--UDPath", "--MeanPath" and "--BedPath".
Thanks.
Is this cmdline only for a single sample? How do I know which samples are mixed with each other?
Yes, this is for a single sample. Theoretically, when a high-level contamination event occurs, we can output two genotype VCFs, one for the intended sample and one for the contaminating sample (assuming only two samples are involved). You can then match these genotype VCFs against the genotype VCFs of the other samples in your set. In most contamination events, however, the contamination level is not high enough in practice to give accurate genotype calls for the contaminating sample.
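Because each run handles one BAM, a batch such as the 200 samples mentioned earlier can be processed with a simple loop. The following is a minimal sketch, not an official workflow: the directory layout, the `OUT_DIR` name, and the per-sample output prefix are assumptions, and `--Output` is used here as the output prefix option, so check it against the README for your version. By default the script only prints the commands (`DRY_RUN=1`).

```shell
#!/bin/sh
# Hypothetical locations; adjust to your own layout.
VERIFY_BAM_ID_HOME="${VERIFY_BAM_ID_HOME:-/xx/xx/VerifyBamID}"
SVD_PREFIX="${SVD_PREFIX:-/xx/xx/1000g.phase3.10k.b37.vcf.gz.dat}"
REFERENCE="${REFERENCE:-/xx/xx/hg19.fa}"
BAM_DIR="${BAM_DIR:-/xx/xx}"
OUT_DIR="${OUT_DIR:-vb2_out}"

# Print the command when DRY_RUN=1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Run VB2 on a single BAM; the output prefix is derived from the file name.
vb2_one() {
    sample=$(basename "$1" .bam)
    run "$VERIFY_BAM_ID_HOME/bin/VerifyBamID" \
        --SVDPrefix "$SVD_PREFIX" \
        --Reference "$REFERENCE" \
        --BamFile "$1" \
        --Output "$OUT_DIR/$sample"
}

# Create the output directory only for a real run.
[ "${DRY_RUN:-1}" = 1 ] || mkdir -p "$OUT_DIR"

# One VB2 invocation per BAM; BAMs are never merged.
for f in "$BAM_DIR"/*.bam; do
    [ -e "$f" ] || continue   # skip when the glob matches nothing
    vb2_one "$f"
done
```

Setting `DRY_RUN=0` executes the commands instead of printing them. Each sample then gets its own set of output files, so contamination estimates can be compared across the batch afterwards.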
So when the contamination level is relatively low, VB or VB2 cannot identify the source of the contamination (e.g. report that sample A's contamination comes from sample B, and that sample C's contamination also comes from sample B)? How high does the contamination level need to be for the source to be identified? Also, I'm still a bit confused by the README.md on GitHub: how do I build commands to run between two or more samples when the contamination level is unknown (when I suspect a sample is highly contaminated and want to try VB first)? Another question: how do you distinguish between nucleotide polymorphism and contamination?
We didn't explore that option. VB2 estimates the contamination level in the BAM file of a given sample; it doesn't require a candidate contaminating source to be provided. (For your case, VB1 provides an option to compare the current BAM with candidate source genotypes.) When contamination occurs, the allele fraction drifts away from 0, 0.5, and 1 for Hom_ref, Het, and Hom_alt genotypes, respectively.
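The drift described above can be illustrated with a simple two-sample mixture. This is a simplified sketch that ignores sequencing error and is not necessarily the exact likelihood model used by VB2. The symbols are assumptions introduced for illustration: $\alpha$ is the contamination fraction, and $g_1$ and $g_2$ are the alternate-allele counts of the intended and contaminating samples at a site.

```latex
\mathbb{E}[\text{alt allele fraction}]
  = (1-\alpha)\,\frac{g_1}{2} + \alpha\,\frac{g_2}{2},
  \qquad g_1, g_2 \in \{0, 1, 2\}

% Worked values for \alpha = 0.1:
\begin{aligned}
g_1 = 0,\ g_2 = 2 &: \quad 0.9 \cdot 0 + 0.1 \cdot 1 = 0.10 \\
g_1 = 1,\ g_2 = 0 &: \quad 0.9 \cdot 0.5 + 0.1 \cdot 0 = 0.45 \\
g_1 = 2,\ g_2 = 0 &: \quad 0.9 \cdot 1 + 0.1 \cdot 0 = 0.90
\end{aligned}
```

In an uncontaminated sample, a true polymorphism gives allele fractions near 0, 0.5, or 1 at each site. Under this sketch, contamination instead shifts the fractions by a similar amount at many sites at once, which is what allows it to be told apart from ordinary variation.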
Thanks for your reply.
Hi, I have a batch of data that I need to test for contamination. How do I prepare the ReferencePanel.vcf.gz file?