10XGenomics / vartrix

Single-Cell Genotyping Tool
MIT License

Calling the same physical positions between sequencing runs #60

Open gertzem opened 3 years ago

gertzem commented 3 years ago

Hi, and thanks for the project. We have a metastasis project in which we sample from different physical sites -- which we call regions -- but it is very plausible due to metastasis that mutations will be shared between regions or private to a region. The separate regions are separate sequencing runs, and we'd like to have the reference allele called in the regions in which the alt allele doesn't appear. But, while I may be missing something, I don't see any way to combine the runs into one mega-run for variant calling. The issue is that the barcodes may be reused between sites. 1) Is there already an option in vartrix which handles this kind of situation? 2) What do you suggest we do in this situation?

pmarks commented 3 years ago

Hi @gertzem,

My recommendation would be to make a unified VCF file that contains the union of all the variants that you've detected. You could do this by merging the BAM files (with samtools merge) and running your variant caller on the whole dataset, or using some tools that can merge VCF files (e.g. bcftools merge).
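To make "the union of all the variants" concrete, here is a minimal Python sketch that collects the distinct sites from several per-region VCFs. It is only an illustration: the function names `read_sites` and `union_sites` are hypothetical, it treats each ALT allele of a multi-allelic record as its own site, and it sorts contigs alphabetically rather than in reference order. For real data, bcftools merge on the compressed, indexed per-region files is the more robust route.

```python
def read_sites(vcf_text):
    """Yield (chrom, pos, ref, alt) for each record in VCF-formatted text."""
    for line in vcf_text.splitlines():
        # Skip blank lines and header lines (## metadata and the #CHROM row).
        if not line or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alts = line.split("\t")[:5]
        # Assumption: a multi-allelic record counts as one site per ALT allele.
        for alt in alts.split(","):
            yield chrom, int(pos), ref, alt


def union_sites(vcf_texts):
    """Return the sorted union of sites found across several VCF texts."""
    sites = set()
    for text in vcf_texts:
        sites.update(read_sites(text))
    return sorted(sites)
```

A site seen in only one region still ends up in the combined list, which is what allows the reference allele to be counted at that position in the other regions.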

Then run vartrix against the BAM file of each sample separately, but use the unified VCF file. That way each output of vartrix will contain quantification of the reference and alt alleles for all the variants. You'll need to do a bit of work to collate these vartrix results, while keeping track of which sample they came from, but that should be straightforward because they'll be in different outputs.
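The collation step could look roughly like the Python sketch below. It assumes each run's counts have already been loaded as a barcode list plus (variant row, barcode column, count) triples with 1-based indices, as in a Matrix Market coordinate file, and that rows follow the order of the unified VCF so they line up across runs. The function name `collate` and the `region:barcode` labelling are hypothetical choices for illustration, not part of vartrix.

```python
def collate(results):
    """Combine per-region sparse counts into one matrix with unique columns.

    `results` maps a region name to (barcodes, entries), where entries are
    (variant_row, barcode_col, count) triples using 1-based indices.
    Rows are left untouched because every run used the same unified VCF.
    """
    columns = []
    entries = []
    for region, (barcodes, triples) in results.items():
        # Columns already emitted by earlier regions shift this region's indices.
        offset = len(columns)
        # Prefix each barcode with its region so reused barcodes stay distinct.
        columns.extend(f"{region}:{barcode}" for barcode in barcodes)
        entries.extend((row, col + offset, count) for row, col, count in triples)
    return columns, entries
```

Prefixing the barcode with the region name is one simple way around barcodes being reused between sites: the same barcode sequence from two runs becomes two different column labels.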

Does that make sense?

gertzem commented 3 years ago

Thanks for your response. I think that makes sense, but let me write out my understanding of what you said to make sure. I also have a syntactic question about implementing your solution.

You are suggesting that the VCF file passed to vartrix using the -v option identifies the positions at which the ref/alt alleles will be called. Passing a merged VCF to vartrix using the -v option will not be a problem, even if the particular run does not contain the alt allele at a specific position. Vartrix will not filter that position out.

If my understanding is correct, I'd prefer to do this by merging the VCF files, rather than variant calling on a merged BAM, because I think calling on the merged BAM would unfairly disfavor calling region specific alternative alleles.

Using a merged VCF file leads to a syntactic question. I see that the VCF file I have been using contains one "sample", which happens to be the name of the grandparent directory -- the parent is "outs". The obvious way of merging, which is bcftools merge, would produce a multi-sample file. Will it confuse vartrix if there is more than one "sample" or if the sample name is not the same as the grandparent directory? Would it ignore any of the samples in a merged file?

pmarks commented 3 years ago

> You are suggesting that the VCF file passed to vartrix using the -v option identifies the positions at which the ref/alt alleles will be called. Passing a merged VCF to vartrix using the -v option will not be a problem, even if the particular run does not contain the alt allele at a specific position. Vartrix will not filter that position out.

Correct.

> If my understanding is correct, I'd prefer to do this by merging the VCF files, rather than variant calling on a merged BAM, because I think calling on the merged BAM would unfairly disfavor calling region specific alternative alleles.

There are pros & cons to each approach. Probably the most important thing is to correctly tune the variant caller parameters and do some visual QC of the calls in IGV to be sure you trust them.

> Using a merged VCF file leads to a syntactic question. I see that the VCF file I have been using contains one "sample", which happens to be the name of the grandparent directory -- the parent is "outs". The obvious way of merging, which is bcftools merge, would produce a multi-sample file. Will it confuse vartrix if there is more than one "sample" or if the sample name is not the same as the grandparent directory? Would it ignore any of the samples in a merged file?

The VCF file you give to vartrix just needs to have all the variants combined into a single list. I'm pretty sure vartrix will not mind if it's got more than one sample represented. Vartrix ignores the actual genotypes of each sample; it only cares about the set of possible SNPs/