10XGenomics / vartrix

Single-Cell Genotyping Tool
MIT License

work with 2-strain vcf #36

Closed CarnoZhao closed 4 years ago

CarnoZhao commented 4 years ago

Hi, I'm working on an allele-specific expression analysis in which neither of the two strains is the reference strain (mouse: PWK * C57). That is, both of them have SNP sites compared with the mm10 reference genome.

I constructed my own VCF file from the all-strain VCF file from here, using the PWK base as REF and the C57 base as ALT:

my.vcf:
#     pwk   c57
# ... REF   ALT ...
      A     C 
      ...

My problem arises when PWK and mm10 differ at a site and my BAM read matches PWK:

(ref) pwk: .....A.....
(alt) c57: .....C.....
(fa) mm10: .....C.....
(bam_read): ....A....

Using the original vartrix, the ref_hap will be ...C... (from fa-mm10), and the alt_hap will be ...C... (from alt-c57). The read then maps to ref_hap and alt_hap with the same score, so it is classified as UNKNOWN. As a result, I get many errors like "Variant at index 4631 has multiple unknown reads at barcode index 12943".

So, my solution is to create ref_hap directly from the VCF REF base instead of from the fasta reference. I have built my own main.rs with this change, and it works fine for my 2-strain problem.
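The tie described above can be reproduced with a small sketch. It is an illustration only, not vartrix's code: the function names, the 11-base window, and the `use_vcf_ref` switch are hypothetical, and a plain mismatch count stands in for whatever alignment scoring vartrix actually uses.

```python
def build_haplotypes(fasta_window, offset, ref, alt, use_vcf_ref=False):
    """Return (ref_hap, alt_hap) for a variant at `offset` in `fasta_window`.

    With use_vcf_ref=False the reference haplotype is taken from the fasta,
    as in the behaviour described in this thread. With use_vcf_ref=True the
    VCF REF base is substituted instead (the workaround proposed above).
    """
    prefix = fasta_window[:offset]
    suffix = fasta_window[offset + len(ref):]
    ref_allele = ref if use_vcf_ref else fasta_window[offset:offset + len(ref)]
    return prefix + ref_allele + suffix, prefix + alt + suffix


def mismatches(read, hap):
    """Count mismatching positions (a simplified stand-in for alignment scoring)."""
    return sum(a != b for a, b in zip(read, hap))


def classify(read, ref_hap, alt_hap):
    """Assign the read to the better-matching haplotype, or 'unknown' on a tie."""
    r, a = mismatches(read, ref_hap), mismatches(read, alt_hap)
    if r < a:
        return "ref"
    if a < r:
        return "alt"
    return "unknown"


# Hypothetical window: mm10 and C57 carry C at index 5, PWK carries A.
MM10_WINDOW = "ACGTTCGGATC"
PWK_READ = "ACGTTAGGATC"

# REF=A (PWK), ALT=C (C57), as in my.vcf above.
fasta_based = build_haplotypes(MM10_WINDOW, 5, "A", "C")
vcf_based = build_haplotypes(MM10_WINDOW, 5, "A", "C", use_vcf_ref=True)

print(classify(PWK_READ, *fasta_based))  # both haplotypes are identical, so this is a tie
print(classify(PWK_READ, *vcf_based))    # the read now matches ref_hap exactly
```

In the fasta-based case both haplotypes carry C, so no read can distinguish them, which is why a VCF whose REF column disagrees with the fasta produces unknown calls.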

pmarks commented 4 years ago

Hi @CarnoZhao - thanks for your interest.

One general point: the "REF" column of a VCF must match the fasta file -- any tool that uses FASTA/VCF/BAM files will expect that to be true. You cannot redefine PWK to be REF. (It is technically possible, but you would need to change the mm10 fasta to contain all the PWK alleles -- this is probably not what you want to do.)

The correct approach is to make a 'multi-sample' VCF, with one column for PWK and one column for C57. A multi-sample VCF might look like this:

REF   ALT   PWK   C57
C     A     1/1   0/0
G     T     0/0   1/1

The first line is the example you gave, and the 2nd line is another SNP that is specific to C57. vartrix will give you counts for both variants, but you will need to keep track of which variants are specific to which samples separately.
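One way to keep track of which variants belong to which strain is to read the genotype columns of the multi-sample VCF and record the single strain that carries the ALT allele. The sketch below is a minimal illustration under several assumptions: the function name is hypothetical, the input is tab-separated with GT as the first FORMAT field, and the positions 1000 and 2000 are placeholders, since the example above gives no coordinates.

```python
def strain_specific(vcf_lines, strains=("PWK", "C57")):
    """Map (chrom, pos, ref, alt) to the only strain carrying ALT, else None."""
    header = None
    result = {}
    for line in vcf_lines:
        if line.startswith("##") or not line.strip():
            continue  # skip meta lines and blank lines
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            header = fields
            continue
        rec = dict(zip(header, fields))
        carriers = []
        for strain in strains:
            # GT is assumed to be the first colon-separated sample field.
            gt = rec[strain].split(":")[0].replace("|", "/").split("/")
            if "1" in gt:
                carriers.append(strain)
        key = (rec["#CHROM"], int(rec["POS"]), rec["REF"], rec["ALT"])
        result[key] = carriers[0] if len(carriers) == 1 else None
    return result


# Placeholder coordinates; REF/ALT and genotypes follow the example above.
EXAMPLE_VCF = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tPWK\tC57",
    "1\t1000\t.\tC\tA\t.\t.\t.\tGT\t1/1\t0/0",
    "1\t2000\t.\tG\tT\t.\t.\t.\tGT\t0/0\t1/1",
]

for variant, strain in strain_specific(EXAMPLE_VCF).items():
    print(variant, strain)
```

The resulting lookup can then be joined against the variant order of the vartrix output, so that ALT counts at a PWK-specific site are read as PWK evidence and REF counts at that site as C57 evidence, and vice versa.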

CarnoZhao commented 4 years ago

Thanks for replying! @pmarks

BTW, what if both strains differ from the reference fasta, e.g. mm10: A, pwk: G, c57: C? This case will be treated as multi-allelic, right?

pmarks commented 4 years ago

@CarnoZhao - correct. vartrix ignores multi-allelic VCF entries like this:

#CHROM  POS     ID    REF ALT
1       1581713 .     A   C,G

You can work around this limitation by expanding variants like this to be separate entries for each allele:

#CHROM  POS     ID    REF ALT
1       1581713 .     A   C
1       1581713 .     A   G
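The expansion above can be sketched in a few lines. This is a minimal illustration, assuming tab-separated, sites-only records: the function name is hypothetical, and genotype columns or per-allele INFO fields are copied through unchanged, so they would need separate handling in a real file.

```python
def split_multiallelic(line):
    """Expand one VCF data line into one line per ALT allele.

    Header and meta lines (starting with '#') are returned unchanged.
    """
    if line.startswith("#"):
        return [line]
    fields = line.rstrip("\n").split("\t")
    alts = fields[4].split(",")  # ALT is the fifth VCF column
    return ["\t".join(fields[:4] + [alt] + fields[5:]) for alt in alts]


records = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "1\t1581713\t.\tA\tC,G",
]

for record in records:
    for out_line in split_multiallelic(record):
        print(out_line)
```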