PacificBiosciences / trgt

Tandem repeat genotyping and visualization from PacBio HiFi data
Other
106 stars 8 forks source link

Parsing MC counts vs. MS intervals in VCF rows with multiple motifs #26

Open bw2 opened 7 months ago

bw2 commented 7 months ago

Running TRGT v0.8 on a custom catalog that includes loci with several adjacent motifs, I found this row in the output vcf throws off my parsing script:

chr1    119049087   .   AAAAAATAAAATAAAATAAAATA .   0   .   TRID=1-119049089-119049109-AAAAT-intron;END=119049109;MOTIFS=A,AAAAT;STRUC=(A)n(AAAAT)n GT:AL:ALLR:SD:MC:MS:AP:AM   0/0:23,23:23-24,22-24:18,28:3_4,3_4:0(0-2)_1(2-22)_0(22-23),0(0-2)_1(2-22)_0(22-23):1.000000,1.000000:.,.

In that row,

STRUC=(A)n(AAAAT)n
MC=3_4,3_4
MS=0(0-2)_1(2-22)_0(22-23),0(0-2)_1(2-22)_0(22-23)

IIUC, the MS field represents the 1st haplotype as AA|AAAAT.AAAAT.AAAAT.AAAAT|A. but the reuse of the 0th node in the path (ie. (A)n) at the end of the sequence makes parsing more complicated. Also, this means I can't rely on the MC value to understand the haplotype sequence since MC would be the same if TRGT had instead partitioned the sequence as AAA|AAAAT.AAAAT.AAAAT.AAAAT

I think it would be more intuitive if the output was either

STRUC=(A)n(AAAAT)n(A)n
MC=2_4_1,2_4_1
MS=0(0-2)_1(2-22)_2(22-23),0(0-2)_1(2-22)_2(22-23)

or, if the STRUC is kept the same, then dropping the last node from the path

STRUC=(A)n(AAAAT)n
MC=2_4,2_4
MS=0(0-2)_1(2-22),0(0-2)_1(2-22)

Not sure how common or significant this issue is, but it seems to make parsing more challenging compared to ExpansionHunter outputs. Ideally, there would be a simple way to get the copy numbers and confidence intervals for each motif in the STRUC.

egor-dolzhenko commented 7 months ago

Yes, agreed. We had many discussions about the best way to define these fields. We eventually decided that MC will contain the total count of the corresponding motif (even if some motif copies do not occur contiguously), while MS would provide a more detailed information about which segment(s) of the allele correspond to a given motif. Maybe a good path forward would be to add more fields that summarize the motif counts and spans in different ways?

bw2 commented 7 months ago

I think that would help.. what I'm trying to get to is a table where each row is a repeat locus, and which I could merge across samples to then look for outliers. Even though MS provides the most detailed representation of each haplotype, it's difficult to merge across samples if one sample has lets say (haploid) genotype MS=0(0-5)_1(5-15)_0(15-25) , while another has MS=0(0-5)_1(5-25). For ExansionHunter, merging on locusId+variantId as the row key gave me a table with copy numbers for each atomic locus/motif. To do something similar for TRGT vcfs I'm thinking to either use MC counts or specify all loci as single motif (eg. (GAA)n rather than (A)n(GAA)n).

egor-dolzhenko commented 7 months ago

Yes, agreed, it's difficult to merge on the MS field. Are you thinking about doing something like this using the MC field?

repeat motifs sampleA sampleB
rep1 AT 2 3
rep1 CAG 5 15
rep2 CAG 10 12

Note that since rep1 is composite (say (AT)n(CAG)n), it got split into two rows. Also, this table could either have diploid genotypes or just store the longest allele if this simplification is acceptable.

bw2 commented 7 months ago

Yes. I use 2 kinds of tables in my analyses -

a locus table:

repeat motifs sampleA short allele sampleA long allele sampleB short allele sampleB long allele
rep1 AT 2 3 2 5
rep1 CAG 5 15 6 13
rep2 CAG 10 12 9 9

and an allele table:

repeat motifs allele sampleA allele sampleB allele
rep1 AT short 2 2
rep1 AT long 3 5
rep1 CAG short 5 6
rep1 CAG long 15 13
rep2 CAG short 10 9
rep2 CAG long 12 9

For single-sample VCFs, the tables would have other relevant columns like SD and AP. It would be nice if there was a script to generate these from TRGT vcfs.