Parsing MC counts vs. MS intervals in VCF rows with multiple motifs

bw2 commented 7 months ago

Running TRGT v0.8 on a custom catalog that includes loci with several adjacent motifs, I found this row in the output vcf throws off my parsing script:

chr1    119049087   .   AAAAAATAAAATAAAATAAAATA .   0   .   TRID=1-119049089-119049109-AAAAT-intron;END=119049109;MOTIFS=A,AAAAT;STRUC=(A)n(AAAAT)n GT:AL:ALLR:SD:MC:MS:AP:AM   0/0:23,23:23-24,22-24:18,28:3_4,3_4:0(0-2)_1(2-22)_0(22-23),0(0-2)_1(2-22)_0(22-23):1.000000,1.000000:.,.

In that row,

STRUC=(A)n(AAAAT)n
MC=3_4,3_4
MS=0(0-2)_1(2-22)_0(22-23),0(0-2)_1(2-22)_0(22-23)

IIUC, the MS field represents the 1st haplotype as AA|AAAAT.AAAAT.AAAAT.AAAAT|A. but the reuse of the 0th node in the path (ie. (A)n) at the end of the sequence makes parsing more complicated. Also, this means I can't rely on the MC value to understand the haplotype sequence since MC would be the same if TRGT had instead partitioned the sequence as AAA|AAAAT.AAAAT.AAAAT.AAAAT

I think it would be more intuitive if the output was either

STRUC=(A)n(AAAAT)n(A)n
MC=2_4_1,2_4_1
MS=0(0-2)_1(2-22)_2(22-23),0(0-2)_1(2-22)_2(22-23)

or, if the STRUC is kept the same, then dropping the last node from the path

STRUC=(A)n(AAAAT)n
MC=2_4,2_4
MS=0(0-2)_1(2-22),0(0-2)_1(2-22)

Not sure how common or significant this issue is, but it seems to make parsing more challenging compared to ExpansionHunter outputs. Ideally, there would be a simple way to get the copy numbers and confidence intervals for each motif in the STRUC.

egor-dolzhenko commented 7 months ago

Yes, agreed. We had many discussions about the best way to define these fields. We eventually decided that MC will contain the total count of the corresponding motif (even if some motif copies do not occur contiguously), while MS would provide a more detailed information about which segment(s) of the allele correspond to a given motif. Maybe a good path forward would be to add more fields that summarize the motif counts and spans in different ways?

bw2 commented 7 months ago

I think that would help.. what I'm trying to get to is a table where each row is a repeat locus, and which I could merge across samples to then look for outliers. Even though MS provides the most detailed representation of each haplotype, it's difficult to merge across samples if one sample has lets say (haploid) genotype MS=0(0-5)_1(5-15)_0(15-25) , while another has MS=0(0-5)_1(5-25). For ExansionHunter, merging on locusId+variantId as the row key gave me a table with copy numbers for each atomic locus/motif. To do something similar for TRGT vcfs I'm thinking to either use MC counts or specify all loci as single motif (eg. (GAA)n rather than (A)n(GAA)n).

egor-dolzhenko commented 7 months ago

Yes, agreed, it's difficult to merge on the MS field. Are you thinking about doing something like this using the MC field?

repeat	motifs	sampleA	sampleB
rep1	AT	2	3
rep1	CAG	5	15
rep2	CAG	10	12

Note that since rep1 is composite (say (AT)n(CAG)n), it got split into two rows. Also, this table could either have diploid genotypes or just store the longest allele if this simplification is acceptable.

bw2 commented 7 months ago

Yes. I use 2 kinds of tables in my analyses -

a locus table:

repeat	motifs	sampleA short allele	sampleA long allele	sampleB short allele	sampleB long allele
rep1	AT	2	3	2	5
rep1	CAG	5	15	6	13
rep2	CAG	10	12	9	9

and an allele table:

repeat	motifs	allele	sampleA allele	sampleB allele
rep1	AT	short	2	2
rep1	AT	long	3	5
rep1	CAG	short	5	6
rep1	CAG	long	15	13
rep2	CAG	short	10	9
rep2	CAG	long	12	9

For single-sample VCFs, the tables would have other relevant columns like SD and AP. It would be nice if there was a script to generate these from TRGT vcfs.

PacificBiosciences / trgt

Parsing MC counts vs. MS intervals in VCF rows with multiple motifs #26