Open bw2 opened 7 months ago
Yes, agreed. We had many discussions about the best way to define these fields. We eventually decided that MC
will contain the total count of the corresponding motif (even if some motif copies do not occur contiguously), while MS
would provide a more detailed information about which segment(s) of the allele correspond to a given motif. Maybe a good path forward would be to add more fields that summarize the motif counts and spans in different ways?
I think that would help.. what I'm trying to get to is a table where each row is a repeat locus, and which I could merge across samples to then look for outliers. Even though MS provides the most detailed representation of each haplotype, it's difficult to merge across samples if one sample has lets say (haploid) genotype MS=0(0-5)_1(5-15)_0(15-25)
, while another has MS=0(0-5)_1(5-25)
. For ExansionHunter, merging on locusId+variantId as the row key gave me a table with copy numbers for each atomic locus/motif. To do something similar for TRGT vcfs I'm thinking to either use MC counts or specify all loci as single motif (eg. (GAA)n
rather than (A)n(GAA)n
).
Yes, agreed, it's difficult to merge on the MS
field. Are you thinking about doing something like this using the MC
field?
repeat | motifs | sampleA | sampleB |
---|---|---|---|
rep1 | AT | 2 | 3 |
rep1 | CAG | 5 | 15 |
rep2 | CAG | 10 | 12 |
Note that since rep1
is composite (say (AT)n(CAG)n
), it got split into two rows. Also, this table could either have diploid genotypes or just store the longest allele if this simplification is acceptable.
Yes. I use 2 kinds of tables in my analyses -
a locus table:
repeat | motifs | sampleA short allele | sampleA long allele | sampleB short allele | sampleB long allele |
---|---|---|---|---|---|
rep1 | AT | 2 | 3 | 2 | 5 |
rep1 | CAG | 5 | 15 | 6 | 13 |
rep2 | CAG | 10 | 12 | 9 | 9 |
and an allele table:
repeat | motifs | allele | sampleA allele | sampleB allele |
---|---|---|---|---|
rep1 | AT | short | 2 | 2 |
rep1 | AT | long | 3 | 5 |
rep1 | CAG | short | 5 | 6 |
rep1 | CAG | long | 15 | 13 |
rep2 | CAG | short | 10 | 9 |
rep2 | CAG | long | 12 | 9 |
For single-sample VCFs, the tables would have other relevant columns like SD and AP. It would be nice if there was a script to generate these from TRGT vcfs.
Running TRGT v0.8 on a custom catalog that includes loci with several adjacent motifs, I found this row in the output vcf throws off my parsing script:
In that row,
IIUC, the MS field represents the 1st haplotype as
AA|AAAAT.AAAAT.AAAAT.AAAAT|A
. but the reuse of the 0th node in the path (ie. (A)n) at the end of the sequence makes parsing more complicated. Also, this means I can't rely on the MC value to understand the haplotype sequence since MC would be the same if TRGT had instead partitioned the sequence asAAA|AAAAT.AAAAT.AAAAT.AAAAT
I think it would be more intuitive if the output was either
or, if the STRUC is kept the same, then dropping the last node from the path
Not sure how common or significant this issue is, but it seems to make parsing more challenging compared to ExpansionHunter outputs. Ideally, there would be a simple way to get the copy numbers and confidence intervals for each motif in the STRUC.