PacificBiosciences / trgt

Tandem repeat genotyping and visualization from PacBio HiFi data

de novo repeat motifs #32

Closed MKaandemir closed 3 months ago

MKaandemir commented 4 months ago

Hi,

I am curious about the functionality of TRGT regarding tandem repeat motifs. Can TRGT identify de novo tandem repeat motifs in samples, or does it strictly use the motifs present in the repeat catalog?

Thank you!

egor-dolzhenko commented 4 months ago

Thanks for the question! Yes, TRGT only uses motifs present in the repeat catalog. However, we recently implemented a tool that can help identify novel motifs. For example, you could extract the allele sequences reported by TRGT for a repeat of interest and then run them through this tool.
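As a rough illustration of the "extract the allele sequences" step, here is a minimal Python sketch that pulls the called allele sequences for one repeat out of VCF text. It relies only on the generic VCF layout (REF and ALT columns plus the GT sample field). The assumption that the repeat identifier is stored in an INFO key named `TRID` is mine and should be checked against the TRGT documentation, as should whether the reported sequences include any flanking padding base. The function name `allele_sequences` is hypothetical.

```python
def allele_sequences(vcf_lines, repeat_id, id_key="TRID"):
    """Return the called allele sequences for one repeat.

    Assumption: the repeat ID lives in the INFO column under `id_key`.
    """
    sequences = []
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        ref, alt, info, fmt = fields[3], fields[4], fields[7], fields[8]
        info_map = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
        if info_map.get(id_key) != repeat_id:
            continue
        # Allele index 0 is REF; 1..n are the comma-separated ALT alleles.
        alleles = [ref] + ([] if alt == "." else alt.split(","))
        gt_pos = fmt.split(":").index("GT")
        for sample in fields[9:]:
            genotype = sample.split(":")[gt_pos]
            for idx in genotype.replace("|", "/").split("/"):
                if idx != ".":  # "." marks a missing call
                    sequences.append(alleles[int(idx)])
    return sequences
```

The returned sequences could then be written to a FASTA file or passed directly to a motif-discovery tool.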

MKaandemir commented 4 months ago

Thanks for the answer. I really appreciate it. I also wonder if the reported allele sequence can change if I change the order of motifs in the repeat catalog?

egor-dolzhenko commented 4 months ago

Happy to help! The allele sequences are not dependent on the specified motifs, so they shouldn't change. However, the reported motif counts could change in principle. One example is when you have an allele composed of a new, unknown motif that matches two known motifs equally well. It's best to keep the order of motifs the same in all analyses.
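To make the tie scenario concrete, here is a toy Python sketch. It is not TRGT's algorithm: it simply scores each catalog motif by the number of mismatches against a perfect tiling of the allele and picks the best one. The function name `best_motif` and the scoring scheme are assumptions made purely for illustration. Because Python's `min` returns the first minimal element, an allele made of an unknown motif that is equally distant from two known motifs gets assigned to whichever motif is listed first.

```python
def best_motif(allele, motifs):
    """Pick the catalog motif whose tiling has the fewest mismatches (toy model)."""
    def mismatches(motif):
        # Tile the motif across the allele and count differing positions.
        tiled = (motif * (len(allele) // len(motif) + 1))[:len(allele)]
        return sum(a != b for a, b in zip(allele, tiled))
    # min() keeps the first motif among equally good ones, so order matters.
    return min(motifs, key=mismatches)

allele = "ACC" * 5  # unknown motif, one mismatch per copy against ACG or ACT
print(best_motif(allele, ["ACG", "ACT"]))  # ACG
print(best_motif(allele, ["ACT", "ACG"]))  # ACT
```

The allele is identical in both calls; only the motif assignment changes, which is why keeping the catalog order fixed across analyses keeps results comparable.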

MKaandemir commented 4 months ago

We are constraining an unknown repeat to match one of the specified motifs in our repeat catalog. To discover new repeat motifs, the tr-solve tool is required, correct? Why isn't this feature implemented in the trgt tool?

I also wonder how trgt addresses mosaicism in this sequence:

ACGACGACGACGACTACTACTACTACGACGACGACG

Would you consider it as "ACG, ACT 8_5" or "ACG 4 ACT 5 ACG 4"?

egor-dolzhenko commented 4 months ago

It seems that identifying de novo motifs should be done at the population level instead of the single-sample level with TRGT. There are many messy low-complexity regions where it's not clear what the right motifs should be, and hence relatively small changes in the allele sequence may result in different motif sets.

As to your last question: the MC VCF field contains the overall count of each motif (even if the motif run is interrupted by another sequence), while the MS field lists the span of each uninterrupted motif run.

Note that the contents of the MC and MS fields are based on HMM segmentation and hence allow for imperfect motif copies. If you are only interested in studying perfect motif occurrences, you could get those directly from the allele sequences reported by TRGT.
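For the perfect-occurrence case, a minimal Python sketch of scanning an allele sequence might look like the following. It is a greedy exact-match scan, not the HMM segmentation TRGT uses, and the function names `perfect_motif_runs` and `total_counts` are hypothetical. It reports both views discussed above: uninterrupted runs with their spans (in the spirit of MS) and overall per-motif totals (in the spirit of MC). The example sequence is built from the "ACG 4 ACT 5 ACG 4" decomposition given in the question.

```python
from collections import Counter

def perfect_motif_runs(seq, motifs):
    """Greedy scan for runs of exact motif copies.

    Returns (motif, copies, start, end) tuples with 0-based, end-exclusive spans.
    """
    motifs = [m for m in motifs if m]  # ignore empty motifs
    runs, i = [], 0
    while i < len(seq):
        best = None  # (span_length, motif, copies)
        for motif in motifs:
            copies = 0
            while seq.startswith(motif, i + copies * len(motif)):
                copies += 1
            span = copies * len(motif)
            # Strict ">" means earlier motifs in the list win ties.
            if span and (best is None or span > best[0]):
                best = (span, motif, copies)
        if best is None:
            i += 1  # no motif starts here; skip one base
            continue
        span, motif, copies = best
        runs.append((motif, copies, i, i + span))
        i += span
    return runs

def total_counts(runs):
    """Sum copies per motif across all runs, interrupted or not."""
    totals = Counter()
    for motif, copies, _start, _end in runs:
        totals[motif] += copies
    return totals

seq = "ACG" * 4 + "ACT" * 5 + "ACG" * 4
runs = perfect_motif_runs(seq, ["ACG", "ACT"])
print(runs)  # [('ACG', 4, 0, 12), ('ACT', 5, 12, 27), ('ACG', 4, 27, 39)]
print(dict(total_counts(runs)))  # {'ACG': 8, 'ACT': 5}
```

Both answers in the question are therefore available at once: the totals correspond to the "8_5" style summary, while the run list preserves the "ACG 4, ACT 5, ACG 4" structure.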

MKaandemir commented 3 months ago

Thanks for the help!