bcgsc / straglr

Tandem repeat expansion detection or genotyping from long-read alignments

Calling complex STR patterns #35

Closed bartcharbon closed 1 month ago

bartcharbon commented 3 months ago

Dear @readmanchiu,

Some of the STRs we are interested in have a more complex pattern than a simple repeating sequence, e.g. something like (TTTTA){5}TTA(TTCTA){5} (five TTTTA units, followed by a single TTA, and then five TTCTA units).

Is this something Straglr would be able to do? And if so, what would be the correct way to specify a unit like this in the loci BED file?

And a related question: is there documentation on how to specify the repeat pattern? I've been playing around with "*" and "+" signs, also in combination with brackets and curly brackets. Are constructions with these tokens supported?
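To make the notation in the question concrete, here is a minimal Python sketch that expands the shorthand into a literal sequence and tests a read against a relaxed version of the pattern. It uses Python's `re` semantics only; it is not Straglr's motif syntax, and the names `expand` and `matches_relaxed` are made up for this illustration.

```python
import re

def expand(pattern: str) -> str:
    """Expand shorthand such as (TTTTA){5}TTA(TTCTA){5} into a literal sequence."""
    return re.sub(
        r"\(([ACGT]+)\)\{(\d+)\}",
        lambda m: m.group(1) * int(m.group(2)),
        pattern,
    )

# Literal sequence described in the question: 5 x TTTTA, one TTA, 5 x TTCTA.
expected = expand("(TTTTA){5}TTA(TTCTA){5}")

# Relaxed form (Python regex, an assumption for illustration only):
# one or more of each flanking unit around a single TTA.
relaxed = re.compile(r"(?:TTTTA)+TTA(?:TTCTA)+")

def matches_relaxed(read: str) -> bool:
    """Return True if the whole read fits the relaxed pattern exactly."""
    return relaxed.fullmatch(read) is not None
```

A read only passes `matches_relaxed` if it is error-free, which is rarely the case for long reads, so this is useful for reasoning about the pattern rather than for genotyping.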

readmanchiu commented 3 months ago

Hi @bartcharbon,

Thanks for looking into Straglr. Complex patterns with interruptions are a toughie. In my experience, the reads always deviate from the complex pattern specified: I don't think you would expect all your reads to show (TTTTA){5}TTA(TTCTA){5}, or even (TTTTA){n}TTA(TTCTA){n}.

The best Straglr can do right now is if you specify the expected motif as TT*, so that all 3 motifs will hopefully be captured. The "actual_motif" field in the TSV output will tell you what TRF thinks the motif is.

I have been made aware of some software that tries to delineate complex repeat patterns. Here's one: https://academic.oup.com/bioinformatics/article/39/4/btad185/7114028?login=false https://github.com/morisUtokyo/uTR I haven't tried it myself and would like to know if it's any good. Anyway, you can try Straglr with the TT* regex and give uTR a whirl. Please let me know the results of both if you do!
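To see how far an individual read deviates from the expected pattern, one option is to split the repeat region into known units and run-length encode the result. The sketch below is a naive greedy tokenizer written for illustration; it is not how Straglr, TRF, or uTR work, and `MOTIFS`, `tokenize`, and `summarize` are hypothetical names.

```python
from itertools import groupby

# Units from the example locus; 5-mers are listed first so that the
# shorter TTA does not shadow them during greedy matching.
MOTIFS = ("TTTTA", "TTCTA", "TTA")

def tokenize(read: str, motifs=MOTIFS) -> list[str]:
    """Greedily split a read into known motif units; unmatched bases become single-base tokens."""
    tokens = []
    i = 0
    while i < len(read):
        for motif in motifs:
            if read.startswith(motif, i):
                tokens.append(motif)
                i += len(motif)
                break
        else:
            # No motif matches at this position: keep the base as-is.
            tokens.append(read[i])
            i += 1
    return tokens

def summarize(read: str) -> list[tuple[str, int]]:
    """Run-length encode the tokenized read as (unit, consecutive count) pairs."""
    return [(unit, sum(1 for _ in group)) for unit, group in groupby(tokenize(read))]
```

For a read that follows the pattern exactly, `summarize` returns the three units with their counts. Insertions, deletions, or substitutions show up as extra single-base tokens or unexpected unit counts.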