PacificBiosciences / trgt

Tandem repeat genotyping and visualization from PacBio HiFi data

Why is detecting and genotyping Short Tandem Repeats (STRs) challenging? #21

Open CSU-KangHu opened 7 months ago

CSU-KangHu commented 7 months ago

Hi, Thank you for developing the excellent TRGT tool. I've read your paper "Resolving the unsolved: Comprehensive assessment of tandem repeats at scale". To gain a better understanding, I've also read several other papers on STR detection and genotyping. However, I'm still confused by the following questions:

  1. TRGT requires the --repeats <REPEATS> parameter, a BED file with the reference coordinates and structure of tandem repeats. Since the structure and location of the motifs on the reference genome are already known, what are the challenges in detecting motifs and their repeat counts in reads? What distinguishes TRGT from existing tools like straglr and RepeatHMM?

  2. How can we compare the performance of TRGT with that of other STR detection and genotyping tools? Are there established and reliable benchmark datasets available for this purpose?

egor-dolzhenko commented 7 months ago

Thank you for the questions. In many cases identifying and counting motifs in reads is straightforward. But sometimes it gets more complicated because of mosaicism, sequence composition changes, nested repeats, etc. Different tools resolve these challenges in different ways and may also be designed to profile different kinds of repeats. It would make sense to pick a tool that best aligns with the needs of your project. As for benchmarking, here is a recent paper that proposes a new benchmarking framework designed specifically for tandem repeats. Many groups that work on repeat expansions also sequence some samples with known expansions of the repeats they are interested in and then confirm that their tool of choice can detect them. I hope this response is helpful!

minghuaxu commented 6 months ago

Hi @egor-dolzhenko, Thank you for presenting the TR catalog and variant benchmark files in the 'Benchmarking of small and large variants across tandem repeats' paper. I am curious whether the Truvari refine method can evaluate the performance of TR calling tools such as TRGT and straglr, enabling the assessment of metrics like accuracy, F1 score, and recall for these tools. Given that the ground-truth VCF file lacks motif and repeat-count information, the Truvari refine method evaluates TR calls based on sequence similarity, size similarity, and other indicators. Is it possible to evaluate the accuracy of the motifs and repeat counts in a TR call set?
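To make "sequence similarity" and "size similarity" concrete, here is a minimal sketch of how a sequence-level comparison between a truth allele and a called allele could look. This is not Truvari's implementation: the function names, the use of Python's `difflib` as a stand-in similarity measure, and the 0.7 thresholds are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def size_similarity(truth: str, call: str) -> float:
    """Ratio of the shorter allele length to the longer one (1.0 = same size)."""
    longer = max(len(truth), len(call))
    if longer == 0:
        return 1.0
    return min(len(truth), len(call)) / longer


def sequence_similarity(truth: str, call: str) -> float:
    """Stand-in sequence similarity in [0, 1] based on difflib matching blocks."""
    return SequenceMatcher(None, truth, call, autojunk=False).ratio()


def is_match(truth: str, call: str,
             min_size: float = 0.7, min_seq: float = 0.7) -> bool:
    """Hypothetical rule: a call matches if both similarities pass a threshold."""
    return (size_similarity(truth, call) >= min_size
            and sequence_similarity(truth, call) >= min_seq)


truth_allele = "CAG" * 10
close_call = "CAG" * 9 + "CAA"   # same size, one substituted base
short_call = "CAG" * 3           # far shorter than the truth allele

print(is_match(truth_allele, close_call))
print(is_match(truth_allele, short_call))
```

Note that nothing in this comparison looks at motifs or repeat counts: two alleles are judged only by their length and base content.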

egor-dolzhenko commented 6 months ago

Thank you for the question. Yes, Truvari is a sequence-level benchmark. In my opinion, evaluating the accuracy of motifs and their counts is a much more elusive task. For example, some tools might count only the exact motif copies while other tools might also detect imperfect motifs. Because of this, different tools might produce very different motif counts which would all be "correct". When it comes to resolving motif counts, it might be best to do a project-specific benchmarking study and pick a tool that best fits the needs of the specific project.
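As a toy illustration of how two counting conventions can disagree on the same allele, the sketch below counts CAG copies in a made-up sequence with one interruption (CAA). Both functions are hypothetical and do not reflect how TRGT, straglr, or any other tool counts motifs; the mismatch tolerance of one base per unit is an arbitrary assumption.

```python
import re


def count_exact(seq: str, motif: str) -> int:
    """Copies in the longest run of perfect, back-to-back motif copies."""
    runs = re.findall(f"(?:{re.escape(motif)})+", seq)
    return max((len(run) // len(motif) for run in runs), default=0)


def count_tolerant(seq: str, motif: str, max_mismatches: int = 1) -> int:
    """Walk in motif-sized steps from the first exact copy, accepting
    units that differ from the motif by at most max_mismatches bases."""
    start = seq.find(motif)
    if start < 0:
        return 0
    k = len(motif)
    count = 0
    for i in range(start, len(seq) - k + 1, k):
        unit = seq[i:i + k]
        mismatches = sum(a != b for a, b in zip(unit, motif))
        if mismatches > max_mismatches:
            break
        count += 1
    return count


# Made-up allele: flank, 5 x CAG, one CAA interruption, 4 x CAG, flank.
allele = "TT" + "CAG" * 5 + "CAA" + "CAG" * 4 + "GG"

print(count_exact(allele, "CAG"))     # 5: the interruption ends the run
print(count_tolerant(allele, "CAG"))  # 10: the CAA unit is counted as imperfect
```

Both answers are defensible under their own definition, which is why a count-level benchmark first has to fix what counts as a motif copy.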