bcgsc / straglr

Tandem repeat expansion detection or genotyping from long-read alignments

Why is detecting and genotyping Short Tandem Repeats (STRs) challenging? #29

Closed CSU-KangHu closed 2 months ago

CSU-KangHu commented 7 months ago

Hi, Thank you for developing the excellent straglr tool. I've read your paper "Straglr: discovering and genotyping tandem repeat expansions using whole genome long-read sequences". I'm still confused by the following questions:

  1. straglr requires specifying the parameter --loci: a BED file containing loci to be genotyped. Since we know the structure and location of motifs on the reference genome, what are the challenges in detecting motifs and their repeat counts in reads?

  2. How can we evaluate the performance between various STR detection and genotyping tools? Is there a gold standard dataset available to evaluate the sensitivity and precision of different tools?

readmanchiu commented 7 months ago

Thanks for your interest in our software. Answers to your questions:

  1. I think the challenge comes from the noise (sequencing errors) in the long reads and, often, the impurity of the repeats themselves. Providing the BED file with repeat information is for using Straglr to genotype target regions. The other main application is to use Straglr to detect repeat expansions from whole-genome sequencing.
  2. As far as I know, there is no gold standard dataset. The PacBio cell line dataset I indicated in the paper is still available. Many tool developers were able to access patient data for benchmarking, but unfortunately they won't make the data public because it comes from real individuals.
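
To make point 1 concrete, here is a minimal Python sketch of why counting motif copies in a read is harder than it looks, even when the motif and locus are known. This is not how Straglr works; the function name `longest_exact_run` and the toy sequences are made up for illustration. It counts back-to-back exact copies of a motif and shows how a single deleted base (sequencing noise) or a single interrupted copy (repeat impurity) splits a 10-copy tract into shorter runs.

```python
def longest_exact_run(seq: str, motif: str) -> int:
    """Return the largest number of back-to-back exact motif copies in seq."""
    best = 0
    k = len(motif)
    for start in range(len(seq)):
        count = 0
        pos = start
        # Extend the run while the next k bases match the motif exactly.
        while seq.startswith(motif, pos):
            count += 1
            pos += k
        best = max(best, count)
    return best


# Hypothetical toy reads; each is meant to represent a 10-copy CAG tract.
clean = "CAG" * 10                       # error-free read
noisy = "CAG" * 4 + "CG" + "CAG" * 5     # one base deleted in the 5th copy
impure = "CAG" * 6 + "CAA" + "CAG" * 3   # one interrupted (impure) copy

print(longest_exact_run(clean, "CAG"))   # 10
print(longest_exact_run(noisy, "CAG"))   # 5
print(longest_exact_run(impure, "CAG"))  # 6
```

Exact matching undercounts the noisy and impure reads, so a practical genotyper presumably needs some tolerance for mismatches and indels when sizing the repeat, and that tolerance is where the difficulty lies.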