bcgsc / straglr

Tandem repeat expansion detection or genotyping from long-read alignments

Different results from different versions of straglr #6

Closed Jesson-mark closed 2 years ago

Jesson-mark commented 2 years ago

Hi, I'm using straglr to analyze a tandem repeat. I previously installed straglr version 1.1.1, and I have now downloaded the newest source code as a zip archive, which is version 1.2.0. I ran both versions on my data and got different results. My motif is CGG and the reference copy number is 11. I specified --max_str_len 50 --min_str_len 2 and left the other parameters at their defaults. The old version of straglr found 7 reads with the following motif copy numbers:

946.7 843.0 400.3 395.0 18.7 16.7 16.7

The clustered allele copy numbers are 894.8(2);169.5(5). While the per-read numbers are correct according to my manual inspection, the clustering result is not ideal.

The newer version of straglr found 8 reads with the following motif copy numbers:

400.3 18.7 16.7 16.7 16.3 16.0 15.0 14.3

The clustered allele copy number is 16.2(7). The reads with copy numbers 946.7 and 843.0 are not reported by the newer straglr, and the newer straglr found some reads that the older version did not.
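As a quick sanity check on the numbers above, the reported allele values are consistent with the arithmetic mean of the reads in each cluster. The sketch below only reproduces that arithmetic. The assignment of reads to clusters is inferred from the reported support counts, and nothing here is taken from straglr's code, so whether straglr actually reports a mean is an assumption.

```python
from statistics import mean

# Per-read copy numbers copied from the two runs above.
# The grouping into clusters is inferred from the "(2)", "(5)" and "(7)"
# support counts; it is an assumption, not straglr output.
old_clusters = {
    "894.8(2)": [946.7, 843.0],
    "169.5(5)": [400.3, 395.0, 18.7, 16.7, 16.7],
}
new_clusters = {
    # The 400.3 read is left out: 8 reads were found but only 7 support the allele.
    "16.2(7)": [18.7, 16.7, 16.7, 16.3, 16.0, 15.0, 14.3],
}


def cluster_means(clusters):
    """Return {label: mean copy number} for each inferred cluster."""
    return {label: mean(values) for label, values in clusters.items()}


if __name__ == "__main__":
    for name, clusters in (("v1.1.1", old_clusters), ("v1.2.0", new_clusters)):
        for label, value in cluster_means(clusters).items():
            print(f"{name}: reported {label}, mean of reads = {value:.2f}")
```

Under this reading, the old 169.5 allele averages two reads near 400 copies with three reads near 17 copies, which would explain why that cluster does not look ideal.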

I don't understand why there is such a difference. Could you explain it?

Thanks!

readmanchiu commented 2 years ago

Thanks for reporting, @Jesson-mark. One thing I added in v1.2 is a more stringent check that the tandem repeat occupies most of the "novel" sequence (the expansion sequence if you're doing a genome scan, or the sequence sandwiched between the 2 coordinates provided in genotyping mode). The reason for doing this is to separate cases of retrotransposon insertion (which sometimes consist of tandem repeats flanked by non-repeat sequences) from true repeat expansions. I suspect the two reads that got filtered out have some stretches of sequence, usually near the ends, that are not repeats. You can check the sub-sequences of the 2 reads using the start coordinate and size information from the old result and see if this is the case. I'm more than happy to debug this too if you send me the data (like a BAM file of just the locus in question).
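The manual check suggested above could be sketched roughly as follows. This is not straglr's own filter: it is a crude stand-in that only counts perfect runs of the motif (in any rotation), so sequencing errors in long reads will lower the coverage it reports. The function name `repeat_profile` and the `min_copies` parameter are hypothetical, and the sub-sequence is assumed to have already been sliced out of the read using the start and size from the old result.

```python
import re


def repeat_profile(seq, motif, min_copies=2):
    """Summarise how much of seq consists of perfect tandem copies of motif.

    Returns (fraction_covered, leading_non_repeat, trailing_non_repeat).
    Only exact runs of at least min_copies copies are counted, which is an
    illustrative simplification, not straglr's algorithm.
    """
    seq = seq.upper()
    motif = motif.upper()
    if not seq:
        return 0.0, 0, 0
    # Accept every rotation of the motif, e.g. CGG, GGC and GCG.
    rotations = {motif[i:] + motif[:i] for i in range(len(motif))}
    covered = [False] * len(seq)
    for rot in rotations:
        pattern = re.compile("(?:%s){%d,}" % (re.escape(rot), min_copies))
        for match in pattern.finditer(seq):
            for i in range(match.start(), match.end()):
                covered[i] = True
    if not any(covered):
        return 0.0, len(seq), 0
    first = covered.index(True)
    last = len(covered) - 1 - covered[::-1].index(True)
    return sum(covered) / len(seq), first, len(seq) - 1 - last


if __name__ == "__main__":
    # Toy sequence: 10 CGG copies flanked by 6 non-repeat bases on each side.
    toy = "ATATAT" + "CGG" * 10 + "TATATA"
    fraction, lead, trail = repeat_profile(toy, "CGG")
    print(f"covered={fraction:.2f} leading={lead} trailing={trail}")
```

For a real read, the input would be something like `read_seq[start:start + size]`, where `read_seq`, `start` and `size` are placeholders for the read sequence and the values reported by the old run (whether the start is 0-based is an assumption to verify). Long non-repeat stretches at either end would support the explanation that the stricter check in v1.2 filtered those reads out.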

readmanchiu commented 2 years ago

Sorry for the delayed response, as I'm still away from work, and thanks again for reporting this. Since quite a few things have changed, I want to see the effect on others' data before I officially tag it as a new version.

Jesson-mark commented 2 years ago

Thanks for your reply! Sorry for bothering you when you are away from work.

I will follow your suggestion and check what is happening with those 2 reads. If I make any progress or run into a problem, I will let you know.

Best wishes!