bcgsc / straglr

Tandem repeat expansion detection or genotyping from long-read alignments
Other
50 stars 9 forks source link

What does parameter `min_support` mean in genotyping mode of Straglr? #7

Closed Jesson-mark closed 2 years ago

Jesson-mark commented 2 years ago

Hi,

Recently I have used Straglr to genotype a STR of my data. And I found a quite strange phenomenon. With the default parameters, I got these copy numbers of TR of several reads:

946.7 843 400.3 395 18.7 16.7 16.7

and the copy numbers of two clusters(alleles) are 894.8(the first two reads) and 169.5(other reads). It seems reasonable.

But when I turned min_cluster_size to 3 or 4, the results are same as that of 2! Which means even if min_cluster_size is bigger than 2, the first two reads are still clustered together. That's strange.

After some simple inspections of TREFinder class, I found that min_cluster_size is affected by min_support, shown in the code below:

self.min_cluster_size = min_cluster_size if min_cluster_size < self.min_support else self.min_support

I don't know why min_cluster_size cannot be larger than min_support. Since it is useful in genome-scan mode, what does it mean in genotyping mode?

Thank you for taking the time. Looking forward to your reply ! Jesson-mark

readmanchiu commented 2 years ago

Hi @Jesson-mark, very sorry for the late reply, I somehow missed my alert emails from github. Thanks very for reporting issue. min_support is the total number reads that carry the repeat fulfilling whatever conditions the user specified (e.g. min_str_len, max_str_len, min_ins_size, etc). min_cluster_size is the minimum number of reads clustered for each allele. The min_cluster_size was added after the first release in scenarios when I noticed usually very big alleles will have fewer support reads than smaller alleles and I still want to group them together. For example, I might set min_support to 10, and out of 12 support reads identified, only 4 carry a big allele. So if I set min_cluster_size to 2, the 4 reads will still be clustered together. Yes, this is intended for genome-scan mode, scenarios may be different in genotyping mode, as in your case. I will probably remove this requirement, and add a checking to make sure cluster_size cannot be larger than the actual number of support reads instead. I will think about it more and update you on any change I decide to make shortly.

readmanchiu commented 2 years ago

OK, I just enabled the min_cluster_size to be larger than min_support for genotyping mode in v1.2.0. I won't allow this for genome scan as this will potentially remove many loci from being detected. Thanks for raising the issue.

Jesson-mark commented 2 years ago

Thanks for your considerate explanations!

I will try the newest Straglr for my data. If there is any question I will report it. Thanks again for you modifications to Straglr.