Xinglab / TideHunter

TideHunter: efficient and sensitive tandem repeat detection from noisy long reads using seed-and-chain
https://github.com/yangao07/TideHunter
MIT License

subPos & match score feature request #5

Open zztin opened 4 years ago

zztin commented 4 years ago

Hi Gao, I tried to retrieve the repeated subunits from the long read and feed them into other consensus calling methods (such as Medaka by ONT or majority voting).

According to the README: subPos: start coordinates of all the tandem repeat unit sequence, followed by the end coordinate of the last tandem repeat unit sequence, separated by ",", all coordinates are 1-based.
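To make the coordinate convention concrete, here is a minimal sketch of how the units could be cut out of a read from a subPos string. The function name `units_from_subpos` is hypothetical, and the sketch assumes that consecutive units are adjacent (each unit ends one base before the next start), which the README text quoted above does not state explicitly.

```python
def units_from_subpos(read_seq, sub_pos):
    """Split a read into repeat units using a subPos string.

    Assumes the convention quoted from the README: N start coordinates
    followed by the end coordinate of the last unit, all 1-based.
    Also assumes (not confirmed by the README) that units are adjacent.
    """
    coords = [int(x) for x in sub_pos.split(",")]
    if len(coords) < 2:
        raise ValueError("subPos needs at least one start and one end")
    starts, last_end = coords[:-1], coords[-1]
    # Each unit ends right before the next unit starts;
    # the last unit ends at the final coordinate.
    ends = [s - 1 for s in starts[1:]] + [last_end]
    # Convert 1-based inclusive [s, e] to a 0-based half-open slice.
    return [read_seq[s - 1:e] for s, e in zip(starts, ends)]


if __name__ == "__main__":
    read = "NN" + "ACGT" * 3 + "NN"
    for unit in units_from_subpos(read, "3,7,11,14"):
        print(unit)
```

The resulting list of unit sequences could then be passed to an external consensus caller.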

  1. In some reads, multiple consensus sequences of different lengths are reported for (completely) overlapping regions. Is it possible to include a column that reports the overall alignment score of the subunits?
    • I see there is a criterion to filter by the maximum divergence rate between two consecutive repeats, but this does not necessarily reflect the quality of the overall consensus. Is this a correct interpretation? Would it be possible to add a score that reports the divergence rate of all repeats to the consensus sequence?
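As an illustration of the kind of score being requested, the sketch below computes the mean divergence of every unit to a consensus using plain edit distance. This is not TideHunter's own scoring; the function names are hypothetical, and it assumes the units and the consensus start at the same phase of the repeat.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or mismatch
        prev = cur
    return prev[-1]


def mean_divergence(units, consensus):
    """Average per-unit divergence (edit distance / longer length)."""
    if not units:
        raise ValueError("no units given")
    rates = [edit_distance(u, consensus) / max(len(u), len(consensus))
             for u in units]
    return sum(rates) / len(rates)


if __name__ == "__main__":
    print(mean_divergence(["ACGT", "ACCT"], "ACGT"))
```

A value of 0.0 means every unit is identical to the consensus; larger values indicate a noisier repeat.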

Thank you very much!!

yangao07 commented 4 years ago

Thanks for your comments and questions.

Yan

zztin commented 4 years ago

Hi Yan,

yangao07 commented 4 years ago

Not sure if I understand your question correctly. You could align one consensus sequence to another and see whether they have enough matched bases. Since each consensus sequence may start from any position of the target sequence, you can append one more copy to each consensus, and align the two copies to each other.
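A rough sketch of this idea follows. It simplifies the suggestion slightly: only the first consensus is doubled, and the second is aligned against it with free end gaps, so the starting phase of the second consensus does not matter. The function name `best_rotation_distance` is hypothetical, and unit-cost edit distance is an assumed scoring scheme.

```python
def best_rotation_distance(cons_a, cons_b):
    """Smallest edit distance between cons_b and any substring of cons_a + cons_a.

    Doubling cons_a makes every rotation of it appear as a substring,
    so two consensus sequences that differ only in phase score 0.
    """
    text = cons_a + cons_a
    # Row 0 is all zeros: the alignment may start anywhere in the text.
    prev = [0] * (len(text) + 1)
    for i, cb in enumerate(cons_b, 1):
        cur = [i]  # cons_b[:i] aligned against an empty text prefix
        for j, ct in enumerate(text, 1):
            cur.append(min(prev[j] + 1,                # skip a base of cons_b
                           cur[j - 1] + 1,             # skip a base of the text
                           prev[j - 1] + (cb != ct)))  # match or mismatch
        prev = cur
    # The alignment may also end anywhere in the text.
    return min(prev)


if __name__ == "__main__":
    print(best_rotation_distance("ACGTTGCA", "TTGCAACG"))
```

Dividing the result by the length of the second consensus gives a rough mismatch rate that can be compared against a threshold of your choice.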

yangao07 commented 4 years ago

Check out the -u/--unit-seq option in the latest release: v1.4.0. It will give you all the unit sequences of each tandem repeat.

zztin commented 4 years ago

Hi Yan, thank you for the new --unit-seq feature. I tried it out and it looks good! I have a question about the aveMatch score. The test_50x4 example gives a score of 98.0 even though the sequences are identical to each other. Is this expected? Is the aveMatch score on a scale of 0 - 100 (%)?

Another, unrelated question: in this example, the 4 repeats start at 51, 101, 151, 201. I would expect the subPos to be 51,101,151,201,250 instead of 101,151,201,250. Is it always the case that the first repeat is not included in the subPos list, even if it is complete? Or did I misinterpret something?

Thank you very much!

output of the test_50x4.fa test case:

>test_50x4_rep0_300_51_250_50_4.0_98.0_0_101,151,201,250
CAGCTAGTCGATCGATCGGCATGCATGCATGCTAGTCGATGCATCGGGAT

yangao07 commented 4 years ago

You are right. The sequence was shifted by 1 bp. I will fix this bug soon. Thanks for pointing it out.

yangao07 commented 4 years ago

Just updated to v1.4.1. Please try the new version.

yangao07 commented 4 years ago

For your other questions:

yangao07 commented 4 years ago

I think it is feasible to derive a set of subPos that includes as many units as possible. The same goes for the subPos of full-length consensus sequences. I will work on that.