Open zztin opened 4 years ago
Thanks for your comments and questions.
Regarding the subPos column: I updated the README file and added examples to illustrate how the coordinates are defined.
Right now, it is not easy to output the coordinates of the "target sequence" instead of the tandem repeats, because that requires an accurate alignment to determine how the "target sequence" is contained in each tandem repeat unit. We may implement this in the future.
Regarding a score or divergence rate: we could add a column of average accuracy, calculated based on the alignment between each repeat unit and the consensus sequence. Will this work for you?
Yan
Hi Yan,
Not sure if I understand your question correctly. You could align one consensus sequence to another and see whether they have enough matched bases. Since each consensus sequence may start from any position of the target sequence, you can append one more copy to each consensus, and then align the two doubled sequences to each other.
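To illustrate the "append one more copy" idea, here is a minimal sketch. It is not TideHunter code, and it swaps a real alignment for a plain position-by-position comparison, so it assumes both consensus sequences have the same length and differ only by substitutions. The function name `best_rotation_identity` is made up for this example.

```python
def best_rotation_identity(cons_a: str, cons_b: str) -> tuple[int, float]:
    """Return (offset, identity) for the rotation of cons_b that best matches cons_a.

    Simplification: equal-length sequences, compared base by base (no indels).
    A real comparison would use a pairwise aligner instead.
    """
    if len(cons_a) != len(cons_b) or not cons_a:
        raise ValueError("this sketch needs two non-empty sequences of equal length")
    n = len(cons_a)
    # Every rotation of cons_b appears as a length-n window of the doubled string.
    doubled = cons_b + cons_b
    best_offset, best_identity = 0, -1.0
    for offset in range(n):
        window = doubled[offset:offset + n]
        matches = sum(x == y for x, y in zip(cons_a, window))
        identity = matches / n
        if identity > best_identity:
            best_offset, best_identity = offset, identity
    return best_offset, best_identity


# Two consensus sequences for the same unit, starting at different positions.
offset, identity = best_rotation_identity("ACGTTGCA", "TGCAACGT")
print(offset, identity)  # 4 1.0
```

If the two sequences can differ by insertions or deletions, the same doubling trick still applies, but the window comparison would need to be replaced by a proper alignment.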
Check out the -u/--unit-seq option in the latest release, v1.4.0. It will give you all the unit sequences of each tandem repeat.
Hi Yan, thank you for the new --unit-seq feature. I tried it out and it looks good! I have a question about the aveMatch score. The test_50x4 example gives a score of 98.0 even though the unit sequences are identical to each other. Is this expected? Is the aveMatch score on a scale of 0 - 100 (%)?
Another, unrelated question: in this example, the 4 repeats start at 51, 101, 151, 201. I would expect the subPos to be 51, 101, 151, 201, 250 instead of 101,151,201,250. Is the first repeat always left out of the subPos list, even if it is complete? Or did I misinterpret something?
Thank you very much!
Output of the test_50x4.fa test case:
>test_50x4_rep0_300_51_250_50_4.0_98.0_0_101,151,201,250
CAGCTAGTCGATCGATCGGCATGCATGCATGCTAGTCGATGCATCGGGAT
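For anyone who wants to read these headers programmatically, below is a small sketch. The field order (read name, repeat ID, read length, start, end, consensus length, copy number, aveMatch, full-length flag, subPos) is only inferred from the values in the header above, so please check it against the README of the version you are running. `RepeatHeader` and `parse_header` are hypothetical names, not part of TideHunter.

```python
from typing import List, NamedTuple


class RepeatHeader(NamedTuple):
    # Field order is an assumption inferred from the example header.
    read_name: str
    rep_id: str
    read_len: int
    start: int
    end: int
    cons_len: int
    copy_num: float
    ave_match: float
    full_len: int
    sub_pos: List[int]


def parse_header(header: str) -> RepeatHeader:
    # Split from the right, because the read name itself may contain "_".
    parts = header.lstrip(">").rsplit("_", 9)
    if len(parts) != 10:
        raise ValueError(f"unexpected header layout: {header!r}")
    (name, rep_id, read_len, start, end,
     cons_len, copy_num, ave_match, full_len, sub_pos) = parts
    return RepeatHeader(
        read_name=name,
        rep_id=rep_id,
        read_len=int(read_len),
        start=int(start),
        end=int(end),
        cons_len=int(cons_len),
        copy_num=float(copy_num),
        ave_match=float(ave_match),
        full_len=int(full_len),
        sub_pos=[int(p) for p in sub_pos.split(",")],
    )


h = parse_header(">test_50x4_rep0_300_51_250_50_4.0_98.0_0_101,151,201,250")
print(h.read_name, h.start, h.end, h.sub_pos)
```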
You are right. The sequence was shifted by 1 bp. I will fix this bug soon. Thanks for pointing it out.
Just updated to v1.4.1. Please try the new version.
For your other questions:
The aveMatch score is the percentage of matched bases over the total length of each unit, so it is 0~100 (%).
The subPos information is based on the kmer matches, so it does not point to the very start position of the first tandem repeat unit. This is expected, since there may not be enough matched kmers around that start position. The start and end information, which are 51 and 101 in this toy example, denote the start and end coordinates of the whole tandem repeat. To obtain these two positions, TideHunter aligns the generated consensus sequence back to the raw read.
I think it is feasible to derive a set of subPos that includes as many units as possible. The same goes for the subPos for full-length consensus sequences. I will work on that.
Hi Gao, I tried to retrieve the repeated subunits from the long read and feed them into other consensus calling methods (such as Medaka by ONT or majority voting).
According to the README: subPos: start coordinates of all the tandem repeat unit sequence, followed by the end coordinate of the last tandem repeat unit sequence, separated by ",", all coordinates are 1-based.
Thank you very much!!
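In case it helps others doing the same thing, here is a rough sketch of cutting the units out of a read from a subPos list as the README describes it (1-based unit starts, followed by the end of the last unit). It assumes the final coordinate is inclusive, which matches the 201..250 unit in the toy example but is not something confirmed here. The naive column-wise majority vote only works when all units have the same length; it is a stand-in for illustration, not a replacement for Medaka or an alignment-based consensus. Both function names are made up.

```python
from collections import Counter
from typing import List


def split_units(read_seq: str, sub_pos: List[int]) -> List[str]:
    """Cut tandem repeat units out of read_seq using 1-based subPos coordinates."""
    if len(sub_pos) < 2:
        raise ValueError("subPos needs at least one start and the final end")
    starts = sub_pos[:-1]
    # Each unit ends right before the next start; the last one ends at sub_pos[-1]
    # (assumed inclusive).
    ends = [s - 1 for s in starts[1:]] + [sub_pos[-1]]
    # Convert 1-based inclusive coordinates to Python slices.
    return [read_seq[s - 1:e] for s, e in zip(starts, ends)]


def majority_consensus(units: List[str]) -> str:
    """Naive per-column majority vote; requires units of identical length."""
    if not units or len({len(u) for u in units}) != 1:
        raise ValueError("this sketch needs non-empty units of equal length")
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*units))


# Toy read: 2 flanking bases, three copies of a 5-bp unit (one with an error), 2 flanking bases.
read = "NN" + "ACGTA" + "ACCTA" + "ACGTA" + "NN"
units = split_units(read, [3, 8, 13, 17])
print(units)                      # ['ACGTA', 'ACCTA', 'ACGTA']
print(majority_consensus(units))  # ACGTA
```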