subPos & match score feature request

zztin commented 4 years ago

Hi Gao, I tried to retrieve the repeated subunits from the long read and feed them into other consensus-calling methods (such as Medaka by ONT, or majority voting).

According to the README: subPos: start coordinates of all the tandem repeat unit sequence, followed by the end coordinate of the last tandem repeat unit sequence, separated by ",", all coordinates are 1-based.
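
To make the README definition concrete, here is a minimal Python sketch that turns a subPos string into per-unit (start, end) pairs and slices the units out of the read. It assumes consecutive units are contiguous, so each unit ends one base before the next one starts; that is an assumption, not something the README states. The function names are hypothetical and not part of TideHunter.

```python
def subpos_to_intervals(sub_pos):
    """Convert a subPos string into 1-based, inclusive (start, end) pairs.

    Assumption: units are contiguous, so unit i ends at start[i + 1] - 1.
    The final value in subPos is the end coordinate of the last unit.
    """
    coords = [int(x) for x in sub_pos.split(",")]
    if len(coords) < 2:
        raise ValueError("subPos needs at least one start and one end")
    starts, last_end = coords[:-1], coords[-1]
    ends = [s - 1 for s in starts[1:]] + [last_end]
    return list(zip(starts, ends))


def extract_units(read_seq, sub_pos):
    """Slice each repeat unit out of the read (1-based inclusive -> Python slice)."""
    return [read_seq[s - 1:e] for s, e in subpos_to_intervals(sub_pos)]
```

For example, `subpos_to_intervals("101,151,201,250")` gives `[(101, 150), (151, 200), (201, 250)]`.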

Problems I faced:
1. When 5' and 3' primers are given, the subPos is the start of the tandem repeat sequence, not the start of the targeted sequence. However, the length is the targeted sequence length. The tandem repeat length is not reported.
Is it possible to report the start location at the position where the target sequence starts instead of the whole tandem repeat?
Is it possible to include the (start, end) position of each sub-unit? Or to have an option to export all the repeat subunits in a fastq file (with identifiable read name such as >readname_consX_repY).

2. In some reads, multiple consensus sequences of different lengths are reported for (completely) overlapping regions. Is it possible to include a column reporting the overall alignment score of the subunits?
- I see there is a criterion to filter by the maximum divergence rate between two consecutive repeats, but this does not necessarily reflect the quality of the overall consensus. Is this a correct interpretation? Is it possible to add a score reporting the divergence rate of all repeats relative to the consensus sequence?

Thank you very much!!

yangao07 commented 4 years ago

Thanks for your comments and questions.

For the issues related to the subPos column, I updated the README file and added examples to illustrate how the coordinates are defined. Right now, it is not easy to output the coordinates of "target sequence" instead of the tandem repeats, since it needs accurate alignment to determine how the "target sequence" is contained in each tandem repeat unit. We may implement this in the future.
For the score or divergence rate: we could add a column of average accuracy, calculated from the alignment between each repeat unit and the consensus sequence. Will this work for you?

Yan

zztin commented 4 years ago

Hi Yan,

I understand.
Yes, that would be nice!
If I have several consensus sequences derived from one long nanopore read, would you recommend a method to assess whether these consensus sequences are actually the same sequence that got split up? (What I do now is align them to the genome sequence, but I wonder if you have any reference-free ideas.) In this figure, the blue are sense-strand repeats and the red are anti-sense.

yangao07 commented 4 years ago

Not sure if I understand your question correctly. You could align one consensus to another and check whether they have enough matched bases. Since each consensus sequence may start from any position of the target sequence, you can append one more copy to each consensus and align the two doubled sequences to each other.
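
A rough Python sketch of this idea, with two stated deviations: it doubles only one consensus and looks for the best approximate occurrence of the other inside it (a semi-global edit distance, Sellers' algorithm), and it also tries the reverse complement because repeats may come from either strand. The function names are hypothetical; a real pipeline would more likely use an established aligner.

```python
def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def min_edit_distance_in(pattern, text):
    """Smallest edit distance between pattern and any substring of text."""
    prev = [0] * (len(text) + 1)  # an empty pattern matches anywhere for free
    for i, p in enumerate(pattern, start=1):
        cur = [i] + [0] * len(text)
        for j, t in enumerate(text, start=1):
            cur[j] = min(prev[j - 1] + (p != t),  # match or substitution
                         prev[j] + 1,             # pattern base left unmatched
                         cur[j - 1] + 1)          # extra base in text
        prev = cur
    return min(prev)


def rotation_distance(cons_a, cons_b):
    """Edit distance from cons_a (either strand) to the best rotation of cons_b."""
    doubled = cons_b + cons_b
    return min(min_edit_distance_in(cons_a, doubled),
               min_edit_distance_in(revcomp(cons_a), doubled))
```

A small distance relative to the consensus length suggests the two consensus sequences are rotations (or reverse-complemented rotations) of the same unit; what threshold counts as "small" is left to the user.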

yangao07 commented 4 years ago

Check out the -u/--unit-seq option in the latest release: v1.4.0. It will give you all the unit sequences of each tandem repeat.

zztin commented 4 years ago

Hi Yan, thank you for the new --unit-seq feature. I tried it out and it looks good! I have a question about the aveMatch score. The test_50x4 example gives a score of 98.0 even though the sequences are identical to each other. Is this expected? Is the aveMatch score on a scale of 0 - 100 (%)?

Another, unrelated question: in this example, the 4 repeats start at 51, 101, 151, 201. I would expect the subPos to be 51, 101, 151, 201, 250 instead of 101,151,201,250. Is it always the case that the first repeat is not included in the tandem repeat subPos list, even if it is complete? Or did I misinterpret something?

Thank you very much!

output of the test_50x4.fa test case:

>test_50x4_rep0_300_51_250_50_4.0_98.0_0_101,151,201,250
CAGCTAGTCGATCGATCGGCATGCATGCATGCTAGTCGATGCATCGGGAT
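
For downstream scripts, a header like the one above can be split into fields. The Python sketch below is only an illustration: the field order (read name, repeat ID, read length, start, end, consensus length, copy number, aveMatch, full-length flag, subPos) is inferred from this one example and the discussion in this thread, so it should be checked against the README for the version in use. The function name and dictionary keys are hypothetical. Splitting from the right keeps read names that contain underscores intact.

```python
def parse_unit_header(header):
    """Split a TideHunter-style FASTA header into named fields.

    Assumption: the last nine '_'-separated fields follow the order seen in
    the test_50x4 example; everything before them is the read name.
    """
    fields = header.lstrip(">").rsplit("_", 9)
    if len(fields) != 10:
        raise ValueError("unexpected header layout: " + header)
    (name, rep_id, read_len, start, end,
     cons_len, copy_num, ave_match, full_len, sub_pos) = fields
    return {
        "read_name": name,
        "rep_id": rep_id,
        "read_len": int(read_len),
        "start": int(start),
        "end": int(end),
        "cons_len": int(cons_len),
        "copy_num": float(copy_num),
        "ave_match": float(ave_match),
        "full_len": int(full_len),
        "sub_pos": [int(x) for x in sub_pos.split(",")],
    }
```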

yangao07 commented 4 years ago

You are right. The sequence was shifted by 1 bp. I will fix this bug soon. Thanks for pointing it out.

yangao07 commented 4 years ago

Just updated to v1.4.1. Please try the new version.

yangao07 commented 4 years ago

For your other questions:

The aveMatch score is the average percentage of # matched bases over the total length of each unit, so it is 0~100 (%).
The subPos information is based on the kmer matches, so it does not point to the very start position of the first tandem repeat unit. This is expected, since there may not be enough matched kmers around that start position. The start and end information, which are 51 and 250 in this toy example, denote the start and end coordinates of the whole tandem repeat. To obtain these two positions, TideHunter aligns the generated consensus sequence back to the raw read.
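
As an illustration of the aveMatch definition above, here is a simplified Python sketch. It compares each unit to the consensus position by position with no gaps, which is a simplification made for this example; TideHunter itself computes the score from alignments. The function names are hypothetical.

```python
def count_matches_ungapped(unit, consensus):
    """Count identical bases position by position (indels are not modelled)."""
    return sum(a == b for a, b in zip(unit, consensus))


def ave_match(units, consensus):
    """Mean over all units of 100 * matched bases / unit length."""
    if not units:
        raise ValueError("need at least one unit")
    percents = [100.0 * count_matches_ungapped(u, consensus) / len(u)
                for u in units]
    return sum(percents) / len(percents)
```

With four 50-bp units identical to the consensus this returns 100.0; if every unit has exactly one mismatched base, it returns 98.0.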

yangao07 commented 4 years ago

I think it is feasible to derive a set of subPos that includes as many units as possible. The same applies to the subPos for full-length consensus sequences. I will work on that.

Xinglab / TideHunter

subPos & match score feature request #5