Xinglab / TideHunter

TideHunter: efficient and sensitive tandem repeat detection from noisy long reads using seed-and-chain
https://github.com/yangao07/TideHunter
MIT License

Mismatched number of output unit sequences and copynum #10

Closed Catvick26 closed 3 years ago

Catvick26 commented 3 years ago

Hi Yan,

I observed this weird behavior. Not sure if it is designed on purpose, or maybe I didn't run it in the correct way.

When running with default parameters, output.fasta reports 3 tandem repeat copies. However, when running with the '-u' option, only 2 unit sequences are generated.

Command1: ~/software-install/TideHunter-v1.4.3/bin/TideHunter test.fasta > output.fasta

Header of output.fasta:

test_27786_6145_27755_7240_3.0_88.3_0_13302,20542,27689

(The format looks slightly different from what is described on the GitHub page, but I guess '3.0' represents the copyNum.)

Command2: ~/software-install/TideHunter-v1.4.3/bin/TideHunter -u test.fasta > output2.fasta

Headers of output2.fasta:

test_rep0_sub0
test_rep0_sub1

How can I get the correct number of unit sequences? Thanks!

yangao07 commented 3 years ago

Hi,

The -u option only outputs a unit sequence when it is a full copy of the repeat. The copy number in the FASTA name counts all copies, including the partial copies at both ends.

TideHunter does not always find the repeat unit at the very beginning of a tandem repeat, which is why the number of unit sequences may be smaller than the total copy number.

Yan
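
The explanation above can be checked against the header from Command1 with a rough sketch. It assumes that the trailing comma-separated field (13302,20542,27689) lists the boundaries of the full repeat units; the thread does not confirm the header layout, so treat that reading as a guess. The helper names `split_header` and `full_unit_spans` are hypothetical and not part of TideHunter.

```python
# Hypothetical helpers, not part of TideHunter.
# Assumption: the last "_"-separated field of the header is a comma-separated
# list of unit boundary coordinates. The thread does not confirm this layout.

def split_header(header):
    """Split a header into its prefix and the trailing list of coordinates."""
    prefix, _, positions = header.rpartition("_")
    return prefix, [int(p) for p in positions.split(",")]


def full_unit_spans(positions):
    """Pair consecutive boundaries into (start, end) spans, one per full unit."""
    return list(zip(positions, positions[1:]))


header = "test_27786_6145_27755_7240_3.0_88.3_0_13302,20542,27689"
_, positions = split_header(header)
spans = full_unit_spans(positions)

# Print each span with its length.
for start, end in spans:
    print(start, end, end - start)
```

Under that assumption, three boundaries delimit only two spans, which would line up with the two sequences (sub0 and sub1) written by -u, while the reported 3.0 also counts the partial copies outside those boundaries.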