Xinglab / TideHunter

TideHunter: efficient and sensitive tandem repeat detection from noisy long reads using seed-and-chain
https://github.com/yangao07/TideHunter
MIT License

subPos explanation? #3

Closed by davmeleuterio 4 years ago

davmeleuterio commented 4 years ago

Hello,

There is a column in the tabular output format whose explanation I can't quite understand: subPos. The documentation says: "Start coordinate of each tandem repeat unit sequence, followed by one end coordinate of the last tandem repeat unit sequence, separated by ",", all coordinates are 1-based." I don't understand the part I've put in bold. Here is an example from my data that also confuses me, because of the last subPos value:

readName: 0b034307-8d13-47d7-8ee1-c21310a38963_runid=1
consN: cons1
readLen: 263
start: 101
end: 230
consLen: 30
copyNum: 4.0
fullLen: 0
subPos: 134,164,197,32622
consensus: TCTCTCTCTCTCTCTCTCTCTTTCTCTCTC

Thank you for your attention, Daniel
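For readers who want to screen their own tabular output for the same symptom, here is a minimal sketch of a sanity check. It relies only on the documented statement that subPos coordinates are 1-based positions, and it assumes they refer to the read, so none of them should exceed readLen. The function name `out_of_range_subpos` is hypothetical and not part of TideHunter.

```python
def out_of_range_subpos(sub_pos, read_len):
    """Return the subPos values that cannot lie on a read of length read_len.

    Assumption: subPos coordinates are 1-based positions on the read,
    so every valid value falls in the range 1..read_len.
    """
    coords = [int(value) for value in sub_pos.split(",")]
    return [c for c in coords if c < 1 or c > read_len]


# The row reported above: readLen is 263, subPos is 134,164,197,32622.
print(out_of_range_subpos("134,164,197,32622", 263))  # [32622]
```

A non-empty result only says that a coordinate is impossible for that read; it does not explain where the value came from.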

yangao07 commented 4 years ago

My gut feeling is that there might be an overflow in the last number, but I cannot find it in the code right now. Would you mind sharing that sequence here?

Thanks!

davmeleuterio commented 4 years ago

Sure, here is the sequence:

0b034307-8d13-47d7-8ee1-c21310a38963_runid=1
GAACTCTCTCTCTCTCTCTCGTCTCTCTCTCTCTCTCTCTCTCTCTCTACTCTCTCTCTC
TCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTTTCTCACTCTTTCTCGCTCTCTCAAAA
CTCGCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTTTCTCTCTCTCTCTTTCTCTCTCTCT
CTTTCTCTCTCTTCTCTGCTCTTTCTCTCTTTCTCTCTTAACTCTCTCTCTCTCTCGCTC
TCTCTCTCTCTCTCTCTCTCTTT

I also noticed, now looking at the sequence length (267), that it doesn't match the output readLen (263). Is there any reason why this is happening?

yangao07 commented 4 years ago

The length of the sequence you pasted here is 263, not 267. Did you count the newlines?
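The discrepancy is easy to reproduce: a 263-base sequence wrapped over five lines has four internal newlines, which would give 267 if every character is counted. Below is a small sketch of a length count that skips the header line and ignores line breaks. The helper name `sequence_length` and the toy record are illustrative assumptions, not taken from TideHunter.

```python
def sequence_length(fasta_text):
    """Count the bases in a single FASTA record.

    Header lines (starting with '>') are skipped and surrounding
    whitespace, including newlines, is not counted.
    """
    lines = fasta_text.splitlines()
    return sum(len(line.strip()) for line in lines if not line.startswith(">"))


# Toy record (hypothetical): 12 bases wrapped over two lines.
record = ">toy_read\nACGTACGT\nACGT\n"

# Naive count of everything after the header includes two newlines.
print(len(record.split("\n", 1)[1]))  # 14

# Counting only the bases gives the real sequence length.
print(sequence_length(record))  # 12
```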

I ran TideHunter-v1.2.1 with default parameters; the output is:

$ TideHunter test.fa
>1_cons0_263_4_103_30_3.3_0_12,43,73
CTCTCTCTCTCTCTCTCTCTCTCTCTCTCT
>1_cons1_263_101_230_30_4.0_0_134,164,196
TCTCTCTCTCTCTCTCTCTCTTTCTCTCTC
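The FASTA headers above appear to pack the same fields as the tabular columns, joined by underscores. The following sketch splits such a header back into named fields. The field order is an assumption taken from the labels in the tabular example earlier in this thread, and `parse_header` is a hypothetical helper, not part of TideHunter. The split is done from the right because read names can themselves contain underscores.

```python
# Assumed field order, taken from the tabular column labels in this thread.
FIELDS = ["readName", "consN", "readLen", "start", "end",
          "consLen", "copyNum", "fullLen", "subPos"]


def parse_header(header):
    """Split an underscore-joined FASTA header into a dict of named fields."""
    # Split from the right so underscores inside the read name are kept.
    parts = header.lstrip(">").rsplit("_", len(FIELDS) - 1)
    if len(parts) != len(FIELDS):
        raise ValueError("unexpected header layout: " + header)
    record = dict(zip(FIELDS, parts))
    for key in ("readLen", "start", "end", "consLen", "fullLen"):
        record[key] = int(record[key])
    record["copyNum"] = float(record["copyNum"])
    record["subPos"] = [int(value) for value in record["subPos"].split(",")]
    return record


rec = parse_header(">1_cons1_263_101_230_30_4.0_0_134,164,196")
print(rec["readLen"], rec["start"], rec["end"], rec["subPos"])
```

With the fields parsed, each subPos value can be compared directly against readLen.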

Could you also paste your version and the running command here?

davmeleuterio commented 4 years ago

Sorry for the late response. I checked, and the counter I was using was counting newlines, so it gave me a different length. Sorry about that. I ran TideHunter with the following parameters:

Tidehunter -f 2 -t 3 test.fa > test.out

The output looks like this:

0b034307-8d13-47d7-8ee1-c21310a38963_runid=1 cons0 263 4 103 30 3.3 0 12,43,74,32575 CTCTCTCTCTCTCTCTCTCTCTCTCTCTCT
0b034307-8d13-47d7-8ee1-c21310a38963_runid=1 cons1 263 101 230 30 4.0 0 134,164,197,33 TCTCTCTCTCTCTCTCTCTCTTTCTCTCTC

A coordinate like 32575 keeps appearing, and each time I run the code it changes to a similar value, such as 32731, 32723,...

When I ran the code with FASTA output, the result looked just like yours, so maybe it is something related to the tabular output?

Thank you for your attention.

yangao07 commented 4 years ago

Hi Daniel,

Thanks! This is a bug in the tabular output. Sorry about the inconvenience. It is fixed in the latest release, v1.2.2. Please try it again.

Yan

davmeleuterio commented 4 years ago

Thank you, that solved the problem.