Xinglab / TideHunter

TideHunter: efficient and sensitive tandem repeat detection from noisy long reads using seed-and-chain
https://github.com/yangao07/TideHunter
MIT License

Unallowed quality scores in fastq output format #12

Closed da-i closed 2 years ago

da-i commented 3 years ago

Hi Xinglab and @yangao07

Thanks for making TideHunter, awesome idea!

I've run into an issue with TideHunter's FASTQ output format: the quality line of the output appears to contain whitespace or other non-printable characters.

I've created a Dockerfile for reproducibility:

FROM centos:8.4.2105

RUN dnf install wget make gcc gcc-c++ zlib-devel -y

RUN wget https://github.com/yangao07/TideHunter/releases/download/v1.5.1/TideHunter-v1.5.1.tar.gz
RUN tar -zxf TideHunter-v1.5.1.tar.gz 
WORKDIR /TideHunter-v1.5.1
RUN make
# add all the spelling cases
RUN ln -s /TideHunter-v1.5.1/bin/TideHunter /bin/tidehunter
RUN ln -s /TideHunter-v1.5.1/bin/TideHunter /bin/Tidehunter
RUN ln -s /TideHunter-v1.5.1/bin/TideHunter /bin/TideHunter

Within the container:

[root@c1c0c9805314 TideHunter-v1.5.1]# tidehunter --version
1.5.1
[root@c1c0c9805314 TideHunter-v1.5.1]# cat /etc/os-release
NAME="CentOS Linux"
VERSION="8"
ID="centos"
...

The input for the first example, which works well, is:

@readwith5copies
ATAATAGGATCCCTTTTCCTGGAGCTGCCTTTAGGTAATGTAGTATCTNATNGACTGNCNCCANANGGCTAAAGTATAATAGGATCCCTTTTCCTGGAGCTGCCTTTAGGTAATGTAGTATCTNATNGACTGNCNCCANANGGCTAAAGTATAATAGGATCCCTTTTCCTGGAGCTGCCTTTAGGTAATGTAGTATCTNATNGACTGNCNCCANANGGCTAAAGTATAATAGGATCCCTTTTCCTGGAGCTGCCTTTAGGTAATGTAGTATCTNATNGACTGNCNCCANANGGCTAAAGTATAATAGGATCCCTTTTCCTGGAGCTGCCTTTAGGTAATGTAGTATCTNATNGACTGNCNCCANANGGCTAAAGT
+
AAAFFJJJJJJJJJJJJJJJJJFJJFJJJJJFJJJJJJJJJJJJJJJJ#FJ#JJJJF#F#FJJ#F#JJJFJJJJJAAAFFJJJJJJJJJJJJJJJJJFJJFJJJJJFJJJJJJJJJJJJJJJJ#FJ#JJJJF#F#FJJ#F#JJJFJJJJJAAAFFJJJJJJJJJJJJJJJJJFJJFJJJJJFJJJJJJJJJJJJJJJJ#FJ#JJJJF#F#FJJ#F#JJJFJJJJJAAAFFJJJJJJJJJJJJJJJJJFJJFJJJJJFJJJJJJJJJJJJJJJJ#FJ#JJJJF#F#FJJ#F#JJJFJJJJJAAAFFJJJJJJJJJJJJJJJJJFJJFJJJJJFJJJJJJJJJJJJJJJJ#FJ#JJJJF#F#FJJ#F#JJJFJJJJJ

The command:

$ tidehunter -f 3 /data/testfqth.fastq

The output is:

@readwith5copies_rep0_5.0 375_1_375_69_89.3_0_9,84,159,234,308
ATCCCTTTTCCTGGAGCTGCCTTTAGGTAATGTAGTATCTATGACTGCCCAAGGCTAAAGTATAATAGG
+K00188_readwith5copies_rep0_5.0 375_1_375_69_89.3_0_9,84,159,234,308
]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]
[main] Real time: 0.022 sec; CPU: 0.008 sec; Peak RSS: 0.009 GB

All fine here, as I said before.

But with the dataset below:

@somemorerandomread
TTATGCTTCGTTCAGTTACGTATTGCTCATCTTGTTGAGGGCCTCACAAGCTCCGTCATGTGCTGTGACTGCTTGTAGATGGCCATGGCGCGGACGCGGGTGCCCGGGCGGAGGTGTGGAATCAACCCTGGCTACAGGGGCAGGTCTTGGCCAGTTGGCAAAGCATCTTGTGGTAGAGGCCTCCCAGCCTCCGTCATGTGCTGTGACTGCTTGTAGATGGCCATGGCGCGGACGCGGGTGCCGGGCAGGGTGTGGAATCGACCCACAGCTGCACAGGAGCAGGTCTTGGCGGGTGGCAAAACATCTTGTTTGAAGGCCTCTACAACCTCTGTCATGTGCTGTGACTGCTTGTAGATGGCCATGGCGCGGACGCGGGTGCCGGGCGGGGGTGTGGAATCAACCCACAGCTGCACAGGGCAGGTCTTGGCCAGTTGGCAAGCAATACGTAACTTG
+
$&&(&//2698;:7;;111--.)+//266624566893906;<>9)42368>>:?><0:7?66856998752480(;:>>;:=<5:;=9<;:;;902-(6747<>9:@@')9187;499751.*,:<<3766.'''23462.474568*,886BIG766.0/$399580.$&$)&&+-+222..&4%14::8998048>?>@?:<=:97568-(+46488;=@?@=:67<<;<664.02259*1(%&,+/43543/**7)*422%&)#+2053*6$%$('*,11113122*(%3456897'3::6:8306-*000,5387%+.9:9:A@6<<<=?<>>:>;@==9><9:>?<:<>A9726457:C=:BC;74<&(72443./@;<;4*6:==?CC@>7,-88923;1<644/65<:745=3865631)33788--,&.*--$$(1.-+)###%

we get:

$ tidehunter -f 3 /data/testfqth2.fastq
@somemorerandomread_rep0_3.0 453_28_438_138_93.8_0_59,197,334
TGTGCTGTGACTGCTTGTAGATGGCCATGGCGCGGACGCGGGTGCCCGGGCGGAGGTGTGGAATCAACCCTGGCTACAGGGGCAGGTCTTGGCCAGTTGGCAAAGCATCTTGTGGTAGAGGCCTCCCAGCCTCCGTCA
+somemorerandomread_rep0_3.0 453_28_438_138_93.8_0_59,197,334
�������
[main] Real time: 0.008 sec; CPU: 0.008 sec; Peak RSS: 0.008 GB

Is this a bug?

I've also noticed that the length of the quality string sometimes does not match the length of the consensus sequence, but I think that is related to the characters mentioned above.
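For anyone who wants to check their own output for both symptoms, here is a minimal sketch. It assumes unwrapped four-line FASTQ records and Phred+33 quality characters in the printable ASCII range '!' to '~'. The function name check_fastq_records is hypothetical and not part of TideHunter. The file is read as bytes because the stray characters may not be valid UTF-8.

```python
import sys


def check_fastq_records(data: bytes):
    """Return (read_name, problem) tuples for unwrapped 4-line FASTQ records.

    Assumption: quality characters are Phred+33, i.e. bytes 33 ('!') to 126 ('~').
    """
    problems = []
    lines = data.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # drop the empty piece left by a trailing newline
    if len(lines) % 4 != 0:
        problems.append(("<file>", "line count is not a multiple of 4"))

    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        name = header.lstrip(b"@").split(b" ")[0].decode("ascii", "replace")
        if not header.startswith(b"@"):
            problems.append((name, "header does not start with '@'"))
        if not plus.startswith(b"+"):
            problems.append((name, "separator line does not start with '+'"))
        if len(qual) != len(seq):
            problems.append(
                (name, f"quality length {len(qual)} != sequence length {len(seq)}")
            )
        bad = [b for b in qual if not 33 <= b <= 126]
        if bad:
            problems.append(
                (name, f"{len(bad)} quality byte(s) outside printable range '!'..'~'")
            )
    return problems


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], "rb") as handle:
        for read_name, problem in check_fastq_records(handle.read()):
            print(f"{read_name}\t{problem}")
```

Saved under any file name, the script takes the FASTQ path as its only argument and prints one tab-separated line per problem found. A clean file prints nothing.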

da-i commented 3 years ago

Hi @yangao07, I saw that v1.5.2 was released. Is that related to this issue?

yangao07 commented 3 years ago

Yes, I fixed this bug in v1.5.2. Strange, I remember replying to you in this thread several days ago. Anyway, please try out the new version.

Yan

da-i commented 3 years ago

No problem at all. I've tested v1.5.2 and all looks good! Thanks for fixing this issue. As a side note, are you planning to add the quality scores to the table output at some point? The flexibility the table output provides is amazing!

yangao07 commented 3 years ago

That should be an easy update. Will do that later.

yangao07 commented 2 years ago

I just added the "Tabular with quality score" output format (-f4) in TideHunter v1.5.3.