DecodeGenetics / Ratatosk

Hybrid error correction of long reads using colored de Bruijn graphs
BSD 2-Clause "Simplified" License
94 stars 7 forks source link

Found 0 unique k-mers #9

Closed sbultmann closed 4 years ago

sbultmann commented 4 years ago

Hi, I am runng Ratatosk with Ratatosk -v -c 16 -s another.fastq.gz -l ../Nanopore/all_reads.fq -o all_reads_Ratatosk my another.fastq.gz is a sample of the first 1000 reads but the problem is the exact same when I run it with the full data set.

Here is the output of Ratatosk:

Ratatosk::Ratatosk(): Building graph from short reads (1/2).
KmerStream::KmerStream(): Start computing k-mer cardinality estimations
CompactedDBG::build(): Estimated number of k-mers occurring at least once: 1
CompactedDBG::build(): Estimated number of minimizer occurring at least once: 1
CompactedDBG::build(): Estimated number of k-mers occurring twice or more: 1
CompactedDBG::build(): Estimated number of minimizers occurring twice or more: 1
CompactedDBG::filter(): Closed all fasta/fastq files
CompactedDBG::filter(): Processed 0 k-mers in 10000 reads
CompactedDBG::filter(): Found 0 unique k-mers
CompactedDBG::filter(): Number of blocks in Bloom filter is 1
CompactedDBG::construct(): Extract approximate unitigs
CompactedDBG::construct(): Closed all input files

CompactedDBG::construct(): Splitting unitigs (1/2)

CompactedDBG::construct(): Splitting unitigs (2/2)
CompactedDBG::construct(): Before split: 0 unitigs
CompactedDBG::construct(): After split (1/2): 0 unitigs
CompactedDBG::construct(): After split (2/2): 0 unitigs
CompactedDBG::construct(): Unitigs split: 0
CompactedDBG::construct(): Unitigs deleted: 0

CompactedDBG::construct(): Joining unitigs
CompactedDBG::construct(): After join: 0 unitigs
CompactedDBG::construct(): Joined 0 unitigs

CompactedDBG::write(): Writing graph to disk
Ratatosk::Ratatosk(): Graph is empty, no correction can be done. Output uncorrected reads.

My short read fastq file looks like this:

@L183:596:CE9G3ANXX:3:2201:1981:2071 1:N:0:0
NTAGTGACCCTATGTACAAGAGCCCTAAACACTGCTTTATCCTGTGCCAA
+
#=A<BFGEGGGGGFGGG1F1=FGGGGGGGGGGDDGGFDEFFEGGGGGGGG
@L183:596:CE9G3ANXX:3:2201:1816:2077 1:N:0:0
GTGTGCTTGTAAGGATGAGGGGCACCCTAGGTTTCCATCATGTGGAGAGG
+
CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
@L183:596:CE9G3ANXX:3:2201:1787:2079 1:N:0:0
ACTTTTCACTTATATTTATCCCAGTTGCCCTATTTGTTACATGATCAATT
+
A=BB=@FFBFC>FGGEGCE@;CFGBD111=C@FFF1>CCFGGEGGCEG>F
@L183:596:CE9G3ANXX:3:2201:1990:2084 1:N:0:0
ATATTGAGCTTCTCCATCAAGTCTTTCCACAAATGTAGTCGTTCTGACTC
+
AB@AB1GGGCGGGGGGGGGCCGG>DFGGGGGG>FDGD1:1CFGGGGCFGG
@L183:596:CE9G3ANXX:3:2201:1763:2084 1:N:0:0
CACTAGTATTACACACAAGAAAAATCTTGCTGAAGATTATGCAAAAATGA
+

the nanopore reads look like this:

@4723f985-8ecd-4a38-87d8-7e54139cf343 runid=9df1038c4d6eb87eb4f188c7c62f93cbdf04ba22 read=12 ch=2701 start_time=2020-07-22T14:40:58Z flow_cell_id=PAE88064 protocol_group_id=20200722_DNA_Bultmann_Ruesselspringer_fb sample_id=no_sample
TGTGCTTCGTTCAGTTACGTGTGCTTGTAAGAGACAGTATTGAATAATAGTGAAGAAGTTGTGAGAGCAACATTTTAGTACATTATTGTAAATTTTATGATATTTATAAAGTTATATATTAATAAGAAGCAACATATGAAGTAACAGCCTATGGTGATGAAGGAAAAGTTCAGGCCTGCTCTGAAAAAATTGGCTAAAAATATAAGGCCCCAGGAATTGATGGTATACCAATTCACATGTTCAATTAATAGATGCTGCCTAGATATGCCCACCAAGTTGTGCCAGAAATCTGGAGAACAACCACTTGTATCTGACTGGAAGAGATCTATGTTGTTCCCATTCCAAGAAAGTGATCATTACAGAGTGTGGAAATTATCAAACAATATCATTAATATCACGTGCAACCAAGATTTTGCTGAAGATTATCAGAAATGACTGGAACTACCAACAGAAAACTACAGAAGTGCAGGCTGGATTTAGACTGGGCCATAGTGCCTAGGGATATTATGACTAATGGTTGACGGATCTTGGCTGAAAGCAAGAATTACCAAAACAATGTCTACCTCTTTCATTGATTATGTAAAGGCTTTAGACTGTGTGGATCATAACAAACTCACTAACTCTGAGAAGAATGGGAATCCCAGAATGCCATGATTGTGCTAATGTAAAACCTATACTTGGACCAGGAGGCAGTCGTCAGAACAGAACAGGAAGCACTTTATGTTTCAAAATTGGGAATGGTATATGCCAGGGCTGTATTCTATCTCCATACTTTTCAACCTTTATGCAGGTTATCTGGGAGTCTGGGATCTAGCAATACGTAGCA
+
(&+$--2--$9$'4985*'-($)67732720+2/..2=>>:<>/+;+',+;489&'((@>:5+'$&06>5<D@AB4/2114867456)')&58<;4)302/-***33A=28:>87999A>A(%%*$3914:3792%',,-25689;$68@A:;;.,+-+,-143+*,2;3/**)%*+))23:<@?E=>99:7200),77-)+%50-.,,,*10'1>>>?<;::<967;<<?CA0*$+,(721/144655:.==<9;3%7032;+&'',,+--.221;5-3....(7>=?3..031>:;A945*$$&&#&$&+,069474/.18;98=3&&=>8>B;<BAA;36<5+3./765464$$'').*.66642*/01128?>==2780:;@B<=>>78;=9:026459(&43/+,.01)0366:7=;9624659'G@66:EA=>=C==*99?203.)13+&%%&%$+/51+(++2***64=AB1267654511,')2,(')++/77645893*)%+*(&0),4040-.4:86;92598738=113-/&(&(17877:08<&&&37231,3*=BC=>A@87>C:77258117333'09;9<<64342&&%'$+-9;;?;;93&8<;;<861-238<@>771//6374,.3A?3/')**'%62+40-.0851--07)357)%%6<:72+()*+*%/00039/-++0.,(%,((&**)9896::;=>:0:>AA=.A?A88845387558:1//**'()-478956:>>:999412)&&&23:67;>7787842'&&&&&;<;=-(()-.8=741026755&&&6-**('(($#&
@9e53c1ac-4e06-41d0-97ec-f4312564af9d runid=9df1038c4d6eb87eb4f188c7c62f93cbdf04ba22 read=11 ch=1809 start_time=2020-07-22T14:40:58Z flow_cell_id=PAE88064 protocol_group_id=20200722_DNA_Bultmann_Ruesselspringer_fb sample_id=no_sample
AATGTACTTCGTTCATTTGTATTGCTGGTGATTTTTATGCTCATAAAATTTTCAGAAGCACTTTAGAGGTAAGTGTCATCTCTAAAGGTAGAAATTATTGTCCCATATTATTGTTGTTAAGTACCATCAGTCACAATCACCCAAACTGATCCTCCAGAAAGGTTTGTCTGAGGCTGTGCCCTTACTGAAGGAGATTGTCATCCTTCTCCCAGGGACCATAATGGGTTGGTTGTCTTTATGGATTACCGTGAAGTTCTTACCACTAGGCCCCCAGAGCTTTCTAAAAAGATATTAGCCAATTTAATGAATTTTAATGACAATAGATAATACCTATTGTTTCTCTCCTGCATGCAATATTCATGTGACTTTTACTTTATAATGGAGACATGACTGTAAACATTCATTTTGAGATAATAGGAAACACAAGTGACTGAATATTTCTACA
+
(&,+**39=;/=<)&+.)(-./10<;/1889BCE?1.88;?8.8E<=8;=;<<;)99*,+&&-$%$%')))&(5+63$22*9=@?+'1,,,$$$#%-354<=1,:8889++/2AB:?@A8.,.,,*..'1:6;;>9994669;9(##()978B9854933-:;=B9;:93/4:=314/++&%&&&667<91&.%%(598/=99>@;==<8,+27923(%$&$$&%0+%&$)%&$&((#'$*>@@641287657BCJEC3:<;;2.1=>?1278*72DNJH?BGLM=7@9558:3534;39:53768=EBDB<=?<<<;>=:>:??A>9:<=5255&.8.(%(9965337%$3:99:-955<7<==6=A?59;=<<6<>4=9@:@A<B@@<@D7>@E@=DAB@@??8(,48>>@;79)+3?;6>?BB;:;@?<>@AAACDGB9-+'
@ea593a53-1da1-4890-9311-abb659f51a7a runid=9df1038c4d6eb87eb4f188c7c62f93cbdf04ba22 read=9 ch=616 start_time=2020-07-22T14:40:58Z flow_cell_id=PAE88064 protocol_group_id=20200722_DNA_Bultmann_Ruesselspringer_fb sample_id=no_sample
AGCACCGTTCCAGTTACATGTGCTCACTTTCAGCCAGTTATCCCACTCCAGAGACCTGAGTTTGGTGTATTGCTCATTTTGTTTGTCCCTGATTCAAGGCGGATTTGCAGACGTTTGCCTTTTAATATTTTCCCAGACTCTGTGTCTCAGTCATTCACTGCCTTTTTTACACATCTTCTTGTCTCGAAGGGTTCTCACCTGTACATGCATCGTTCCAAACCACCATGAATCCGAGTTGTCTTGATCCTTCCGTGTGGCCATTTCGAGAGCATCTCCTGTTACCCACAGCTCATGAGTTCCTGTCTCCATGTCCAGGAGGAGGAGCTAAGCAATATGTACGTAATAA

Any idea what could be going on?
Thanks!

GuillaumeHolley commented 4 years ago

Hi @sbultmann,

The problem here is that your input short reads are too short. More specifically, Ratatosk builds a first compacted de Bruijn graph with the 63-mers of the input short reads and then, it uses the unitigs of that graph to build a second compacted de Bruijn graph with 31-mers. Since your short reads are 51 bp long, the first graph ends up being empty (because none of the reads contains a 63-mer) and so is the second graph. And if it empty, no correction is possible. I've never seen reads like this, may I ask how you ended up having such short short reads?

I was thinking to offer an option to change the k-mer size in Ratatosk, although I don't advise to do it in the general case. But in your use case, you could set k=50 and provided that you get a "high" coverage for your short reads, it might just work.

sbultmann commented 4 years ago

Hi @GuillaumeHolley,

hmm. You are right these are quite short but I thinks thats only the first few. this is the general overview about the nanopore seq run:

General summary:        
Mean read length:                 5,244.0
Mean read quality:                   12.0
Median read length:               2,843.0
Median read quality:                 12.2
Number of reads:              7,227,909.0
Read length N50:                  9,831.0
Total bases:             37,903,197,442.0

do I have to remove the short reads?

the short reads I use for the option -s are standard 50bp PE illumina reads.

GuillaumeHolley commented 4 years ago

Hi @sbultmann,

Your Nanopore run stats look just fine but the long reads are not the issue here, the short reads are. And Ratatosk won't work without short reads in input because it was specifically designed for hybrid correction. What's your short read coverage? I've never seen 50 bp Illumina reads before, not for DNA sequencing at least. However, 50 bp and 75 bp PE Illumina reads are more common for RNA sequencing as far as I know. Is it the case here? Because if it is, some assumptions that Ratatosk are making might just not work for your short reads anyway.

sbultmann commented 4 years ago

Hi @GuillaumeHolley,

thanks for your quick reply. I wasn't aware that you need at least 63pb long reads. Is this mentioned in your biorxiv? This is DNA-seq and we frequently use 50pb PE for this. Is there a reason why it would work with 50bp? longer reads are helpful to increase coverage of course but are there any reasons why 50 kmers are a bad idea. I have around 440 mio paired reads which should equate roughly to a 10fold coverage.

thanks for your help!

GuillaumeHolley commented 4 years ago

Hi @sbultmann ,

If you look at the preprint in Appendix A, we have all the default parameters of Ratatosk. In this section, you can see that the k-mer size k2 is 63 which implicitly requires that you reads are at least 63 bp long. Now, as I said before, I can provide an option in Ratatosk to change the k-mer size k2 and you could set it to 50, meaning the k-mer size will be your read length. This will make sure the graph is built and the correction runs. However given that problem and the 10x coverage, I'm not sure how will perform the correction.

GuillaumeHolley commented 4 years ago

Hey @sbultmann,

I am closing the issue. I think right now, there is not much that can be done with your short read length and coverage. Let me know if you are interested by the k-mer size option.

Guillaume

GuillaumeHolley commented 4 years ago

Just in case, the k-mer size option as been implemented (see README).