Closed sbultmann closed 4 years ago
Hi @sbultmann,
The problem here is that your input short reads are too short. More specifically, Ratatosk builds a first compacted de Bruijn graph from the 63-mers of the input short reads, and then uses the unitigs of that graph to build a second compacted de Bruijn graph with 31-mers. Since your short reads are 51 bp long, the first graph ends up being empty (because none of the reads contains a 63-mer), and so is the second graph. And if the graph is empty, no correction is possible. I've never seen reads like this. May I ask how you ended up with such short short reads?
I was thinking of offering an option to change the k-mer size in Ratatosk, although I don't advise changing it in the general case. But in your use case, you could set k=50 and, provided that you have "high" coverage for your short reads, it might just work.
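To make the arithmetic above concrete, here is a minimal Python sketch (not Ratatosk code; the function name `count_kmers` is made up for illustration). A read of length L contains max(0, L - k + 1) k-mers, so a 51 bp read contributes nothing to a graph of 63-mers. The 151 bp read length in the last line is only an assumed example of a longer Illumina read.

```python
def count_kmers(read_length: int, k: int) -> int:
    """Return how many k-mers a read of this length contains.

    A read shorter than k contains no k-mer at all.
    """
    return max(0, read_length - k + 1)


# 51 bp reads with the default k=63: no k-mer, hence an empty graph.
print(count_kmers(51, 63))   # 0
# Same reads with k=50: each read yields two k-mers.
print(count_kmers(51, 50))   # 2
# Assumed example of a longer read (151 bp) with k=63.
print(count_kmers(151, 63))  # 89
```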
Hi @GuillaumeHolley,
Hmm. You are right, these are quite short, but I think that's only the first few. This is the general overview of the Nanopore sequencing run:
General summary:
Mean read length: 5,244.0
Mean read quality: 12.0
Median read length: 2,843.0
Median read quality: 12.2
Number of reads: 7,227,909.0
Read length N50: 9,831.0
Total bases: 37,903,197,442.0
Do I have to remove the short reads?
The short reads I use for the option -s are standard 50 bp PE Illumina reads.
Hi @sbultmann,
Your Nanopore run stats look just fine, but the long reads are not the issue here, the short reads are. And Ratatosk won't work without short reads as input because it was specifically designed for hybrid correction. What's your short read coverage? I've never seen 50 bp Illumina reads before, not for DNA sequencing at least. However, 50 bp and 75 bp PE Illumina reads are more common for RNA sequencing as far as I know. Is that the case here? Because if it is, some of the assumptions that Ratatosk makes might just not hold for your short reads anyway.
Hi @GuillaumeHolley,
Thanks for your quick reply. I wasn't aware that you need reads that are at least 63 bp long. Is this mentioned in your bioRxiv preprint? This is DNA-seq and we frequently use 50 bp PE for this. Is there a reason why it wouldn't work with 50 bp? Longer reads are helpful to increase coverage of course, but are there any reasons why 50-mers are a bad idea? I have around 440 million paired reads, which should equate to roughly 10-fold coverage.
thanks for your help!
Hi @sbultmann ,
If you look at Appendix A of the preprint, we list all the default parameters of Ratatosk. In that section, you can see that the k-mer size k2 is 63, which implicitly requires that your reads are at least 63 bp long. Now, as I said before, I can provide an option in Ratatosk to change the k-mer size k2 and you could set it to 50, meaning the k-mer size will be your read length. This will make sure the graph is built and the correction runs. However, given that problem and the 10x coverage, I'm not sure how well the correction will perform.
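One way to see why the 10x coverage is a concern when k equals the read length is the usual approximation for k-mer coverage, C_k = C * (L - k + 1) / L, where C is the read coverage and L the read length. The sketch below is only that back-of-the-envelope formula, assuming uniform read length and ignoring sequencing errors; it is not something Ratatosk computes, and the function name `kmer_coverage` is made up for illustration.

```python
def kmer_coverage(read_coverage: float, read_length: int, k: int) -> float:
    """Approximate k-mer coverage from read coverage.

    Assumes all reads have the same length and contain no errors.
    """
    if k > read_length:
        return 0.0  # reads shorter than k contain no k-mer
    return read_coverage * (read_length - k + 1) / read_length


# 10x coverage of 50 bp reads with k=50: one k-mer per read.
print(kmer_coverage(10, 50, 50))
# Same data with k=31 for comparison: 20 k-mers per read.
print(kmer_coverage(10, 50, 31))
```

Under these assumptions, k=50 on 50 bp reads gives a k-mer coverage well below 1x, so most genomic 50-mers would not be present in the graph at all.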
Hey @sbultmann,
I am closing the issue. I think right now, there is not much that can be done with your short read length and coverage. Let me know if you are interested in the k-mer size option.
Guillaume
Just in case, the k-mer size option has been implemented (see README).
Hi, I am running Ratatosk with
Ratatosk -v -c 16 -s another.fastq.gz -l ../Nanopore/all_reads.fq -o all_reads_Ratatosk
My another.fastq.gz is a sample of the first 1000 reads, but the problem is exactly the same when I run it with the full data set.
Here is the output of Ratatosk:
My short read fastq file looks like this:
the nanopore reads look like this:
Any idea what could be going on?
Thanks!
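For anyone hitting the same symptom, a quick check is to look at the read length distribution of the short read file and compare it with the k-mer size. Below is a minimal Python sketch, assuming a standard 4-line-per-record FASTQ (no wrapped sequence lines). The function names are hypothetical and none of this is part of Ratatosk.

```python
import gzip
from collections import Counter


def read_length_counts(lines):
    """Count sequence lengths in a FASTQ stream (4 lines per record assumed)."""
    counts = Counter()
    for i, line in enumerate(lines):
        if i % 4 == 1:  # second line of each record holds the sequence
            counts[len(line.strip())] += 1
    return counts


def fraction_at_least(counts, k):
    """Fraction of reads that are at least k bp long."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for length, n in counts.items() if length >= k) / total


def summarize(path, k=63):
    """Print the length histogram of a (possibly gzipped) FASTQ file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        counts = read_length_counts(handle)
    print(sorted(counts.items()))
    print(f"fraction of reads >= {k} bp: {fraction_at_least(counts, k):.3f}")
```

Calling `summarize("another.fastq.gz")` on the sample file would show whether any read is long enough to contain a 63-mer; a fraction of 0 means the first graph cannot contain anything.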