DecodeGenetics / Ratatosk

Hybrid error correction of long reads using colored de Bruijn graphs
BSD 2-Clause "Simplified" License

Is it possible to add option to split uncorrected regions instead of "trimming" them? #6

Closed jelber2 closed 4 years ago

jelber2 commented 4 years ago

Hi,

I have to confess that I have only used Ratatosk once, but I was thinking about the "trimming" option. GraphAligner (https://github.com/maickrau/GraphAligner) offers something where, instead of trimming off the low-quality or uncorrected bases/regions, it can "clip" the long reads into parts (if I understand it correctly, this is the GraphAligner option --corrected-clipped-out as opposed to --corrected-out; I think --corrected-out is equivalent to the "trimming" option of Ratatosk).

GuillaumeHolley commented 4 years ago

Hi @jelber2,

Could you be more specific about what the clipping is? I couldn't find any description of the option --corrected-clipped-out on the GitHub repo of GraphAligner. My guess is that it replaces the uncorrected bases with Ns, or uses upper/lower case characters to distinguish corrected from uncorrected bases.

jelber2 commented 4 years ago

Yes, you are correct that the current README.md (https://github.com/maickrau/GraphAligner/blob/7644b5dd19788501e7de855b8eb9fd5f52e1359c/README.md) doesn't have this information. I saw it at https://www.biorxiv.org/content/10.1101/810812v1.full.pdf+html

Our pipeline has two modes:
either we output the full reads, keeping uncorrected areas as is; or clipped reads,
which remove the uncorrected areas and split the read into multiple corrected sub-
reads, if needed.

So the first part is --corrected-out and the second part is --corrected-clipped-out
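As a rough illustration of what the clipped mode described above does, here is a toy sketch. It assumes, hypothetically, that uncorrected stretches are marked in lowercase (which is only a guess in this thread); `clip_read` and `min_len` are made-up names for illustration, not GraphAligner code:

```python
import re

def clip_read(seq, min_len=10):
    """Split a corrected read into sub-reads by dropping uncorrected
    stretches (assumed here to be lowercase) and keeping only the
    corrected pieces that are at least min_len bases long."""
    pieces = re.split(r"[acgtn]+", seq)
    return [p for p in pieces if len(p) >= min_len]

# Two corrected stretches separated by an uncorrected run:
read = "ACGT" * 20 + "acgt" * 5 + "GGCC" * 20
print([len(p) for p in clip_read(read)])  # two sub-reads of 80 bases each
```

The full-read mode would instead return the whole sequence unchanged, uncorrected stretches included.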

In my experience, --corrected-out yields read lengths similar to the original raw long reads but lower quality scores (estimated with yak, https://github.com/lh3/yak), while --corrected-clipped-out yields read lengths shorter than the original raw long reads but higher estimated quality scores than the --corrected-out output.

Also, the unitig length distribution of the graph used for correction greatly influences the read length and quality of either --corrected option in GraphAligner. I have found that if the unitigs are short, one gets shorter and lower-quality corrected reads, especially in --corrected-out mode.

Regarding "masking", I actually think both --corrected modes in GraphAligner softmask some bases, but I don't know what the softmasked bases represent.

GuillaumeHolley commented 4 years ago

Hey @jelber2,

I actually think there was a mix-up in the way I named this option in Ratatosk. If you use the trimming option (currently), it is going to split the reads into multiple smaller reads by removing the uncorrected bases, not just trim the beginning and end of each read. So right now, -t in Ratatosk is the same as --corrected-clipped-out in GraphAligner. I will fix the name of that option in the next Ratatosk release (next week).

Now, I don't advise clipping/splitting the reads after correction in general. As you said, the pro of read clipping is that you get much higher quality reads. However, read N50 decreases, sometimes quite significantly, which might penalize all downstream analyses, especially if you intend to do assembly. Also, uncorrected bases are not necessarily "bad" bases: high-complexity regions and very heterozygous sites are often left uncorrected, and complex SVs might be tricky too. In my experience, the number of uncorrected regions left is much higher in centromeric and telomeric regions than elsewhere. So in my opinion, uncorrected bases are still better than no bases at all.
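To make the N50 point concrete, here is a small back-of-envelope sketch (the read lengths are invented for illustration):

```python
def n50(lengths):
    """Read N50: the length L such that reads of length >= L
    contain at least half of all sequenced bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length

# Ten 10 kb reads, each clipped into a 6 kb and a 3 kb corrected piece
# (1 kb of uncorrected sequence removed per read):
full = [10000] * 10
clipped = [6000, 3000] * 10
print(n50(full), n50(clipped))  # N50 drops from 10000 to 6000
```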

I am also not surprised by your observation that unitig length influences read quality. Shorter unitigs usually go hand in hand with more branching in the graph: the more branching there is, the more complicated it gets to find the right path for correction. Tandem repeats, for example, usually end up as highly branching sub-graphs of short unitigs. However, I am a little more surprised that unitig length influences read length.

jelber2 commented 4 years ago

Well, I could be wrong about the read length part. I haven’t systematically tested that.

GuillaumeHolley commented 4 years ago

I didn't notice anything myself but I could test for it by using different k-mer lengths in Ratatosk. In any case, thank you for notifying me about this naming mistake, I might just push a fix for it this afternoon. I will close the issue after the fix is online :)

jelber2 commented 4 years ago

Is it possible to allow users a choice in different k-mer lengths than the default k=63? Also, have you tried Ratatosk on PacBio CLR reads? Sorry for the additional question.

GuillaumeHolley commented 4 years ago

No worries.

I can provide an option to play with different k-mer sizes. Ratatosk uses two k-mer sizes, one for each of the two correction rounds, such that k2 >= 2 * k1. By default, k1 = 31 and k2 = 63 in Ratatosk. So if you want to change one k-mer size, you will have to change the other as well.
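The constraint between the two sizes can be sketched as a tiny helper. This is a sketch only: `paired_k2` is a made-up name, and picking an odd k2 here simply mirrors the defaults (31, 63) rather than documented Ratatosk behavior:

```python
def paired_k2(k1):
    """Smallest odd k2 satisfying k2 >= 2 * k1
    (odd, to match the style of the defaults k1=31, k2=63)."""
    k2 = 2 * k1
    return k2 + 1 if k2 % 2 == 0 else k2

print(paired_k2(31))  # 63, the default pairing
print(paired_k2(50))  # 101
```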

I never tried Ratatosk on PacBio CLR but I was recently in touch with a couple of people trying out Ratatosk on PacBio reads and they seemed satisfied with their results :)

jelber2 commented 4 years ago

Okay. If say KmerGenie suggests using a k=50 for “optimum” de novo assembly of short reads, would you still recommend using a k1=50 even though k2=101?

GuillaumeHolley commented 4 years ago

I think this is too long. The thing is that the size k used to build the unitigs of the graph is also the size of the anchors between the graph and the long reads before correction. So while a longer k seems like a good idea to increase unitig contiguity, it also means fewer exact k-mer matches between the graph and the long reads because of the high error rate of these reads. Better sequencing chemistries or basecalling algorithms might make that possible in the future. In the meantime, I doubt you will be able to use k1 > 31 without degrading the correction results.
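A back-of-envelope calculation illustrates why longer anchors hurt. Under a crude model where base errors are independent with rate e, a k-mer is error-free (and so can match the graph exactly) with probability (1 - e)^k:

```python
def p_exact_kmer(k, err):
    """Probability that a k-mer from a read with per-base error rate err
    contains no errors (crude independent-errors assumption)."""
    return (1.0 - err) ** k

# At a 10% raw error rate, typical of uncorrected long reads:
print(p_exact_kmer(31, 0.10))  # ~0.038 -> roughly 38 exact 31-mers per kb
print(p_exact_kmer(63, 0.10))  # ~0.0013 -> roughly 1 exact 63-mer per kb
```

So doubling k from 31 to 63 costs roughly a factor of 30 in usable anchors at this error rate, under this toy model.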

jelber2 commented 4 years ago

Okay, thank you. I hadn't thought about the k-mer matches between the graph and the long reads. On that topic: when I simulate contamination with, say, 10000 Drosophila and 10000 human PacBio reads, I generally use k=31-41. That is, programs such as BBDuk.sh (no maximum k) and Seal.sh (maximum k=31) from BBMap/BBTools (https://sourceforge.net/projects/bbmap/) generally have the lowest assignment error rates at k=31 for Seal.sh and k=41 for BBDuk.sh (whether you treat the human reads or the fly reads as the contamination, i.e., whether the simulated reads match the human or the fly reference). So I am just wondering if k=63 is too high, if that is the k value used for exact k-mer matches.

jelber2 commented 4 years ago

Okay, based on #9, it seems that you are using k=31 for exact k-mer matches of the long reads to the graph and k=63 for the de novo assembly of the graph.

GuillaumeHolley commented 4 years ago

Hey @jelber2,

Sorry about the delay, I somehow missed your previous message. Indeed, I use k1=31 for the first correction pass and k2=63 for the second correction pass (graph construction + anchoring). Using 63-mers for anchoring during the second pass is only possible because the first pass has already lowered the error rate quite significantly.