DecodeGenetics / Ratatosk

Hybrid error correction of long reads using colored de Bruijn graphs
BSD 2-Clause "Simplified" License

UMI reads #20

Closed kokyriakidis closed 3 years ago

kokyriakidis commented 3 years ago

Hi! Is there a chance to make it work using Nanopore reads with UMIs?

GuillaumeHolley commented 3 years ago

Hi @kokyriakidis,

Sorry for the delay. Ratatosk does not support UMIs at the moment (at least it wouldn't consider UMI tags in the correction) but it could be supported in a future version. I have no prior experience with UMIs in Nanopore reads, could you reference some public data sets I could have a look at?

On a side note, if the UMIs are used in a transcriptomic context, Ratatosk definitely won't work for your reads (with or without UMIs).

kokyriakidis commented 3 years ago

Hmm, I was thinking about cDNA/dRNA Nanopore data with UMI tags to accurately capture the BCR/TCR receptor sequences (and a way to correct the reads).

Never mind, thanks so much for the info! You can close this issue, and it would be perfect if you could explain in a few sentences why Ratatosk won't work on transcriptomic data, so I can keep it in mind next time!

GuillaumeHolley commented 3 years ago

Hi @kokyriakidis,

To be honest, I have never tried Ratatosk on cDNA/dRNA Nanopore data, so I cannot really say whether it is going to work or not. I think a general issue is that a minimal and maximal coverage is assumed for the Illumina and Nanopore data. To explain this, it is important to know that Ratatosk has 2 correction passes. During the first pass, the graph index is built from Illumina 31-mers and colored with Illumina reads. During the second pass, the graph is built from Illumina 63-mers and colored with the ONT reads corrected during the first pass. During either of those two passes, if a unitig has too many colors (reads mapping to it), the colors are not kept but the unitig stays in the graph. If a unitig has insufficient mean k-mer or color coverage (2), it might be deleted from the graph. Also, when correcting a subread, paths traversing unitigs with very low coverage might just be discarded from the possible (correct) paths. I believe this might be a problem for your application, where coverage can be very variable, right?
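As a rough illustration of why variable coverage matters here, the two coverage rules described above can be sketched as a toy filter. This is not Ratatosk's code: the `Unitig` class, the `filter_unitigs` function and both threshold values are hypothetical, and the sketch always deletes low-coverage unitigs, whereas the real tool only might.

```python
from dataclasses import dataclass, field

# Hypothetical thresholds, chosen for illustration only.
MIN_COVERAGE = 2   # below this mean k-mer coverage, the unitig is removed
MAX_COLORS = 4     # above this number of colors, the colors are dropped


@dataclass
class Unitig:
    name: str
    mean_kmer_coverage: float
    colors: set = field(default_factory=set)  # IDs of reads mapping to this unitig


def filter_unitigs(unitigs, min_coverage=MIN_COVERAGE, max_colors=MAX_COLORS):
    """Toy model of coverage-based filtering (simplified, hypothetical)."""
    kept = []
    for u in unitigs:
        if u.mean_kmer_coverage < min_coverage:
            # Insufficient coverage: the unitig is removed from the graph.
            continue
        if len(u.colors) > max_colors:
            # Too many colors: the colors are discarded, the unitig stays.
            u = Unitig(u.name, u.mean_kmer_coverage, set())
        kept.append(u)
    return kept


# Transcript-like input: coverage spans several orders of magnitude.
unitigs = [
    Unitig("highly_expressed", 500.0, {f"read{i}" for i in range(10)}),
    Unitig("moderate", 20.0, {"read1", "read2"}),
    Unitig("rare_transcript", 1.0, {"read3"}),
]

result = filter_unitigs(unitigs)
for u in result:
    print(u.name, len(u.colors))
```

With this made-up input, the highly expressed unitig stays in the graph but loses its colors, and the rare transcript is removed entirely, so both ends of the expression range lose information that the correction would otherwise rely on.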

On the plus side, the upcoming version of Ratatosk might be better suited, as it doesn't perform any color/unitig/path deletion based on coverage.