DecodeGenetics / Ratatosk

Hybrid error correction of long reads using colored de Bruijn graphs
BSD 2-Clause "Simplified" License

Understanding the -u option (different individuals than -s possible?) #7

Closed jelber2 closed 4 years ago

jelber2 commented 4 years ago

Hi,

I was wondering about the -u option for Ratatosk. The README says:

Ratatosk might use for the correction some unmapped short reads 
(-u unmapped_short_reads.fastq) which are missing in the input subset (-s).

I had an idea about using this option to increase coverage, though it might also be disastrous. Specifically, I have 5 individuals with whole-genome sequencing data (15-30x coverage, 150-bp paired-end Illumina reads), and I thought I might use the -u option to help with correction. My logic was to use the individual with ~30x coverage for the -s input and another individual (or maybe the other 4) for the -u input. What makes things more complicated is that the long reads (PacBio CLR, ~14x coverage) actually come from a sixth individual. Ultimately, I am the bioinformatician and not the designer of this experiment. I am aware that even if I could correct the PacBio CLR reads to Q40, ~14x coverage is perhaps not enough for a good assembly, but I am willing to give different assemblers (Peregrine, Hifiasm, etc.) a try.

Any thoughts/insight would be greatly appreciated.

GuillaumeHolley commented 4 years ago

That's a risky experiment and I doubt that the outcome would be any good. For sure your error rate will drop significantly but all your variants will be messed up in the corrected reads. If you really want to try this experiment, I am not even sure how the usage of -u would help. Since your individuals are not related, it doesn't matter which individuals go into -s or -u: just dump all of your short reads into -s.
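For anyone who still wants to try pooling, here is a minimal sketch of one way to "dump all of your short reads into -s": concatenate the per-individual FASTQ files into a single file and pass that file to -s. This is not taken from the Ratatosk documentation. The function name `pool_short_reads` and the file names are hypothetical, and the sketch assumes standard 4-line FASTQ records. Ratatosk may also accept several short-read files directly, so check the README before pooling by hand.

```python
import gzip
from pathlib import Path


def open_maybe_gzip(path):
    """Open a FASTQ file for binary reading, handling a .gz suffix."""
    path = Path(path)
    return gzip.open(path, "rb") if path.suffix == ".gz" else open(path, "rb")


def pool_short_reads(inputs, output):
    """Concatenate FASTQ files into one uncompressed FASTQ file.

    Returns the number of records written, assuming 4 lines per record.
    (Hypothetical helper for illustration, not part of Ratatosk.)
    """
    n_lines = 0
    with open(output, "wb") as out:
        for path in inputs:
            with open_maybe_gzip(path) as src:
                for line in src:
                    # Guard against a missing newline at the end of a file,
                    # which would otherwise fuse two records together.
                    out.write(line if line.endswith(b"\n") else line + b"\n")
                    n_lines += 1
    return n_lines // 4
```

A call such as `pool_short_reads(["ind1.fastq.gz", "ind2.fastq.gz"], "pooled.fastq")` would then produce the single file to give to -s. Pooling does not change the caveat above: variants from the other individuals will still end up in the corrected long reads.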

jelber2 commented 4 years ago

Thank you. I didn't think it would work but was mainly curious about your thoughts.