DecodeGenetics / Ratatosk

Hybrid error correction of long reads using colored de Bruijn graphs
BSD 2-Clause "Simplified" License

Handling highly heterozygous genomes in Ratatosk #50

Closed cahuparo closed 6 months ago

cahuparo commented 6 months ago

Hi @GuillaumeHolley,

I am currently working on assembling the genome of a highly heterozygous diploid organism, using a combination of Oxford Nanopore Technologies (ONT - R10) and Illumina sequencing data. Given the significant level of heterozygosity present in my organism, I aim to use Ratatosk for correcting ONT reads with the high-accuracy Illumina reads before assembly. My primary concern revolves around the tool's ability to distinguish between sequencing errors and true haplotype variations.

Specific Questions:

  1. Haplotype Preservation: How does Ratatosk handle reads from different haplotypes in the context of a highly heterozygous genome? Specifically, is there a risk of collapsing distinct haplotypes into a consensus sequence during the correction process?
  2. Parameter Adjustments: Are there specific parameters or strategies within Ratatosk that can be adjusted to enhance its performance in preserving haplotype integrity in highly heterozygous organisms?
  3. Best Practices: Could you recommend any best practices or additional steps in the read correction process using Ratatosk for genomes with high levels of heterozygosity?

My goal is to ensure the highest possible quality and accuracy in our genome assembly, particularly in maintaining the true genetic diversity represented by the distinct haplotypes. Understanding Ratatosk's capabilities and limitations in this context will greatly aid in planning our assembly workflow and optimizing our use of the tool for our specific needs.

Thank you for your assistance and for developing such a valuable resource for the genomics community. I look forward to your insights and recommendations.

Best,

Camilo

GuillaumeHolley commented 6 months ago

Dear @cahuparo,

Thank you for reaching out. In general, Ratatosk was designed with human genome datasets in mind (so heterozygous, but not highly heterozygous). However, off the top of my head, high heterozygosity should not affect the performance of Ratatosk, and it should work well straight out of the box on such genomes; there are simply no special settings or tweaks for this type of input genome. Here are my answers to your questions:

  1. Ratatosk was designed specifically to minimize the risk of collapsing different haplotypes when correcting long reads. Yet, there is no such thing as zero risk. The best I can say is that I think Ratatosk does a better job at maintaining phasing than many other Illumina-based correction tools for long reads out there. If it is of any value, I am currently using Ratatosk in a diploid dual assembly (one assembly for each haplotype) pipeline for ONT R9.4 + Illumina, for which haplotype integrity is of major importance, and we get very good results. The unique feature in Ratatosk that makes it possible to maintain the read phasing is the use of a colored de Bruijn graph to "record" where the Illumina reads map in the graph. Many Illumina-based correction tools for long reads try to find a path in the Illumina-built de Bruijn graph such that the sequence of that path is as similar as possible to the subsequence to correct in the long read. This strategy is fairly prone to selecting variants from the incorrect haplotype, given that the sequence used as the reference for the comparison is a noisy long read subsequence with an error rate of 3 to 8%. While Ratatosk uses the same strategy, it also tries to follow a path that is actually "valid" from the perspective of the Illumina reads, which means the selected path might be suboptimal from an edit distance perspective but better from a haplotype preservation perspective. Finally, Ratatosk embeds a simple method to detect SNP candidates directly from the graph. I have a lot more details about this in the Ratatosk paper (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02244-4), see Sections "Graph coloring", "Candidate SNP annotation" and "Traversing the graph".
  2. There are no parameters that I can think of to tweak Ratatosk into doing a better job for highly heterozygous organisms. However, Ratatosk was primarily designed for ONT R9.4 reads, which have a much higher mean error rate than ONT R10 reads. This is good news, as it means Ratatosk performs better on ONT R10, and it will work well on ONT R10 (it already has). ONT R10 uses Phred base quality values that reach 90 in the FASTQ instead of 40 for ONT R9.4. You must adjust for this in Ratatosk by using -Q 90 in the command line (the default is -Q 40). Also, Ratatosk works best with paired-end short reads in input (-s): input short reads from the same pair must have EXACTLY the same FASTA/FASTQ name (if the reads are extracted from a BAM file, use samtools bam2fq -n). Finally, make sure you use the latest version of Ratatosk from this GitHub repository (0.9.0) and not the conda version, which seems to be broken.
  3. Ratatosk outputs base quality scores in Phred format for the corrected reads. These scores reflect the level of confidence in the correction of the corresponding base. You can look at them to see how the correction went. A score of 0 means the base was left uncorrected (as is), 1 means very low confidence in the correction, and 90 means very high confidence in the correction.
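
To illustrate point 3, here is a minimal Python sketch of how one might summarize the correction scores of a corrected read. It is not part of Ratatosk. It assumes the corrected FASTQ uses the standard Phred+33 encoding with four lines per record, and the function names and the `low_conf_max` cutoff are hypothetical choices made for this example only.

```python
import io
from collections import Counter

# Assumption: qualities are stored with the standard Phred+33 offset.
PHRED_OFFSET = 33


def decode_quals(qual_line):
    """Convert a FASTQ quality string into a list of integer Phred scores."""
    return [ord(ch) - PHRED_OFFSET for ch in qual_line]


def iter_fastq(handle):
    """Yield (name, sequence, quality string) from a 4-line-per-record FASTQ."""
    while True:
        header = handle.readline()
        if not header.strip():  # stop at end of file or at a blank line
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header[1:].strip(), seq, qual


def summarize_correction(qual_line, low_conf_max=10):
    """Return the fraction of bases in each confidence bucket.

    Score 0 means the base was left uncorrected. Scores from 1 up to
    low_conf_max (a hypothetical cutoff) are counted as low confidence,
    and anything above that as confident.
    """
    counts = Counter()
    for q in decode_quals(qual_line):
        if q == 0:
            counts["uncorrected"] += 1
        elif q <= low_conf_max:
            counts["low_confidence"] += 1
        else:
            counts["confident"] += 1
    total = max(len(qual_line), 1)  # avoid dividing by zero on empty reads
    labels = ("uncorrected", "low_confidence", "confident")
    return {label: counts[label] / total for label in labels}


if __name__ == "__main__":
    # Toy record: '!' encodes 0, '"' encodes 1 and '{' encodes 90.
    demo = io.StringIO("@read1\nACGT\n+\n!!\"{\n")
    for name, seq, qual in iter_fastq(demo):
        print(name, summarize_correction(qual))
```

Reads with a large uncorrected or low-confidence fraction are the ones worth inspecting first when judging how the correction went.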

Let me know if any of this is unclear or if I can be of further assistance, Guillaume

cahuparo commented 6 months ago

Hi @GuillaumeHolley,

Thank you for the comprehensive response and the valuable insights into Ratatosk's capabilities and design philosophy, especially regarding haplotype preservation in the context of highly heterozygous genomes. Your explanation and the reference to the Ratatosk paper provide a solid foundation for understanding how Ratatosk could benefit our genome assembly project. Before proceeding, I have a few additional questions to ensure optimal application and results:

  1. Dual Assembly Pipeline Integration: Could you provide more details or examples on how Ratatosk has been integrated into diploid dual assembly pipelines, particularly regarding workflow steps before and after Ratatosk's application? Any specific considerations or adjustments needed for such integration would be highly valuable.

  2. Handling of Structural Variations: In genomes with high heterozygosity, structural variations (SVs) can be as important as SNPs. How does Ratatosk handle or affect the correction of reads containing structural variations? Are there any strategies within Ratatosk to preserve SVs?

  3. Future Updates: Are there any planned updates or features in Ratatosk that could further enhance its suitability for highly heterozygous non-human genomes? Insight into ongoing developments would be helpful for long-term planning.

Thank you again for your time and assistance.

Best,

Camilo

GuillaumeHolley commented 6 months ago

  1. Long story short, our assembly pipeline is very much Ratatosk-focused right from the beginning. It starts with a round of correction with Ratatosk using all the short reads to correct all the long reads. This step is what we call "global correction" and it decreases the mean error rate of the long reads quite a lot. There is a little bit of filtering on the corrected long reads after that, where we split the long reads at regions which have low correction scores. We then use the corrected long reads to do a mixed-haplotype ("collapsed") assembly. Corrected long reads and short reads are then aligned back to the collapsed assembly, and we proceed to a "local" correction of the (globally) corrected long reads. We basically process all the contigs of the assembly in non-overlapping windows of 200 kb, correcting together all the reads from each window. Each window correction yields a much smaller and more contiguous graph than the global correction, which in turn leads to a further decrease of the mean error rate. Those globally and locally corrected long reads are the basis for the dual assembly.
  2. SVs won't create small bubbles in the graph like SNPs or indels do. Instead, they create "larger" bubbles that are likely to be entangled with other bubbles (super-bubbles and such), and they might span a length much larger than the insert size of a paired-end short read. In addition to using the colors of the graph as I mentioned before (you want to follow paths with colors that you have already traversed or colors that are in the k-mers shared between the graph and the read), we have a system to anchor the long reads on the graph using k-mers that are an almost perfect (but not exact) match between the graph and the long read.
  3. Unfortunately, I have to prioritize projects and Ratatosk is not an "active project": I am definitely maintaining Ratatosk, but I am not developing new features unless there is a "community" request for it or I need it myself for another project. So basically, the current Ratatosk roadmap is limited to bug fixes (if any bugs are found), dependency updates (if the update makes Ratatosk faster, more memory-efficient or more accurate) and smaller algorithmic enhancements. If you have a need for something specific in Ratatosk, we can definitely talk about it and, depending on how much work it is, I might decide to spend some time on it to make it officially a new feature of Ratatosk, but it is unlikely. That being said, I am available to assist with Ratatosk usage and to help incorporate it in your assembly project if you decide to use it.
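
Two steps of the pipeline in point 1 can be sketched in Python: splitting reads at low-score regions, and tiling contigs into non-overlapping 200 kb windows for local correction. This is only an illustration under stated assumptions, not the actual pipeline code. The function names are hypothetical, as are the `min_score` and `min_run` thresholds, and cutting out the whole low-score run is one possible reading of "split".

```python
def contig_windows(contig_length, window_size=200_000):
    """Yield half-open (start, end) windows that tile a contig without overlap."""
    for start in range(0, contig_length, window_size):
        yield start, min(start + window_size, contig_length)


def split_at_low_scores(seq, quals, min_score=10, min_run=50):
    """Cut a corrected read at runs of low correction scores.

    A run is min_run or more consecutive bases scoring below min_score.
    Each such run is removed and the flanking pieces are returned.
    Both thresholds are hypothetical values chosen for this sketch.
    """
    pieces = []
    piece_start = 0   # start of the piece currently being kept
    run_start = None  # start of the current low-score run, if any
    # A sentinel score at the end closes a low-score run that reaches the read end.
    for i, q in enumerate(list(quals) + [min_score]):
        if q < min_score:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_run:
                if run_start > piece_start:
                    pieces.append(seq[piece_start:run_start])
                piece_start = i
            run_start = None
    if piece_start < len(seq):
        pieces.append(seq[piece_start:])
    return pieces


if __name__ == "__main__":
    # A 450 kb contig is covered by two full windows and one shorter window.
    print(list(contig_windows(450_000)))
    # Toy read with tiny thresholds so the low-score run in the middle is visible.
    toy_quals = [40] * 4 + [0] * 4 + [40] * 4
    print(split_at_low_scores("AAAACCCCGGGG", toy_quals, min_score=10, min_run=3))
```

In a real workflow, the reads aligned to each window would be extracted and corrected together, for instance with one Ratatosk run per window.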

cahuparo commented 6 months ago

Thank you for taking the time to answer my questions. I will give it a try! Best, Camilo