esteinig / nanoq

Minimal but speedy quality control for nanopore reads in Rust :bear:
MIT License
109 stars 9 forks source link

Using nanoq for PacBio reads #41

Closed krausfeldtle closed 8 months ago

krausfeldtle commented 9 months ago

Hello! Are there any concerns or considerations with using nanoq for PacBio reads? Thank you!

esteinig commented 9 months ago

TLDR: None that I am aware of :) PacBio HiFi average read quality scores should be consistent with how average read quality scores are calculated for ONT reads.

Note that the average read quality scores for ONT data are calculated using the average per-base error probability on a Phred+33 log scale (nanoq/src/needlecast.rs L279-L311, see example variable mean_qual):

/// Utility function to compute mean error probability from quality bytes
///
/// This function computes the mean error probability from quality bytes,
/// from which the mean read quality can be computed.
///
/// Quality encoding: Sanger Phred+33 --> ASCII: 33 - 126 --> Q: 0 - 93
///
/// # Example
///
/// ```rust
/// use needletail::parser::{FastqReader, FastxReader};
/// let fastq = b"@id\nACGT\n+\nIIII";
///
/// let mut reader = FastqReader::new(&fastq[..]);
/// let record = reader.next().unwrap().unwrap();
/// let qual_bytes = record.qual().unwrap();
/// let error_prob = mean_error_probability(&qual_bytes);
/// let mean_qual = -10f32*error_prob.log(10.0);
///
/// assert_eq!(error_prob, 0.0001);
/// assert_eq!(mean_qual, 40.0);
/// ```

fn mean_error_probability(quality_bytes: &[u8]) -> f32 {
    let mut sum: f32 = 0.0;
    for q in quality_bytes.iter() {
        sum += 10f32.powf((q - 33u8) as f32 / -10f32)
    }
    sum / quality_bytes.len() as f32 // mean error probability
}

I have not worked with PacBio in a long time, but my understanding is that HiFi data which uses Phred+33 as per SAM format specification (https://samtools.github.io/hts-specs/SAMv1.pdf) should be consistent with this (https://pacbiofileformats.readthedocs.io/en/latest/BAM.html)

You can read more about how quality scores are calculated and relate to read accuracy measured by alignments using the ONT neural network basecallers here - most of it is quite in depth and describes the calculation of the per-base quality score calibration, so not relevant for your question. The relevant section for read Q scores is under Guppy read Q-scores and read accuracies

https://labs.epi2me.io/quality-scores/

I'm not sure if this information is still relevant (could only find links to years old questions or documentation) but perhaps one thing to look out for is the subread base specification, which may be assigned placeholder quality scores of Phred 0 - ASCII character ! (https://bioinformatics.stackexchange.com/questions/885/which-quality-score-encoding-does-pacbio-use). However, I am not certain this still applies. I don't have PacBio data available to investigate unfortunately.

Do let me know if you run into any inconsistencies with your data or if there are conceptual differences you are aware of with PacBio! I will keep the overall implementation minimal and focused around ONT, but happy to consider any fixes / small features additions.

krausfeldtle commented 9 months ago

Thank you so much for your quick response! I will certainly let you know if I run into any issues! Lauren

esteinig commented 8 months ago

I'll close this for now, but feel free to re-open if you have more questions or feature requests for improving PacBio compatibility :)