Detailed calculation of breadth (and coverage)

anniewest commented 4 months ago

Hi Matt,

Thanks for InStrain, it's a great piece of software! I'm no bioinformatician, but I've been trying to wrap my head around how the profile pipeline calculates breadth on a scaffold and genome level. After trawling through your underlying scripts, I have to say I'm no closer to an answer so I thought I'd open an issue to ask!

My understanding of the breadth calculation is that you count all the bases for a given reference genome that have at least one read mapped (i.e. >0), then divide that number by genome (or scaffold) length. Is this correct? Is there a fundamental step I'm missing where this is only based on scaffolds that have minimum coverage? I've tried looking through all the issues on this github page, but to no avail (I'm aware that CoverM is used in the quick_profile function, but not sure how that differs to what happens in the profile function). In order to compare the output, I also ran the BBMap pileup.sh basecov function on bam files (also used for inStrain), for which I've mapped my metagenomic reads to my set of reference genomes (bowtie2 --sensitive), and applied the formula described above to calculate breadth. These outputs differed (hence my question here), so I'm wondering where the differences are between the two methods? My gut instinct is that I'm missing something really obvious here but I'm too deep in the woods to figure it out.

On another note, I was curious as to why you report coverage as an average of reads mapped per scaffold, as opposed to traditional methods that normalise based on gene length (and often apply a cut off e.g. >10 reads mapped)?

Cheers!

MrOlm commented 4 months ago

Hi @anniewest -

Your intuition on how breadth is calculated is correct- that's exactly how it's done. The reason why you might see a difference between inStrain and BBMap pileup.sh basecov is due to the initial read filtering that inStrain does, which discards some reads from the .bam file.

I also believe that the way inStrain reports coverage is the most "traditional"; it's just the average number of reads mapped to a region. InStrain's coverage = coverM's "mean", and inStrains "breadth" is coverM's "covered_fraction". https://github.com/wwood/CoverM

Let me know if you still have questions!

Matt

anniewest commented 3 months ago

Thanks Matt! I did actually have a follow-up question related to checkM:

I've run inStrain on a dereplicated set of genomes derived from my metagenome samples. The outputs suggest little-to-no strain diversity for those genomes (popANI 0.9999, breadth > 0.5) among samples, despite a few of the genomes having fairly large strain heterogeneity estimations from checkM.

I'd be interested to hear your thoughts on the matter if you had a spare moment? I have some evidence to believe that for one of the dereplicated genomes, the other bins in that dRep secondary cluster were much larger than expected for that species and may be assembling reads from two closely related species (based on phylogeny placement with GTDB representative genomes for the genus), resulting in high checkM strain heterogeneity estimates (of reasonable contaimination levels e.g. the winning dRep genome had 12.6% contamination of which 75% was est. strain diversity, while the other bins had 70-82% contamination of which >95% was est. strain diversity).

Many thanks :) Annie

MrOlm commented 3 months ago

Hi Annie,

Yes- your explanation makes total sense to me. It seems like your bins are impacted by having multiple related strains, but that the intra-population diversity is low. Rather than look at within-sample popANI, you might be able to see this with metrics like nucleotide diversity and/or SNVs / kb.

But all-in-all intra-population diversity and this strain heterogeneity of bins are pretty different things, so it's not surprising they show different trends.

Best, Matt

MrOlm / inStrain

Detailed calculation of breadth (and coverage) #185