CDCgov / phoenix

🔥🐦🔥PHoeNIx: A short-read pipeline for healthcare-associated and antimicrobial resistant pathogens
Apache License 2.0
52 stars 19 forks source link

[Question] - Why are coverages calculated with assembled length instead median genome size of species? #51

Closed rpetit3 closed 1 year ago

rpetit3 commented 1 year ago

Hi PHoeNIx Team,

Looking here (https://github.com/CDCgov/phoenix/wiki/Pipeline-Overview#qc-of-assembled-scaffolds--500bps) I noticed the raw and trimmed coverages are calculated using the assembly length, but the assembly ratio is calculated with median genome size of species. This to me seems problematic, since the assembled genome size can occasionally be quite off from the actual genome size.

Here's an example using S. aureus which has a genome size of about 2,800,000 bp. Let's say I sequence a S. aureus genome and the raw FASTQ has 280,000,000 bp.

Based on sequencing I have a raw estimated coverage of 100x (total sequenced bases / genome size).

Now let's say I get an assembly length of 1,400,000 bp, based on using the assembly length I now have an estimated raw coverage of 200x. Similarly, if my assembly length is 5,600,000 bp, I now have an estimated raw coverage of 50x.

For me, the coverage should reflect how much of the genome you sequenced (stable), not how much you assembled (unstable).

Thank you for your time! Robert

jvhagey commented 1 year ago

Hi @rpetit3 thanks for your thoughtful question, we agree that with our coverage calculation when assembly goes poorly this will lead to an inflated or depressed coverage estimate. However, we take a holistic view of assembly QC as no one metric can be used for "passing" or "failing" an isolate, but rather each metric taken together provides a picture of the whole. Which is why we include other QC metrics in the output, particular the assembly ratio. So in the case where you have high coverage, your assembly ratio would be significantly <1 and the two metrics together inform if the assembly is of good quality. In the example you give, if your assembly ended up being 5,600,000 bp, when the median genome size is about 2,800,000 bp then you have an assembly ratio 2 (which would be flagged) and we’d advise to consider the metrics together and not in isolation. I get your point on stability versus instability though. In practice, is the coverage metric in PHX wildly off from your estimate? I am imagining that when sequencing/assembly "goes well" our estimates are similar? The assembly QC is just one checkpoint in the pipeline and what you do with that information depends on what you intend to do with it downstream (i.e. are you going to use it for outbreak analysis or are you only interested in if a particular AR gene is a new variant).

Additionally, coverage based on the median genome size can also be problematic depending on the organism. For example, there are a many species that only have 1 or a few genome sequences and using a median in those cases doesn't provide the confidence one might initially think, without knowing the total number of genomes. As another example, Pseudomonas can have a varying number of large plasmids (genome can range from 5.5-7Mbp) using its median genome size could result in overestimates/underestimates of coverage again. It seemed practical to us to have a set way to calculate the coverage rather than building in specifics for particular species we know might cause problems if we used the median. We are certainly open to seeing data that strongly supports one over the other. Hope this explains our reasoning better and appreciate you opening the issue to discussion.

rpetit3 commented 1 year ago

Thank you @jvhagey! Haha totally agree with using median genome sizes for the random species with a huge range.

On the Bactopia side, I use coverage estimates pre-assembly to subsample to an estimated 100x, as well as stopping the pipeline (e.g. too low coverage). In most cases, exactly like you said, our coverage estimates are pretty close to one another. I was honestly curious if there was a preference on assembly length vs genome size for coverage estimates.

Thanks again, I'll go ahead and close this