DaehwanKimLab / centrifuge

Classifier for metagenomic sequences
GNU General Public License v3.0
237 stars 73 forks source link

Meaning of abundance in classification summary #152

Open lec49 opened 5 years ago

lec49 commented 5 years ago

Hi, I have a question about the 'abundance' column on the centrifuge classification summary output.

The manual says: 'it is the proportion of this genome normalized by its genomic length'

Can someone clarify in more simple terms exactly what this means?

My understanding was it was the length of all the reads that map to that genome divided by the total length of the genome. If you have reads that map to the whole genome you will get a score of 1.0? Is this correct or does it mean something else?

Thanks,

Lyndsay

t-neumann commented 5 years ago

I have the same question actually. Contacted the developers via email and no answer.

It would be really helpful to get some formula and this clarified...

pxo289 commented 5 years ago

I've got a somewhat related issue, in that I get loads of classifications for which the centrifuge summary indicates there are thousands of reads (and unique reads as well since I restricted it to scoring down to 1 classification) which have an abundance of just 0. Looking at the equation in the paper for calculating abundance I'm not sure why there would ever be anything it says is present at all with an abundance of exactly 0. There doesn't seem to be a pattern to it either, includes eukaryotes and prokaryotes, species level through kingdom level classifications.

ExplodingCabbage commented 5 years ago

@pxo289 I see the same output as you, where everything gets an abundance of 0 despite a large proportion of all reads being uniquely assigned to one organism. There's a bug somewhere, evidently. Note that there's a separate issue tracking it: https://github.com/DaehwanKimLab/centrifuge/issues/158

@lec49, I'll also attempt answer your original question, based on my best attempting at understanding the paper at https://genome.cshlp.org/content/26/12/1721.full.pdf, with the caveat that, as noted above, I've not actually got Centrifuge to work on my reads yet.

My understanding was it was the length of all the reads that map to that genome divided by the total length of the genome.

No - the length of reads is irrelevant. It's the count of reads that matters.

Rather, the abundance values attempt to convey what proportion of organisms in a sample were from each species. That calculation is informed by both the number of reads assigned to each species and the lengths of those species' reference genomes. The key formula is this one from the end of page 4:

screenshot of likelihood formula from paper

This gives the likelihood of a given configuration of abundances, α, given a set C of assignments of reads to species. (Equivalently, then, it's the probability of a set of assignments C of reads to species given a set of abundances.) From the formula we can reverse-engineer Centrifuge's assumptions:

That second assumption is where the normalized by its genomic length idea comes in.

Subject to those assumptions, Centrifuge uses the screenshotted formula to find the highest-likelihood configuration of abundances.

aistBMRG commented 5 years ago

Using Illumina data from mock communities, also often found unexpected abundance estimates, many reads mapped uniquely but very low or zero estimated abundances. Seems that the expectation maximization may be the cause of this. Could the developers please comment on this?

Thanks.

harisankarsadasivan commented 5 years ago

Any developments with relation to this thread?