allind / EukDetect

MIT License
43 stars 16 forks source link

Relative abundance inference from coverage data #25

Closed mhyleung closed 2 years ago

mhyleung commented 2 years ago

Dear all

I was wondering if there is a way to infer the relative abundance of the detected taxa based on the coverage scores? From my understanding, the "total marker coverage" column in the output file is just the percentage of the marker bases aligned to the query fastq sequences?

Thanks

Marcus

allind commented 2 years ago

Hi Marcus, thanks for reaching out. There currently isn't a relative abundance calculation included as output for eukdetect, but it's a frequently requested addition and I will be releasing code shortly that will do this.

allind commented 2 years ago

Hi Marcus, there is now a new version of eukdetect that does calculate a proxy for relative abundance called "eukaryotic fraction". Since eukaryotes usually make up very small amounts of microbiome sequencing libraries, this number should also be considered alongside the proxy for absolute abundance, which is reported by eukdetect as "reads per kilobase of sequence" or "RPKS". Please reach out with any questions.

mhyleung commented 2 years ago

wow thank you so much for this. Just a note in case you want to update the installation part where the new database is downloaded. The new database now has the file name 34880610 according to FigShare (or at least that's the filename I have when I downloaded from the FigShare link. Thanks!

Marcus

allind commented 2 years ago

Hi Marcus,

Hmm, that's a previous database version - the most recent should be 34885596. What happens if you try to run wget https://ndownloader.figshare.com/files/34885596?

mhyleung commented 2 years ago

cool that works now. Thanks

mhyleung commented 2 years ago

Hi

I read on the github page that: "RPKS is calculated by dividing the reads aligned to markers by the length in Kb of all markers for a species."

So the number calculated from each taxa is related to the number (or length of all) markers in the database. I might be missing something obvious here, but that number would not be related to either the total number of reads in the sample?

So for example if I have the same number of reads that mapped to a specific taxon (e.g. 100 reads from each of sample A and B mapped to taxon_1 that contains total of x number of marker bases in the database). Even if sample A and sample B differ in 10x the number of total reads, the RPKS value will be the same for taxon_1, because the number of total marker bases for taxon_1 is the same?

Thanks

Marcus

allind commented 2 years ago

That's correct - RPKS is essentially depth of coverage and isn't scaled by library size. It's not directly comparable between samples for that reason, unlike the eukfrac number. The reason that it's included in the output is because depth of coverage, independent of library size, is an important metric to consider - in part to caution against drawing conclusions from differences in relative abundance between samples when the raw number of reads in the library is low.

mhyleung commented 2 years ago

Thank you for the explanation. So is there a way to compare the abundance (even in relative terms) between samples based on the EukDetect output, maybe by multiplying by the number of reads in the metagneome?

allind commented 2 years ago

Relative abundances (eukfrac) are directly comparable between samples, though in my opinion it masks important information when the amount of sequence in the library is low.

To compare absolute abundance (RPKS) between samples, look at the library size (read counts) of all the samples you're interested in comparing, and create a scaling factor you'll use to normalize. I would probably do this by choosing the median-sized library as a reference point, and figure out the scaling factor for each sample by dividing the median library size by each sample's library size. Then, multiply all RPKS values in each sample by the scaling factor.

To give an example, if you have samples A, B, and C, with library sizes of 25 million reads, 30 million reads, and 37.5 million reads: choose sample B as the reference library. Sample A's scaling factor is 1.2 (30 million / 25 million), Sample B's scaling factor is 1 (30 million / 30 million), and Sample C's scaling factor is 0.8 (30 million / 37.5 million). Multiply all RPKS in Sample A by the scaling factor (and Sample C by its scaling factor). Those are normalized RPKS values and can be compared.

rishibhandari63 commented 2 years ago

Can these normalized RPKS values be used to make a stacked barplot of relative abundance? I also wanted to see the alpha and beta diversity of these values. Can I use these values directly to create NMDS and alpha diversity?

allind commented 2 years ago

@bhandari63 The eukfrac metric is analogous to relative abundance and can be used in situations where you would use relative abundance, like a stacked bar chart. If you have samples that are relatively diverse in eukaryotic makeup and have a lot - like a deeply sequenced seawater, soil, or plant-associated microbiome - this is probably okay, though take a careful look at your data at all steps in the analysis. If you are working with something that is a less diverse community, like the human microbiome or an animal microbiome, I would suggest looking at your data taxon-by-taxon and not summarized with ordination plots or stacked bar charts.