FrederickHuangLin / ANCOMBC

Differential abundance (DA) and correlation analyses for microbial absolute abundance data
https://www.nature.com/articles/s41467-020-17041-7

Correct input for metagenome assembled genome abundance #196

Closed by Somebodyatthdoor 9 months ago

Somebodyatthdoor commented 1 year ago

Hi,

I am aware that "raw" data is the recommended input for ANCOM and ANCOMBC. For amplicon data it is clear that this means count data that has not been normalised for sequencing depth. For comparing the abundance of metagenome-assembled genomes (MAGs) between groups it is trickier, and I am not sure what the best approach would be. MAG abundance is usually calculated by mapping reads back to the MAGs, followed by normalisation steps that attempt to correct for factors such as gene length, genome size and library size. Without these steps you risk biasing your dataset, for example by making larger genomes appear more abundant simply because they recruit more reads.

As such, abundance data is often reported not as counts but as percentage relative abundance, RPKM, TPM etc. (e.g. using tools such as CoverM: https://github.com/wwood/CoverM). Personally, I prefer TPM.
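To make the normalisations mentioned above concrete, here is a minimal sketch of the textbook RPKM and TPM formulas applied to per-MAG read counts. The function names, genome lengths and counts are made up for illustration, and tools such as CoverM may differ in the details of how they compute these values.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of genome per million mapped reads, for one sample."""
    total_reads = sum(counts)
    return [
        c / (length / 1e3) / (total_reads / 1e6)
        for c, length in zip(counts, lengths_bp)
    ]


def tpm(counts, lengths_bp):
    """Length-normalise first, then rescale so the sample sums to one million."""
    reads_per_kb = [c / (length / 1e3) for c, length in zip(counts, lengths_bp)]
    total = sum(reads_per_kb)
    return [r * 1e6 / total for r in reads_per_kb]


# Hypothetical sample: three MAGs with equal coverage but different genome sizes.
genome_lengths_bp = [2_000_000, 4_000_000, 1_000_000]
sample_counts = [2000, 4000, 1000]

sample_rpkm = rpkm(sample_counts, genome_lengths_bp)
sample_tpm = tpm(sample_counts, genome_lengths_bp)
```

In this toy sample the raw counts differ fourfold between MAGs, but all three receive the same RPKM and the same TPM, because the larger genomes only recruited more reads in proportion to their length.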

I have noticed a few papers that appear to use TPM with ANCOMBC. Would using such data cause problems, given that ANCOMBC assumes the data is in a raw state?

Cheers, Laura

FrederickHuangLin commented 11 months ago

Hi Laura,

That's a great question. As you mentioned, RPKM and TPM are generated by normalizing for both gene length and sequencing depth. While normalizing by gene length shouldn't significantly affect ANCOM-BC (or ANCOM and ANCOM-BC2), normalizing for sequencing depth is something we aim to avoid, since we consider sequencing depth informative. When sequencing depth normalization has been applied, ANCOM-BC methods might lose statistical power, although I don't believe it would lead to issues with controlling the FDR.
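The distinction between the two normalizations can be illustrated with a small sketch. It compares length-only scaling (reads per kilobase) with TPM for the same hypothetical community sequenced at two depths. All names and numbers here are made up, and this is only an illustration of what each normalization keeps or discards, not a benchmarked or recommended ANCOM-BC workflow.

```python
def reads_per_kb(counts, lengths_bp):
    """Length-only normalization: no scaling by sequencing depth."""
    return [c / (length / 1e3) for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Length normalization followed by rescaling each sample to one million."""
    rpk = reads_per_kb(counts, lengths_bp)
    total = sum(rpk)
    return [r * 1e6 / total for r in rpk]


# Hypothetical data: the same three MAGs, sequenced shallowly and 10x deeper.
genome_lengths_bp = [2_000_000, 4_000_000, 1_000_000]
shallow_counts = [2000, 4000, 1000]
deep_counts = [c * 10 for c in shallow_counts]

rpk_shallow = reads_per_kb(shallow_counts, genome_lengths_bp)
rpk_deep = reads_per_kb(deep_counts, genome_lengths_bp)

tpm_shallow = tpm(shallow_counts, genome_lengths_bp)
tpm_deep = tpm(deep_counts, genome_lengths_bp)

# Ratio of per-sample totals after length-only scaling.
depth_ratio = sum(rpk_deep) / sum(rpk_shallow)
```

After length-only scaling the deeper sample still has a tenfold larger total, so the depth difference remains in the data. The TPM vectors for the two samples are identical, which means the depth information has been removed.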

It's important to note that the ANCOM-BC family of methods has not been specifically benchmarked on these types of data. So, if you choose to use them in this context, I recommend proceeding with caution and being mindful of the potential limitations.

Best regards, Huang