GfellerLab / EPIC

Repository for the R package EPIC, to Estimate the Proportion of Immune and Cancer cells from bulk gene expression data.
https://gfellerlab.shinyapps.io/EPIC_1-1/

Can we use normalised counts as input? #9

Closed guray003 closed 1 year ago

guray003 commented 2 years ago

Hi.

I'm keen to use EPIC and read in the instructions that it only accepts TPMs or FPKMs. Is it possible to use normalised counts instead?

Thanks.

jracle85 commented 2 years ago

Hello,

Thank you for your question. I'm not sure which "normalised" counts you are referring to, as there are multiple ways of normalizing them. But here is a general answer about which type of counts to use with EPIC:

EPIC (and the maths behind it) has been developed based on TPM normalization, so I would advise using TPM if possible (FPKM/RPKM would work as well). In particular, the part of EPIC that converts the predicted mRNA fractions into predicted cell fractions works if the data is TPM normalized. This is because with TPM the value for a given gene is proportional to the number of copies of this mRNA in the sample, whereas with other normalizations the value can also depend on the length of the gene.
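For reference, a minimal sketch of the usual counts-to-TPM conversion, assuming raw read counts and per-gene lengths in kilobases are available. The function name `counts_to_tpm` and the toy numbers are hypothetical and not part of EPIC:

```python
def counts_to_tpm(counts, lengths_kb):
    """Convert raw read counts to TPM (transcripts per million).

    counts:     dict mapping gene name -> raw read count
    lengths_kb: dict mapping gene name -> gene length in kilobases
    """
    # Reads per kilobase: removes the dependence on gene length.
    rates = {gene: counts[gene] / lengths_kb[gene] for gene in counts}
    total = sum(rates.values())
    if total == 0:
        raise ValueError("all counts are zero; TPM is undefined")
    # Rescale so that the values of one sample sum to one million.
    return {gene: 1e6 * rate / total for gene, rate in rates.items()}


# Toy input (hypothetical): geneB gets 10x more reads only because it is 10x longer.
counts = {"geneA": 100, "geneB": 1000}
lengths_kb = {"geneA": 1.0, "geneB": 10.0}
tpm = counts_to_tpm(counts, lengths_kb)
print(tpm)
```

In this toy case both genes end up with the same TPM, even though their raw counts differ tenfold, which is the length correction described above.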

However, if estimating the mRNA fractions (instead of cell fractions) is sufficient, then other normalizations should likely also work. Please note, though, that if you want to use another normalization, you should ideally redefine the reference gene expression profiles, so that both the bulk samples and the reference profiles are based on the same normalization. If you don't use the same normalization for bulk and reference, this would likely bias the estimated proportions. Here's a little example of the problem: imagine that in the reference gene expression profile of B cells, geneA has a TPM of 1 and geneB also 1, but that, based on the same data, another normalization gives values of 1 for geneA and 10 for geneB. If you then give EPIC a pure B cell sample in this other normalization, while using the standard TPM values as reference, EPIC should in principle return that the sample is 100% B cells. But it has no way of knowing this: it cannot make the values for geneA and geneB both fit the reference profile at the same time, so it might report, say, only 50% B cells. Note that this "issue" is not specific to EPIC; any other deconvolution method based on gene expression reference profiles would have the same problem.
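The geneA/geneB example can be played through with a toy fit. This is only a sketch under stated assumptions: it uses a plain grid search over constrained least squares, not EPIC's actual optimizer. It adds a hypothetical second cell type ("other") so that the misfit has somewhere to go, and it assumes the other normalization is rescaled to the same total as the TPM values. All names and numbers besides the geneA/geneB values are made up for illustration:

```python
def fit_fractions(bulk, ref_b, ref_other, step=0.01):
    """Grid-search fractions (f_b, f_other), both >= 0 and summing to at most 1,
    that minimise the squared error between the bulk profile and the
    fraction-weighted sum of the two reference profiles."""
    n = round(1 / step)
    best = None
    for i in range(n + 1):
        for j in range(n + 1 - i):
            f_b, f_other = i * step, j * step
            err = sum((x - f_b * rb - f_other * ro) ** 2
                      for x, rb, ro in zip(bulk, ref_b, ref_other))
            if best is None or err < best[0]:
                best = (err, f_b, f_other)
    return best[1], best[2]


# Reference profiles in TPM-like units, ordered as [geneA, geneB].
ref_b = [1.0, 1.0]       # B cells, as in the example above
ref_other = [0.0, 1.0]   # hypothetical second cell type expressing only geneB

# Pure B cell sample, same normalization as the reference.
bulk_same = [1.0, 1.0]

# Same sample under the "other" normalization (1 and 10),
# rescaled to the same total as the TPM values (assumption).
raw_other = [1.0, 10.0]
scale = sum(bulk_same) / sum(raw_other)
bulk_mixed = [v * scale for v in raw_other]

b_same, other_same = fit_fractions(bulk_same, ref_b, ref_other)
b_mixed, other_mixed = fit_fractions(bulk_mixed, ref_b, ref_other)
print("matched normalization:    B fraction =", b_same)
print("mismatched normalization: B fraction =", b_mixed)
```

With matched normalization the fit recovers a pure B cell sample, while with the mismatched input most of the signal is attributed to the hypothetical other cell type. The exact numbers depend entirely on the toy setup.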

So to summarize: EPIC should work to predict the proportion of mRNA from data in another normalization (with no warranty; I didn't test it with all possible normalizations), but you'd better redefine the reference profiles. You could build them from the same data as those used for EPIC: the datasets are publicly available and referenced in our publication as well as in the R package help documents (e.g. ?EPIC::TRef ; ?EPIC::BRef ).

Best wishes,

Julien