GfellerLab / EPIC

Repository for the R package EPIC, to Estimate the Proportion of Immune and Cancer cells from bulk gene expression data.
https://gfellerlab.shinyapps.io/EPIC_1-1/

other cells in case of PBMC data #4

Closed pbalajiv closed 5 years ago

pbalajiv commented 5 years ago

Hi, thank you for this fantastic work. I have several questions about using the package.

  1. I'm trying to deconvolve PBMC-derived bulk RNA-seq data (14000 genes after ID mapping, removal of zeros, etc.) over different time points and would like to understand the population dynamics. I used "BRef" as my reference profile. The estimated proportion of other cells came out at around 60-70% of the total cell population. I expected it to be much lower than that, since my data is from PBMC. What do you think about this? Should I use the option to ignore other cell populations in the EPIC wrapper?
  2. What is the importance of siggenes and how are they useful? I'm sorry if this question is trivial, but I'm not able to understand why they exist only as a list of names in the reference profile.
  3. My dataset has many missing genes (using only 14000 genes while the reference profile has about 49000 genes). How are these missing values accounted for in the algorithm?

Thanks in advance for the answers.

jracle85 commented 5 years ago

Dear pbalajiv, Thank you for your interest and question. And sorry for the delay, I had missed this question...

So to reply to your questions:

  1. Indeed, in PBMC samples we would expect the "other cells" to be present in much lower proportions, as the reference profiles contain the most important cell types. Here are some suggestions of what could be causing it:

    • Does your data correspond to TPM / RPKM (which is needed) or did you use some other type of normalization? Importantly, your counts should not be log-transformed.
    • I don't know what RNA-seq technology you used, but if you have many zeros (or near-zeros) that are not true zeros and they were kept, this could be an issue. You write that you removed them, but are there still genes that were kept because the value was non-zero in some samples while being 0 in many other samples? See my answers to questions 2 and 3 below.
    • Depending on the condition, some other cell types may also be present in your PBMC samples.
    • Or maybe some of the cell types get activated and show a gene expression profile very different from their standard form? In principle, the genes we use are expressed at a fairly stable level across conditions, but there may be special cases where this no longer holds.
  2. The signature genes are important because they tell EPIC which genes to use for the deconvolution. EPIC first uses the full set of genes found in the intersection of the bulk samples and the reference profiles to do an initial normalization, based on the library size that remains within this gene intersection. It then keeps only the subset of genes defined as signature genes (in both the bulk samples and the reference profiles) and does a least-squares optimization on this subset to estimate the proportions of the various cell types.

  3. As explained in answer 2, EPIC considers the intersection of the genes defined in both the bulk and the reference profiles and does a normalization based on this intersection.

    • If you have missing values and the corresponding gene names are removed from your bulk, it should be fine: EPIC will consider only this smaller set of genes and won't do any interpolation for the missing genes. But please check that the signature genes (or at least most of them) are among your remaining genes, because otherwise it won't really work accurately.
    • On the other hand, if you kept your missing genes and set them to 0, then this is an issue, as EPIC will treat these genes as truly expressed at a value of 0, which distorts the expression of all the other genes.
    • As a side note, the 49000 genes in BRef in fact include many genes that are also expressed at a value of 0. In general, a sample would contain about 20000 different genes, but some microRNA names and other genes were kept in these reference profiles, which is why there are so many different genes there.
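
The two steps described in answers 2 and 3 can be sketched roughly as follows. This is an illustrative Python toy, not EPIC's R code: the gene names, the numbers, the helpers `rescale`, `project` and `deconvolve`, and the `other_cells` key are all made up for the example. A plain projected-gradient least-squares fit stands in for whatever optimizer and gene weighting EPIC actually uses, and the constraints (non-negative proportions summing to at most 1, with the remainder attributed to other cells) are an assumption of the sketch.

```python
def rescale(profile, genes, total=1e6):
    """Rescale a {gene: value} profile so it sums to `total` over `genes`."""
    s = sum(profile[g] for g in genes)
    return {g: profile[g] * total / s for g in genes}


def project(p):
    """Project p onto the set {p >= 0, sum(p) <= 1}."""
    clipped = [max(x, 0.0) for x in p]
    if sum(clipped) <= 1.0:
        return clipped
    # Otherwise the projection lies on the simplex {p >= 0, sum(p) = 1}.
    cumulative, theta = 0.0, 0.0
    for j, value in enumerate(sorted(p, reverse=True), start=1):
        cumulative += value
        if value - (cumulative - 1.0) / j > 0:
            theta = (cumulative - 1.0) / j
    return [max(x - theta, 0.0) for x in p]


def deconvolve(bulk, refs, sig_genes, n_iter=2000):
    """Toy two-step estimate of cell proportions for one bulk sample."""
    # Step 1: renormalize bulk and references over their shared genes.
    common = [g for g in bulk if all(g in r for r in refs.values())]
    bulk_n = rescale(bulk, common)
    refs_n = {c: rescale(r, common) for c, r in refs.items()}
    # Step 2: keep only the signature genes and fit by least squares.
    sig = [g for g in sig_genes if g in common]
    cells = list(refs_n)
    A = [[refs_n[c][g] for c in cells] for g in sig]
    b = [bulk_n[g] for g in sig]
    # Step size 1 / ||A||_F^2 is a safe choice for gradient descent here.
    step = 1.0 / sum(a * a for row in A for a in row)
    p = [0.0] * len(cells)
    for _ in range(n_iter):
        resid = [sum(a * x for a, x in zip(row, p)) - y for row, y in zip(A, b)]
        grad = [sum(A[i][k] * resid[i] for i in range(len(sig)))
                for k in range(len(cells))]
        p = project([x - step * g for x, g in zip(p, grad)])
    result = dict(zip(cells, p))
    result["other_cells"] = 1.0 - sum(p)  # hypothetical name for the remainder
    return result


# Made-up reference profiles and a bulk sample mixed as
# 0.3 * Bcells + 0.5 * Tcells + 0.2 * an unreferenced cell type.
refs = {
    "Bcells": {"sigB": 200000.0, "sigT": 0.0, "hk1": 500000.0, "hk2": 300000.0},
    "Tcells": {"sigB": 0.0, "sigT": 100000.0, "hk1": 600000.0, "hk2": 300000.0},
}
bulk = {"sigB": 60000.0, "sigT": 50000.0, "hk1": 530000.0, "hk2": 360000.0}
sig_genes = ["sigB", "sigT"]

est_full = deconvolve(bulk, refs, sig_genes)
# Gene hk2 "missing": either removed from the bulk, or kept and set to 0.
bulk_dropped = {g: v for g, v in bulk.items() if g != "hk2"}
bulk_zeroed = dict(bulk, hk2=0.0)
est_dropped = deconvolve(bulk_dropped, refs, sig_genes)
est_zeroed = deconvolve(bulk_zeroed, refs, sig_genes)
```

With these toy numbers, the fit on the complete data recovers the 0.3 / 0.5 / 0.2 mixture. Removing `hk2` changes the estimates only moderately (other cells at about 0.125), whereas keeping it as 0 rescales the bulk against unchanged references and pushes the other-cell estimate to about 0. The direction and size of the distortion are specific to this toy; the point is only that a gene set to 0 and a gene that is removed lead to different normalizations.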

I hope this helps and that you find out why you have such a large proportion of other cells in your PBMC data.
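
As a starting point for that search, here is a rough Python sketch of the checks suggested above: whether the values look log-transformed, whether the signature genes are present, and whether some genes are 0 in most samples but not all. It is not part of EPIC; the function names and all thresholds (`log_max`, `min_sig_fraction`, `max_zero_fraction`) are arbitrary assumptions to adapt to your data.

```python
def check_sample(sample, sig_genes, log_max=30.0, min_sig_fraction=0.8):
    """Return warning strings for one bulk sample given as {gene: value}."""
    warnings = []
    values = list(sample.values())
    # Heuristic: linear TPM/RPKM data normally has a maximum far above
    # log_max, while log-scale data stays small or goes negative.
    if min(values) < 0 or max(values) < log_max:
        warnings.append("values look log-transformed; linear TPM/RPKM is needed")
    present = [g for g in sig_genes if g in sample]
    if len(present) < min_sig_fraction * len(sig_genes):
        warnings.append("many signature genes are missing from the sample")
    zero_sig = [g for g in present if sample[g] == 0]
    if zero_sig:
        warnings.append("signature genes at 0, check they are true zeros: "
                        + ", ".join(zero_sig))
    return warnings


def mostly_zero_genes(samples, max_zero_fraction=0.5):
    """Genes that are 0 in more than max_zero_fraction of samples, but not all."""
    flagged = []
    for gene in sorted(set().union(*samples)):
        zeros = sum(1 for s in samples if gene in s and s[gene] == 0)
        if zeros < len(samples) and zeros / len(samples) > max_zero_fraction:
            flagged.append(gene)
    return flagged


# Made-up example values.
sample = {"sigB": 60000.0, "sigT": 50000.0, "hk1": 530000.0, "hk2": 360000.0}
print(check_sample(sample, ["sigB", "sigT"]))
print(mostly_zero_genes([{"a": 1, "b": 0}, {"a": 2, "b": 0}, {"a": 3, "b": 5}]))
```

Genes flagged by `mostly_zero_genes` are candidates for the "not true 0" case: if they are dropouts rather than real absence of expression, removing them from the bulk is safer than keeping them at 0.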

Best wishes,

Julien