Approach for extending TRef profile

Hi there,

I have been considering the following approach and would like your opinion:

To further refine the identification of cell/tissue fractions in my bulk RNA-seq samples, I thought about extending the TRef reference profiles with an additional profile for normal tissue (based on bulk RNA-seq of the respective normal tissue), to get an even better estimate of tumor purity (= otherCells). Based on this idea, my questions are as follows:

Do you think such an approach is feasible?
Is the only thing I have to do to add another column representing the normal tissue to TRef$refProfiles and add the respective marker genes to TRef$sigGenes?
Would I determine the marker genes for the normal tissue based on:

Importantly, we do not require our signature genes to be expressed in exactly one cell type, but only to show very low expression in cancer cells

and the extended "Cell marker gene identification" in the methods section?

I guess that, since the input matrix is in normalized TPM, TPM is also used for TRef$refProfiles?

I'm very curious about your feedback!

Thank you!

Hello,

Thank you for your question! I think that such an approach is feasible, but it might be more difficult than it looks, and I'm not sure the results will improve much. Note that it might also be difficult to distinguish a normal cell from a cancer cell at the transcript level: some cancer cells might be at different developmental stages and might thus still look like normal cells. Here are some additional answers and thoughts on your questions.

This approach might be feasible, but with some warnings: the "normal cells" you want to add (you are probably talking about epithelial cells, for example?) are certainly very similar to the cancer cells. That is, the cancer cells developed from these cells, so the baseline gene expression will be the same in both cell types, with some genes overexpressed (or underexpressed) in the cancer cells. There might therefore be only a few genes that qualify as signature genes of the normal cells (i.e. genes expressed by these normal cells and not at all by the cancer cells). Similarly, we discussed in the paper the estimation of Thelper vs Treg proportions, and we saw that because these two cell types show highly similar gene expression profiles, the predictions were less accurate.
The most straightforward thing is indeed to add an extra column to TRef$refProfiles and add the normal cell marker genes to TRef$sigGenes. You could also add another column to TRef$refProfiles.var to indicate the variability of each gene's expression in these normal cells (if you don't add it, EPIC will not account for the variability in gene expression at all, but as we saw, it only slightly improves the results, so it is not very important). The new data should indeed be given in TPM-normalized counts. However, if you do just this, there might be some batch effects. The TRef profiles were obtained from single-cell RNA-seq data, which might still show some strong biases with respect to bulk RNA-seq. When all reference profiles come from the same type of data, it worked fine, even for predicting bulk RNA-seq data, as we observed in the paper. But if you mix reference profiles from single-cell data with reference profiles from bulk data and then predict bulk data, the batch effects might be stronger. Ideally, you would build all reference profiles from the same experiment, thus avoiding batch effects, or build them from multiple experiments while checking that the batch effects are small (as we did for the circulating immune cell reference profiles). If you combine multiple experiments, you don't need every cell type to be present in every experiment, but some cell types should ideally be in common, so that you can verify that these common cell types indeed cluster together.
Yes, you could use this to search for signature genes. Ideally, you would run a differential expression analysis between the various cell types (and, if possible, also against sorted cancer cells) to find genes that are well expressed by your normal cells but only at a negligible level in the cancer cells. These genes could also be expressed by some other non-malignant cell types from the reference profiles, but it is even better if you can find genes expressed only by your normal cells (at least some of the signature genes should be). In our analyses, 10-20 marker genes were usually a good number, but this might differ if your cell types are very similar.
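
As a rough illustration of the extension described above, here is a minimal sketch in base R. It works on a small toy list that only mimics the layout of TRef (refProfiles, refProfiles.var, sigGenes); the helper name extend_reference, the gene names, the cell type labels, and all expression values are hypothetical and not part of EPIC. It assumes the new normal-tissue profile is already in TPM and covers the same genes as the existing profiles.

```r
# Toy stand-in for the TRef list; the real object has many more genes and cell types.
toy_ref <- list(
  refProfiles = matrix(
    c(120, 0.5,
        1,  80,
       10,  12,
      0.2, 0.3),
    ncol = 2, byrow = TRUE,
    dimnames = list(c("GENE_A", "GENE_B", "GENE_C", "GENE_D"), c("Bcells", "CAFs"))
  ),
  refProfiles.var = matrix(
    c(15, 0.2,
      0.5, 9,
      2,   3,
      0.1, 0.1),
    ncol = 2, byrow = TRUE,
    dimnames = list(c("GENE_A", "GENE_B", "GENE_C", "GENE_D"), c("Bcells", "CAFs"))
  ),
  sigGenes = c("GENE_A", "GENE_B")
)

# Hypothetical helper: append one cell type to a TRef-like reference list.
# new_profile and new_var are named numeric vectors (names = gene names).
extend_reference <- function(ref, new_profile, new_name, new_markers, new_var = NULL) {
  genes <- rownames(ref$refProfiles)
  missing_genes <- setdiff(genes, names(new_profile))
  if (length(missing_genes) > 0) {
    stop("New profile lacks genes: ", paste(missing_genes, collapse = ", "))
  }
  stopifnot(all(new_markers %in% genes))

  # Reorder the new profile to the reference gene order before binding.
  new_col <- matrix(new_profile[genes], ncol = 1, dimnames = list(genes, new_name))
  ref$refProfiles <- cbind(ref$refProfiles, new_col)

  if (is.null(new_var)) {
    # No variability supplied: drop the variability matrix so dimensions stay
    # consistent (variability is then not taken into account).
    ref$refProfiles.var <- NULL
  } else if (!is.null(ref$refProfiles.var)) {
    stopifnot(all(genes %in% names(new_var)))
    var_col <- matrix(new_var[genes], ncol = 1, dimnames = list(genes, new_name))
    ref$refProfiles.var <- cbind(ref$refProfiles.var, var_col)
  }

  ref$sigGenes <- union(ref$sigGenes, new_markers)
  ref
}

# Made-up normal-tissue profile (TPM) and variability, given in a different gene order.
normal_profile <- c(GENE_D = 95, GENE_A = 0.1, GENE_B = 4, GENE_C = 11)
normal_var     <- c(GENE_D = 12, GENE_A = 0.05, GENE_B = 1, GENE_C = 2)

ext_ref <- extend_reference(toy_ref, normal_profile, "NormalTissue",
                            new_markers = "GENE_D", new_var = normal_var)
print(ext_ref$refProfiles)
print(ext_ref$sigGenes)
```

With the real package, a list built this way from TRef would then be passed as the reference argument of EPIC() together with your bulk TPM matrix. The sketch does not address the batch effects mentioned above; those still need to be checked separately.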

I hope this helps. Please let me know if you try this approach and if you get something interesting (or not) from it.

Best wishes,

Julien

GfellerLab / EPIC
