amitfrish / scBio

Single Cell Genomics for Enhancing Cell Composition Inference from Bulk Genomics Data

Confused in input #5

Closed QqQss closed 5 years ago

QqQss commented 5 years ago

Hi, thanks for your great tool!

I have some questions about the input data. Should I input pre-normalized values (e.g. TPM / FPKM / CPM or their log values) or just raw counts for both the bulk and single-cell profiles?
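For readers unfamiliar with the normalizations named in this question, here is a minimal sketch of counts-per-million scaling and its log version. Note that "CPM" here means counts per million, not the CPM deconvolution method discussed in this thread. The function names and the toy counts are hypothetical, and this is not code from scBio.

```python
import math

def counts_per_million(sample_counts):
    """Scale one sample's raw gene counts so that they sum to one million."""
    total = sum(sample_counts)
    if total == 0:
        raise ValueError("sample has no counts")
    return [c / total * 1_000_000 for c in sample_counts]

def log2_counts_per_million(sample_counts, pseudocount=1.0):
    """Log-scale version; the pseudocount keeps zero counts finite."""
    return [math.log2(v + pseudocount) for v in counts_per_million(sample_counts)]

# Hypothetical toy data: the same three genes in one bulk sample and one cell.
bulk_counts = [500, 1500, 3000]
cell_counts = [1, 3, 6]

bulk_norm = counts_per_million(bulk_counts)
cell_norm = counts_per_million(cell_counts)
```

In this toy case the bulk sample and the single cell have very different sequencing depths but the same relative composition, so both end up on the same scale after normalization.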

One more question: my single-cell profile was generated on two platforms (e.g. some cells by 10X and the others by DropSeq), and my cell-state space (UMAP/tSNE coordinates) was obtained with other integration software, such as MNN and BBKNN. Can these inputs (uncorrected RNA profile and corrected cell-state space) affect the CPM result? If so, what should I do?

QqQss commented 5 years ago

One possibly important point: according to your published paper, you performed deconvolution using linear regression: U = ∑_i R_i ⋅ β_i

In most cases, scRNA-seq is done with 3'/5'-end, UMI-based methods such as 10X. Bulk RNA-seq, however, is usually full-length, that is, reads cover every part of an intact mRNA.

So in these cases, bulk and single-cell sequencing measure expression differently. Can a simple linear regression still work well?

amitfrish commented 5 years ago

Well, it is hard for me to answer all of these questions since I didn't test this in the paper. What you are describing is a problem common to all deconvolution methods. So far, most people have used microarray-based reference data sets to deconvolve RNA-seq-based bulk data. What I can tell you is that in our paper the SC data were completely different from the bulk data, and it worked great. The linearity is important since we assume that each cell has a pool of RNA reads, and we want to find the best combination of cells to explain the total pool in the bulk data.

For your first question: you want to be able to compare the predictions between samples, so use whichever RNA normalization you trust to allow this (TPM, FPKM, or CPM). Regarding log scale, for absolute prediction we suggest the linear scale, and for relative prediction the log scale (see the main page for a more detailed explanation).

The most important thing I want to emphasize is the cell space. You should choose the cell space according to what you want to check. You can choose highly variable genes to get a general one, but you can also choose activation genes to focus on activation, or metabolic genes to focus on metabolism. For each cell space you choose, you will get different predictions from CPM, since the results will reflect a different aspect of the cells. For example, there may be more activated cells, so looking at activation will show dramatic changes in cell quantities; but if metabolism is not relevant to the process, you won't see any difference in cell quantities when using a cell space focused on metabolic genes. The reason I'm writing this is that you should choose the cell space for each cell type based on what you prefer, and if you believe in the integration, there is no problem in using it.
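The linearity assumption described in this reply (a bulk profile explained as a combination of per-cell RNA pools, U = ∑_i R_i ⋅ β_i) can be illustrated with a small sketch. This is a generic non-negative least-squares fit solved by projected gradient descent, not scBio's actual algorithm; the reference profiles, the weights, and the function names are all hypothetical.

```python
def mix(reference, weights):
    """Bulk profile as a weighted sum of reference profiles: U = sum_i R_i * beta_i."""
    n_genes = len(reference[0])
    return [sum(w * profile[g] for profile, w in zip(reference, weights))
            for g in range(n_genes)]

def fit_nonnegative(reference, bulk, steps=2000):
    """Projected gradient descent for min 0.5 * ||U - sum_i R_i beta_i||^2 with beta >= 0."""
    k = len(reference)
    # Gram matrix R^T R and right-hand side R^T U of the normal equations.
    gram = [[sum(a * b for a, b in zip(ri, rj)) for rj in reference] for ri in reference]
    rhs = [sum(a * b for a, b in zip(ri, bulk)) for ri in reference]
    # 1/trace(gram) is at most 1/(largest eigenvalue), which keeps the descent stable.
    step = 1.0 / sum(gram[i][i] for i in range(k))
    beta = [0.0] * k
    for _ in range(steps):
        grad = [sum(gram[i][j] * beta[j] for j in range(k)) - rhs[i] for i in range(k)]
        # Gradient step followed by projection onto the non-negative orthant.
        beta = [max(0.0, b - step * g) for b, g in zip(beta, grad)]
    return beta

# Hypothetical linear-scale reference profiles (3 cell types x 4 genes).
reference = [
    [10.0, 1.0, 0.0, 2.0],
    [0.0, 8.0, 1.0, 2.0],
    [1.0, 0.0, 9.0, 2.0],
]
true_weights = [0.5, 0.3, 0.2]

bulk = mix(reference, true_weights)        # simulated bulk sample
estimated = fit_nonnegative(reference, bulk)
proportions = [b / sum(estimated) for b in estimated]
```

Because the simulated bulk is an exact linear mixture on the linear scale, the fit recovers the weights used to build it. Real bulk and single-cell data differ in protocol and noise, which is the concern raised in the question, so this only shows what the linear model assumes, not how well it holds in practice.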

QqQss commented 5 years ago

Thanks for your quick reply! You mentioned "choose the cell space for each cell type based on what you prefer". Does that mean I cannot run CPM for all cell types at once? What I want is to find the proportion of every cell type in bulk tissue, according to the reference SC profile.

amitfrish commented 5 years ago

You should run CPM with all cell types at once. What I meant is that CPM calculates the cell quantities within cell types, so the cell space of every cell type should be as good as possible. CPM looks for neighboring cells within cell types and doesn't combine the cell spaces of different cell types.
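To make "neighboring cells within cell types" concrete, here is a small sketch of a neighbor search that is restricted to cells sharing the same label. It only illustrates the idea in this reply and is not taken from scBio; the coordinates, labels, and function name are hypothetical.

```python
import math

def neighbors_within_type(coords, cell_types, k):
    """For each cell, indices of its k nearest cells that share its cell-type label."""
    result = []
    for i, (point, label) in enumerate(zip(coords, cell_types)):
        # Candidate neighbors: every other cell with the same label.
        same_type = [j for j, other in enumerate(cell_types) if other == label and j != i]
        same_type.sort(key=lambda j: math.dist(point, coords[j]))
        result.append(same_type[:k])
    return result

# Hypothetical 2D cell-space coordinates and cell-type labels.
coords = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (0.05, 0.0), (5.1, 5.0)]
cell_types = ["T", "T", "B", "B", "B"]

nearest = neighbors_within_type(coords, cell_types, k=1)
```

Cell 0 (type T) sits closest to cell 3 in the coordinates, but cell 3 is labeled B, so the search returns cell 1 instead. In this sketch, distances only matter between cells of the same type, which matches the statement that the cell spaces of different cell types are not combined.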

QqQss commented 5 years ago

OK, I got it. I'll try it, and thank you again!