amitfrish / scBio

Single Cell Genomics for Enhancing Cell Composition Inference from Bulk Genomics Data

About inputs #18

Closed Zhaohui-Ruan closed 2 years ago

Zhaohui-Ruan commented 3 years ago

Hi Amit, I found that the example data you provided are non-zero integers. Does Scaden require raw counts as input? Did you apply any preprocessing step to remove the 0s?

I only have one group in my data and I want to use CPM to infer the absolute fractions for each cell type in bulk data. I have a few questions:

  1. In the example data you provided (data(SCFlu) and data(BulkFlu)), the values in the scRNA-seq data and the bulk data are very small. So I have some questions about the data CPM requires.
    • Should both the scRNA-seq matrix and the bulk data be normalized?
    • Should both the scRNA-seq matrix and the bulk data be in log scale?
  2. If both the scRNA-seq matrix and the bulk data should be in log scale, can I just run cpm() without first running the following commands? BulkFluAbs = exp(BulkFlu)-1; SCFluAbs = exp(SCFlu)-1

Ruan

amitfrish commented 3 years ago

Hey Ruan, Sorry for the late response; for some reason I didn't get a notification about your issue. In general, CPM was designed for cell state deconvolution, and I'm not sure why you chose it specifically in your case. It was not described as the best method for fine-tuned calculation of cell type abundance. I'm also not sure what the question about Scaden is. That is another deconvolution algorithm and it's not related to CPM at all. The data that we provided is actually normalized, not raw counts, and it contains many zeros. Since you only have one group, you should run CPM in an absolute (non-relative) scenario, and therefore use linear-scale data rather than log-scaled data.

Regarding your other questions:

  1. CPM can work with both raw counts and normalized values. However, it's important to use the same normalization for both the single-cell and the bulk data. Linear scale is usually preferred for most deconvolution algorithms, CPM included. In a relative scenario, however (described in the readme; not your situation), it is better to log-normalize the data.
  2. Yes. The example data was prepared for a relative scenario, so running a non-relative one requires the exp transformation of the data. If you run it in a relative scenario, you should use the log-scaled data.
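
The back-transformation mentioned in point 2 can be sketched in base R as follows. This is only an illustration on made-up toy matrices, not the packaged SCFlu/BulkFlu data. It assumes, as the exp(...)-1 commands in the question imply, that the log-scaled values were produced with a natural log and a pseudocount of 1. All object names (bulk_linear, sc_linear, and so on) are hypothetical.

```r
# Toy stand-ins for genes x samples (bulk) and genes x cells (single-cell)
# matrices. The values are invented for illustration only.
set.seed(1)
genes <- paste0("gene", 1:5)
bulk_linear <- matrix(rexp(15, rate = 0.1), nrow = 5,
                      dimnames = list(genes, paste0("sample", 1:3)))
sc_linear <- matrix(rpois(20, lambda = 2), nrow = 5,
                    dimnames = list(genes, paste0("cell", 1:4)))

# Assumed log normalization: natural log with a pseudocount of 1.
bulk_log <- log(bulk_linear + 1)
sc_log <- log(sc_linear + 1)

# Back to linear scale for an absolute (non-relative) run.
# expm1(x) computes exp(x) - 1, with better precision near zero.
bulk_abs <- expm1(bulk_log)
sc_abs <- expm1(sc_log)

# Both matrices should be on the same scale and share the same genes.
stopifnot(identical(rownames(bulk_abs), rownames(sc_abs)))
stopifnot(all(bulk_abs >= 0), all(sc_abs >= 0))
```

If the log-scaled data used a different base or pseudocount, the inverse transform would need to change accordingly; the key point is that the same transformation is applied to both the single-cell and the bulk matrices.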

I hope this helps. Let me know if you need any additional help. Amit