MenonLab / Celmod

A computational approach to extrapolate cell type proportion estimates using matched bulk data and cell type counts

Usage questions and clarifications #1

Open JeGrundman opened 4 months ago

JeGrundman commented 4 months ago

Hi,

Thank you for providing this software. I have some questions about its usage that I was hoping you might be able to answer.

  1. In the tutorial, bdat_initial is the gene x sample matrix for the bulk data, and classprops_initial appears to be the sample x cell type matrix of cell type proportions for those bulk samples. Do these proportions come directly from the single-cell data? That is, I’m assuming each bulk sample’s proportions are just the cell type proportions from the same sample’s single-cell data.

  2. The vignette seems to apply predict_estimates.R to bdat_initial, but I thought the idea behind this package was to create a model from matched single-cell and bulk data. I assume the vignette does this for convenience, and that an actual run would use not bdat_initial but only bulk data without single-cell matches. Could you confirm this is correct?

  3. For the genes in bdat_initial, what exactly are these? In Cain et al 2023, you describe 5 steps for Celmod:

    1. A filtering step where, for each cell type, it seems genes are kept only if they have counts > 100 and mean CPM > 10.
    2. Linear regression on each gene for each cell cluster
    3. Predicting the proportion of each cell type
    4. Ranking genes by 90th percentile for each cell type
    5. Selecting top genes

I assumed that step 1 needed to be done before creating the bdat_initial object, as I don’t see those filtering steps in the code for train_model.R, and that 2–5 were done in train_model.R. But since each gene is filtered per cell type, are the genes in bdat_initial (and in the ultimate bulk dataset you predict on) supposed to be the union of all the genes that passed filters for each cell type? If I’ve misunderstood any of these steps, please let me know!

Thanks again!

MenonLab commented 4 months ago

These are good questions - responses below:

  1. Correct, the input matrices for training are the bulk expression data and the single-cell-derived proportions for those same bulk samples. In the paper, we used single-cell RNA-seq-derived proportions, but these could also come from flow cytometry or tissue-based studies (IHC/ISH), if those are available for the matched bulk samples.
  2. Also correct - the prediction should be run on the bulk samples for which there are no matched single-cell proportions. In the vignette, we used the same matrix for convenience, but we will make a note to clarify the actual use case. Thanks for pointing this out.
  3. The genes in the training bulk data set can be filtered in any suitable way. In Cain et al. 2023, we filtered on counts and mean CPM, but Celmod does not require this to run. This pre-filtering is not included in Celmod, and we recommend trying Celmod with different pre-filterings to assess the robustness of the predictions. The filtering can be done on the bulk RNA-seq data set or on each cell type-specific data set (from single-cell data). In the paper, we predicted the subclusters in each major class separately, so we used a different gene set for astrocytes, oligodendrocytes, glutamatergic neurons, etc., based on single-cell data. If single-cell expression data is not available, only the bulk data can be used for filtering. Either way, the choice of filter is at the discretion of the user, based on any prior knowledge of the single-cell or bulk data sets.
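For readers who want to see what a counts and mean-CPM pre-filter could look like, here is a minimal sketch in plain Python. It is not Celmod code and calls no Celmod functions. The thresholds (counts > 100, mean CPM > 10) are the ones quoted in this thread; reading "counts" as the total count of a gene across samples is an assumption, and the names `cpm_table`, `filter_genes`, and `toy_counts`, as well as the toy numbers, are made up for illustration.

```python
def cpm_table(counts):
    """Convert a {gene: [count per sample]} table to counts per million."""
    n_samples = len(next(iter(counts.values())))
    # Library size = total counts in each sample across all genes.
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {
        gene: [row[j] / lib_sizes[j] * 1e6 for j in range(n_samples)]
        for gene, row in counts.items()
    }


def filter_genes(counts, min_total_count=100, min_mean_cpm=10):
    """Keep genes whose total count and mean CPM both exceed the thresholds.

    Assumption: "counts" means the gene's total count summed over samples.
    """
    cpm = cpm_table(counts)
    kept = []
    for gene, row in counts.items():
        total_count = sum(row)
        mean_cpm = sum(cpm[gene]) / len(row)
        if total_count > min_total_count and mean_cpm > min_mean_cpm:
            kept.append(gene)
    return kept


# Toy gene x sample counts; each sample has a library size of 10,000,000.
toy_counts = {
    "GENE_A": [6_000_000, 5_000_000],
    "GENE_B": [3_999_900, 4_999_920],
    "GENE_C": [60, 60],  # total 120 passes, but mean CPM is 6 and fails
    "GENE_D": [40, 20],  # total 60 fails the count threshold
}

kept_genes = filter_genes(toy_counts)
print(kept_genes)  # ['GENE_A', 'GENE_B']
```

To mirror the per-class approach described above, the same function could be applied once per major class, for example to a cell type-specific count table for each class, giving one gene set per class rather than a single shared set.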

Please let us know if you have additional questions.
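As a rough illustration of points 1 and 2, the sketch below separates bulk samples into a training set (those with matched proportions) and a prediction set (those without). It is plain Python, not Celmod code; the function name `split_bulk_samples` and the sample IDs are hypothetical.

```python
def split_bulk_samples(bulk_sample_ids, proportion_sample_ids):
    """Split bulk samples by whether matched cell type proportions exist.

    Returns (train_ids, predict_ids), both in the original bulk order.
    """
    matched = set(proportion_sample_ids)
    train_ids = [s for s in bulk_sample_ids if s in matched]
    predict_ids = [s for s in bulk_sample_ids if s not in matched]
    return train_ids, predict_ids


# Hypothetical sample IDs: five bulk samples, two with matched proportions.
bulk_ids = ["S1", "S2", "S3", "S4", "S5"]
proportion_ids = ["S2", "S4"]

train_ids, predict_ids = split_bulk_samples(bulk_ids, proportion_ids)
print(train_ids)    # ['S2', 'S4']
print(predict_ids)  # ['S1', 'S3', 'S5']
```

The training matrices would then be subset to `train_ids` and the prediction step run on the bulk columns in `predict_ids`, instead of reusing the training matrix as the vignette does for convenience.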

JeGrundman commented 4 months ago

Thank you! I appreciate your help.