Nanostring-Biostats / InSituType

An R package for performing cell typing in SMI and other single cell data

obtaining reference profiles #194

Closed HelenaLC closed 5 months ago

HelenaLC commented 8 months ago

Dear developers, firstly, thanks for an installable, runnable, and documented tool - that is a rarity and highly appreciated!

We've run InSituType on data from 2 projects (different tissues), and preliminary results look promising. However, a couple of questions came up regarding the (semi-)supervised mode - apologies in advance if I missed something in the paper and/or demos:

  1. The vignette states that reference profiles should be on a linear scale, not a log scale. However, the example profiles contain non-integers, so I assume they are either average counts or summed normalized counts. Hence my question: how exactly are they obtained? I.e., how exactly are you pseudo-bulking the single-cell profiles?

  2. Related to the above: Our references contain data from multiple replicates. Can this be accounted for in any way (if so, how?), or are the profiles you provide from single-sample scRNA-seq data only? I believe, typically, this would be taken care of during, say, DGE analysis when identifying cluster signatures to use for label transfer. But this does not apply here, which is neat (!) but makes it less obvious how to account for a multi-sample reference.

Just to add that, so far, I have attempted the following (as inspired by previous work on pseudo-bulks), but am not entirely sure this is the way to go:

  1. sum (raw) counts by sample-cluster to obtain pseudo-bulks
  2. apply library size normalization to account for, e.g., cell counts
  3. average pseudo-bulks (normalized counts) across samples to obtain cluster-level profiles
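
The three steps above can be sketched as follows. This is a minimal Python illustration of the arithmetic only, not InSituType code; the function name `pseudobulk_profiles`, the input layout (one list of gene counts per cell), and the scaling target `scale` are assumptions made for the example.

```python
from collections import defaultdict

def pseudobulk_profiles(counts, samples, clusters, scale=1e6):
    """Hypothetical helper: cluster-level profiles from a multi-sample reference.

    counts:   list of per-cell raw count vectors (same gene order in each)
    samples:  per-cell sample labels
    clusters: per-cell cluster labels
    Returns {cluster: profile} on a linear (not log) scale.
    """
    n_genes = len(counts[0])

    # Step 1: sum raw counts within each (sample, cluster) pair.
    bulks = defaultdict(lambda: [0.0] * n_genes)
    for cell, s, c in zip(counts, samples, clusters):
        bulk = bulks[(s, c)]
        for g, x in enumerate(cell):
            bulk[g] += x

    # Step 2: library-size normalize each pseudo-bulk so it sums to `scale`.
    normed = {}
    for key, bulk in bulks.items():
        total = sum(bulk)
        if total > 0:  # skip pseudo-bulks with no counts at all
            normed[key] = [x / total * scale for x in bulk]

    # Step 3: average the normalized pseudo-bulks across samples, per cluster.
    by_cluster = defaultdict(list)
    for (_, c), vec in normed.items():
        by_cluster[c].append(vec)
    return {c: [sum(col) / len(vecs) for col in zip(*vecs)]
            for c, vecs in by_cluster.items()}

# Toy data: 4 cells, 2 genes, 2 samples, 2 clusters.
counts = [[1, 3], [1, 3], [3, 1], [0, 4]]
samples = ["s1", "s1", "s2", "s2"]
clusters = ["A", "A", "A", "B"]
profiles = pseudobulk_profiles(counts, samples, clusters, scale=100)
```

With this scheme each sample contributes equally to a cluster's profile regardless of how many cells it holds, and the result is non-integer but still on a linear scale.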

patrickjdanaher commented 5 months ago

Late answers to questions:

  1. Yep, the example reference profiles contain non-integer values because they're calculated by averaging many cells per cell type. Only the counts matrix must be integers.
  2. Multi-sample references perform OK, especially if from the same platform. The ioprofiles reference is in fact a multi-sample reference (see the spatialDecon paper for details).
  3. How to derive a reference profile: use the new function "getRNAprofiles", or just perform the procedure described in the Issue above.
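
As a rough illustration of answer 1, averaging integer counts over the cells of each type naturally gives non-integer profiles on a linear scale. The Python sketch below shows only that averaging step; `mean_profiles` is a hypothetical name, and this is not a description of what `getRNAprofiles` does internally, which this thread does not specify.

```python
def mean_profiles(counts, cell_types):
    """Hypothetical helper: mean raw count per gene for each cell type.

    counts:     list of per-cell integer count vectors (same gene order)
    cell_types: per-cell cell type labels
    """
    n_genes = len(counts[0])
    sums, n_cells = {}, {}
    for cell, t in zip(counts, cell_types):
        # Accumulate per-gene totals and the number of cells for this type.
        acc = sums.setdefault(t, [0.0] * n_genes)
        for g, x in enumerate(cell):
            acc[g] += x
        n_cells[t] = n_cells.get(t, 0) + 1
    # Divide totals by cell counts to get the per-type mean profile.
    return {t: [x / n_cells[t] for x in acc] for t, acc in sums.items()}

# Toy data: 3 cells, 2 genes; two T cells and one B cell.
toy_counts = [[1, 0], [2, 1], [0, 4]]
toy_types = ["T", "T", "B"]
toy_profiles = mean_profiles(toy_counts, toy_types)
```

Here the two T cells average to 1.5 and 0.5 counts for the two genes, which is why a reference built from integer counts can legitimately contain fractions.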