dpeerlab / SEACells

SEACells algorithm for Inference of transcriptional and epigenomic cellular states from single-cell genomics data
GNU General Public License v2.0

Computing SEACells on individual samples #40

Closed learning-MD closed 1 year ago

learning-MD commented 1 year ago

@sitarapersad et al., thanks for this incredibly interesting package! I'm really excited about metacells, cluster-free DE, etc., coming through in scRNA-seq analyses to better dissect true biological differences between health states rather than risking blunting both technical and biological differences with current integration methods.

As someone who mostly uses R rather than Python, I was hoping to get some guidance on the following:

  1. For the COVID example, did you fully process each individual sample (e.g., removing low quality cells/doublets, normalization, feature selection, and dim. reduction with cluster annotation) before running SEACells on them? And then merge the 20 .h5ad files together in order to generate the aggregated metacell x gene expression matrix? Was curious, as there were already cell state labels in the vignette (https://github.com/dpeerlab/SEACells/blob/main/notebooks/SEACell_COVID_integration.ipynb). The reason I ask is that I currently have a fully integrated dataset from ~22 samples from the Multiome kit (spanning two conditions: healthy and disease) - a Seurat object where I did batch correction using Harmony with each individual patient sample as a batch. Figure 6 in your paper was really illuminating, leading me here. I wanted to just work with the RNA side first to get the hang of this tool. I'm wondering whether I can subset my 22 patient Seurat object into individual samples and run the initial SEACells on each one individually without needing to re-run all the pre-processing. Happy to start from scratch, however, if that's your recommendation. EDIT: I assume the approach would be the same for ATAC in the paired Multiome dataset?

  2. If different from the above, could you please share the pre-processing code that you used on each individual sample?

  3. Would you happen to have example scripts of how you used meta2cells in downstream analyses? For example, I'm interested in performing DEG analysis using 1) voom-limma and 2) miloDE (for "cluster"-free DEG) and wondering how I should modify my typical input - comparing across individual meta2cells rather than the clusters that I'm used to?

Thanks!

sitarapersad commented 1 year ago
  1. Yes, for the COVID example we fully processed each sample individually before running SEACells. Say we have 20 samples: we then have 20 summarized SEACell anndatas. These aggregated anndatas were then merged and batch-corrected, and SEACells was run once more to generate SEA2Cells. I think that if you already have a large aggregated single-cell anndata, it's fine to run SEACells on it without re-running all the preprocessing. However, we observed that running SEACells on each individual sample led to weaker batch effects, so in some sense there was less 'work' for the batch correction algorithm to do in adjusting the PCs if we ran SEACells first.

  2. I will try to look for some of these scripts for Milo (don't have any for voom-limma, sorry). But essentially, we treat the meta2cells in place of the clusters you would typically use!
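
For readers following along, here is a rough sketch of the per-sample aggregation and merge steps described above, using only pandas and toy data. It is not the code used for the paper. It assumes each cell already has a metacell label (in practice these would come from running SEACells on each sample), and the helper `summarize_by_metacell`, the gene names, and the sample names are all made up for illustration. Batch correction and the second SEACells run are not shown.

```python
import numpy as np
import pandas as pd


def summarize_by_metacell(counts: pd.DataFrame, labels: pd.Series) -> pd.DataFrame:
    """Sum raw counts over the cells assigned to each metacell.

    counts: cells x genes table of raw counts.
    labels: metacell label for every cell, indexed like `counts`.
    (Hypothetical helper for illustration only.)
    """
    return counts.groupby(labels).sum()


# Toy inputs: two samples, six cells each, three genes (all names are made up).
rng = np.random.default_rng(0)
genes = ["GENE_A", "GENE_B", "GENE_C"]
samples = {"healthy_1": "healthy", "disease_1": "disease"}

summaries = []
metadata = []
for sample, condition in samples.items():
    cells = [f"{sample}_cell{i}" for i in range(6)]
    counts = pd.DataFrame(rng.poisson(5, size=(6, 3)), index=cells, columns=genes)

    # Assumed to be the output of a per-sample SEACells run: one label per cell.
    labels = pd.Series(
        [f"{sample}_SEACell-{i // 3}" for i in range(6)], index=cells, name="SEACell"
    )

    # One row per metacell for this sample.
    summary = summarize_by_metacell(counts, labels)
    summaries.append(summary)

    # Keep track of which sample and condition each metacell came from.
    metadata.append(
        pd.DataFrame({"sample": sample, "condition": condition}, index=summary.index)
    )

# Merge the per-sample summaries into one metacell x gene matrix.
metacell_counts = pd.concat(summaries)
metacell_meta = pd.concat(metadata)

print(metacell_counts)
print(metacell_meta)
```

The merged `metacell_counts` table, together with `metacell_meta`, is the kind of input that would then go through batch correction, with each metacell (rather than each cluster) acting as a unit of observation downstream.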