DE (notes from small group meeting):
FindMarkers or FindConservedMarkers or FindAllMarkers
FindMarkers and FindAllMarkers used routinely, usually at cell type annotation level, less so FindConservedMarkers
FindAllMarkers is for each cluster vs the rest of the clusters; FindMarkers is for one cluster (ident.1) vs one other cluster or group of clusters (ident.2), or vs all other cells if ident.2 is not given
MAST or wilcox test possibly more useful for cluster1 v cluster2 comparisons than FindMarkers but need replicates
Can show only upregulated markers as opposed to both up and down
Violin plots, dot plots, UMAPs, or Seurat DoHeatmap to show expression of markers across clusters, depending on number of features
Calculate additional metric using results of FindMarkers or FindAllMarkers: (pct.1 - pct.2) * pct.1, i.e. the difference between pct.1 and pct.2 multiplied by pct.1. Prioritizes markers with a big difference between groups and a high frequency of expression in the cluster of interest
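A minimal sketch of the metric above, in plain Python for illustration only (the group works in R). The row layout and the gene names are hypothetical; only the pct.1 / pct.2 column names follow the FindMarkers output.

```python
def rank_markers(markers):
    """Add pct_score = (pct.1 - pct.2) * pct.1 to each row and sort, highest first."""
    scored = []
    for row in markers:
        score = (row["pct.1"] - row["pct.2"]) * row["pct.1"]
        scored.append({**row, "pct_score": score})
    return sorted(scored, key=lambda r: r["pct_score"], reverse=True)


# Hypothetical marker rows (made-up genes and percentages, not real results).
example_markers = [
    {"gene": "GeneA", "pct.1": 0.90, "pct.2": 0.10},  # big difference, frequent
    {"gene": "GeneB", "pct.1": 0.30, "pct.2": 0.05},  # decent difference, rare
    {"gene": "GeneC", "pct.1": 0.95, "pct.2": 0.85},  # frequent, small difference
]

ranked = rank_markers(example_markers)
for row in ranked:
    print(row["gene"], round(row["pct_score"], 3))
```

GeneA ranks first because it is both frequent in the cluster and rare outside it; a gene that is frequent everywhere (GeneC) or rare everywhere (GeneB) scores low.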
No one has yet used non-default statistical tests for these functions
Run these functions on RNA assay, other discussion in non-HCBC community
MAST: run with the MAST package directly, not inside Seurat. Add sample ID as a random effect, (1 | sample_id), and number of genes detected per cell (nGenes) to the model, plus any other variables as you would for bulk RNA-seq
Need to run MAST independently; Seurat's built-in MAST option does not fit this model (no sample-level random effect)
Formula for MAST: number of genes detected in each cell, sample_id. If sample_id is not included, cells from the same sample are treated as independent replicates (pseudoreplication) = bad. Include any other variables of interest: sex, genotype, age, etc.
Replicates not strictly necessary for MAST b/c it is designed for low-replicate sc data. In contrast, DESeq2 pseudobulk results can be driven by a handful of cells expressing the gene.
If only a few cells express a gene at high levels, will get a false positive in pseudobulk and a true negative in MAST; need to plot single-cell expression for the pseudobulk hits to find this
Pseudobulk more conservative, MAST more sensitive, need more info to make sense of MAST results
Filter for % of cells expressing a gene, remove gene if <10%
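A rough sketch of the 10% filter, again in plain Python for illustration. The notes do not say which group the percentage refers to; this sketch assumes a gene is kept if at least 10% of cells express it in either group (pct.1 or pct.2). That choice, the row layout, and the gene names are assumptions.

```python
def filter_by_pct_expressed(rows, min_pct=0.10):
    """Keep genes expressed in at least min_pct of cells in either group.

    Assumption: "either group" is our reading; the notes only say "<10%".
    """
    return [r for r in rows if max(r["pct.1"], r["pct.2"]) >= min_pct]


# Hypothetical rows (made-up genes and percentages).
example_rows = [
    {"gene": "GeneA", "pct.1": 0.40, "pct.2": 0.02},  # kept: 40% in group 1
    {"gene": "GeneB", "pct.1": 0.05, "pct.2": 0.08},  # removed: <10% in both
    {"gene": "GeneC", "pct.1": 0.03, "pct.2": 0.25},  # kept: 25% in group 2
]

kept = filter_by_pct_expressed(example_rows)
print([r["gene"] for r in kept])
```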
No particular requirements for n cells per cluster as long as results reasonable
Worth running non-interactively as an sbatch script on O2 w/ up to 250G memory; interactive sessions are slower and run out of memory, although can try with the future package
Emma to supply sbatch script for running MAST
If no replicates but comparing between conditions, use MAST but don't trust the p values because of potential pseudoreplication issues. If comparing between clusters, use FindMarkers or FindAllMarkers
pseudobulk: 2 reps, > 10 cells per cluster and sample. Minimum cells?
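A minimal sketch of the aggregation step with the > 10 cells rule, in plain Python for illustration. The cell records, field names, and sample labels are hypothetical; the actual scripts mentioned in these notes use Seurat or SingleCellExperiment.

```python
from collections import defaultdict


def pseudobulk(cells, min_cells=10):
    """Sum raw counts per (cluster, sample); keep groups with > min_cells cells."""
    sums = defaultdict(lambda: defaultdict(int))
    n_cells = defaultdict(int)
    for cell in cells:
        key = (cell["cluster"], cell["sample"])
        n_cells[key] += 1
        for gene, count in cell["counts"].items():
            sums[key][gene] += count
    return {key: dict(sums[key]) for key, n in n_cells.items() if n > min_cells}


# Hypothetical data: 12 cells from sample s1 and 5 cells from sample s2.
example_cells = (
    [{"cluster": "T", "sample": "s1", "counts": {"GeneA": 2, "GeneB": 1}}] * 12
    + [{"cluster": "T", "sample": "s2", "counts": {"GeneA": 3}}] * 5
)

bulk = pseudobulk(example_cells)
print(bulk)
```

Here the (T, s2) group is dropped because it has only 5 cells, so cluster T would be left with a single replicate and fall short of the 2-rep rule.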
Do only for clusters of scientific interest, removes temptation of desired but potentially unreliable results based on small clusters
10-50 cells: use correlation analysis to decide if results are reliable. From Zhu: https://files.slack.com/files-tmb/T02AQJ7QD-F07G17NF4VB-cdc36f6d37/image_720.png
Seurat pseudobulk scripts from Noor, Zhu, James. Whose is simpler?
SingleCellExperiment code from Amelie
Plot genes at single-cell level. What to plot? SCT, DESeq2 normalized? Log?
DESeq2 normalized expression at pseudobulk level for top genes, log2 normalized for violin plot. SCT creates artifacts? SCT is harder to interpret b/c each gene has its own residuals and the values are not continuous
Comparison of MAST vs pseudobulk, most conservative approach is taking the overlap.
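The overlap approach can be sketched as a set intersection, in plain Python for illustration. The 0.05 adjusted p-value cutoff, the gene names, and the values are assumptions; None stands in for genes with no adjusted p value (e.g. NA in DESeq2 output).

```python
def significant_genes(results, alpha=0.05):
    """Return genes whose adjusted p value is present and below alpha."""
    return {g for g, padj in results.items() if padj is not None and padj < alpha}


def conservative_overlap(mast_results, pseudobulk_results, alpha=0.05):
    """Genes called significant by both MAST and pseudobulk."""
    return significant_genes(mast_results, alpha) & significant_genes(
        pseudobulk_results, alpha
    )


# Hypothetical adjusted p values (made up for illustration).
mast_padj = {"GeneA": 0.001, "GeneB": 0.020, "GeneC": 0.300}
pseudobulk_padj = {"GeneA": 0.010, "GeneB": None, "GeneC": 0.040}

overlap = conservative_overlap(mast_padj, pseudobulk_padj)
print(sorted(overlap))
```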