IGS / gEAR

The gEAR Portal was created as a data archive and viewer for gene expression data including microarrays, bulk RNA-Seq, single-cell RNA-Seq and more.

https://umgear.org

GNU Affero General Public License v3.0


Analyze derived single-cell workbench clusters in a reference spatial dataset #890

Open adkinsrs opened 2 months ago

adkinsrs commented 2 months ago

From Chris Shults (+ my edits for gEAR purposes)

gEAR would be most useful if users could compare their scRNASeq or scATACSeq data against it for cell annotation. They should be able to take the unlabeled clusters derived from the single-cell workbench and project them directly onto a reference spatial dataset that is already annotated, in order to determine cell types.

From Wei Song

Easy comparison or projection between spatial data and traditional scRNA-seq data, to match corresponding cell types, and to highlight marker gene expression, DEGs and so on.

(from me now)

I think this could work as follows: DEGs would be calculated for each cluster against the remaining clusters in the "compare genes" step of the workbench. This will give us an unweighted list of genes per cluster (or a labeled gene collection as per #598). That list could then be applied with ProjectR to the reference dataset. All of this could be automated so that gEAR outputs a downloadable file (h5ad, Seurat object for R, etc.) for the user to take back to RStudio for downstream analysis.
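As a rough illustration of the "unweighted genes per cluster" idea, here is a minimal sketch that turns per-cluster marker lists into a binary gene-by-cluster matrix, the kind of input a projection step could consume. The function name `binary_gene_matrix`, the `top_n` parameter, and the gene symbols are hypothetical and only for illustration; this is not gEAR's actual gene list format.

```python
def binary_gene_matrix(markers_by_cluster, top_n=None):
    """Build a binary (unweighted) gene x cluster membership matrix.

    markers_by_cluster: dict mapping cluster label -> ranked list of DE genes
    top_n: keep only the first top_n genes per cluster (None keeps all)
    Returns (genes, clusters, matrix) where matrix[i][j] is 1 if genes[i]
    is a marker of clusters[j], else 0.
    """
    clusters = sorted(markers_by_cluster)
    # Trim each ranked list to its top_n entries (a slice with None keeps everything).
    trimmed = {c: list(markers_by_cluster[c])[:top_n] for c in clusters}
    genes = sorted({g for gene_list in trimmed.values() for g in gene_list})
    members = {c: set(trimmed[c]) for c in clusters}
    matrix = [[1 if g in members[c] else 0 for c in clusters] for g in genes]
    return genes, clusters, matrix


# Illustrative input only: cluster labels and gene symbols are made up for the example.
markers = {
    "0": ["Sox2", "Pax2", "Gata3"],
    "1": ["Myo7a", "Pou4f3", "Sox2"],
}
genes, clusters, matrix = binary_gene_matrix(markers, top_n=2)
```

A gene that is a marker for several clusters simply gets a 1 in each of those columns, so no weighting information is carried over from the DE step.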

adkinsrs commented 2 months ago

I imagine that this work could be on a new page where you are required to import a saved analysis that has the "compareGenes" step complete. The single-cell workbench could include a linkout to a new page, automatically passing along the analysis ID information as well.

We would also need to decide which projectR algorithm to use. I believe any of them are feasible, but a default needs to be chosen (I guess PCA).

adkinsrs commented 1 month ago

attaching image Chris Shults had in the original email as well

adkinsrs commented 1 month ago

https://scanpy.readthedocs.io/en/stable/tutorials/spatial/integration-scanorama.html#data-integration-and-label-transfer-from-scrna-seq-dataset

Useful for transferring cell type labels from the single-cell dataset to the spatial dataset. It has some custom functions that we could modify for our needs.

https://scanpy.readthedocs.io/en/stable/tutorials/spatial/basic-analysis.html

Leaving these here for potential reference

adkinsrs commented 1 month ago

ProjectR-based workflow

1. Click a "save binary gene list" button after the Find Marker Genes step. This will save the DE genes and values of the top N results for each cluster (vs rest).
   - This will have to be built. This could also be the solution to #598.
2. Click link to go to spatial analysis page. This will bring the dataset ID and gene list ID.
3. User will select spatial dataset.
4. Calculate QC metrics and identify highly variable genes from spatial dataset (like in the sc workbench).
5. ProjectR runs.
   - Normally the output of projectR is samples x patterns and we substitute the patterns for the genes. However, our goal is to associate a cell cluster with each sample, so our existing plots do not address that.
6. Create a scanpy spatial plot. The "color" value of each cell will be the cluster label that has the maximum weight among all the clusters for that observation/cell.
   - There is the caveat that any other strong candidates for cell identity are ignored since we only take the top, or most likely, value.
   - Store most likely cell type in the adata.obs dataframe. Also store some p-value metric.
7. Add supplementary graphs as desired.
8. User clicks button to download h5ad or a converted Seurat object.

Step 4 will occur via cloud run function. Step 5 could potentially be a serverless function as well depending on the memory requirements.
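The label assignment in the plotting step (take the cluster with the maximum weight per cell) could be sketched as below. This assumes the projection output is available as a cells x clusters array of weights; the function name `assign_top_cluster` and the example values are hypothetical. The margin to the runner-up is included only as a simple way to flag cells with other strong candidates; it is not a p-value.

```python
import numpy as np


def assign_top_cluster(weights, cluster_names):
    """Pick the highest-weight cluster per cell.

    weights: array-like of shape (n_cells, n_clusters), e.g. projection weights
    cluster_names: sequence of n_clusters labels, in column order
    Returns (labels, margin), where margin is the top weight minus the
    second-highest weight for each cell (NaN if there is only one cluster).
    """
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(weights, axis=1)      # ascending column order per row
    rows = np.arange(weights.shape[0])
    best = order[:, -1]                      # column index of the max weight
    top = weights[rows, best]
    if weights.shape[1] > 1:
        runner_up = weights[rows, order[:, -2]]
    else:
        runner_up = np.full_like(top, np.nan)
    labels = np.asarray(cluster_names)[best]
    return labels, top - runner_up


# Made-up weights for two cells and three clusters.
example_weights = [[0.1, 0.9, 0.3],
                   [0.5, 0.4, 0.45]]
labels, margin = assign_top_cluster(example_weights, ["c0", "c1", "c2"])
```

Both arrays could be stored as columns in `adata.obs`, so that a small margin marks cells where the top label is a close call.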

adkinsrs commented 1 month ago

Scanorama-based workflow

1. Click a "save binary gene list" button after the Compare Genes/Clusters step. This will save the DE genes and values for each cluster (vs rest).
2. Click link to go to spatial analysis page. This will bring the dataset ID and gene list ID.
3. User will select spatial dataset.
4. Calculate QC metrics and identify highly variable genes from spatial dataset (like in the sc workbench).
5. Run scanorama.correct_scanpy, which will integrate and perform batch correction on the single-cell and spatial datasets.
6. Concatenate the datasets to create a common embedding between the two. Need to select options to ensure outer-join mechanics are used to keep what we need from both datasets. This returns an AnnData object which we could optionally save.
   - Possibly run this step serverless.
7. Compute distances between samples. These will serve as weights for propagating labels from the scRNA-seq dataset to the spatial dataset.
8. Assign a cluster label to each spatial cell by identifying the maximum weight from all clusters for each cell.
   - This slightly differs from the tutorial, where the weights for each cell type label are passed to adata.obs, and the weights by each cluster are plotted. This is certainly an optional plot as well.
9. Create a scanpy spatial plot. The "color" value of each cell will be the cluster label that has the maximum weight among all the clusters for that observation/cell.
10. Add supplementary graphs as desired.
11. User clicks button to download h5ad or a converted Seurat object.

Many of these steps are redundant with the ProjectR workflow, with the biggest difference being the tool used. To me, the distance calculations would be the only potentially memory-heavy step.
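The distance and label-assignment steps could look roughly like the sketch below. It assumes both datasets already share an integrated embedding (for example, the corrected output of the integration step), uses cosine similarity as the weight, and averages the weights per cluster before taking the maximum. This is a simplified stand-in, not the tutorial's exact function, and the name `transfer_labels` is hypothetical. Rows are assumed to be non-zero vectors.

```python
import numpy as np


def transfer_labels(ref_embedding, ref_labels, query_embedding):
    """Propagate reference labels to query cells via cosine similarity.

    ref_embedding: (n_ref, n_dims) embedding of the labeled single-cell data
    ref_labels: n_ref cluster labels
    query_embedding: (n_query, n_dims) embedding of the spatial data
    Returns (labels, scores, classes); scores has shape (n_query, n_classes).
    """
    ref = np.asarray(ref_embedding, dtype=float)
    query = np.asarray(query_embedding, dtype=float)
    # Normalize rows so the dot product equals cosine similarity.
    ref_unit = ref / np.linalg.norm(ref, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query, axis=1, keepdims=True)
    similarity = query_unit @ ref_unit.T          # (n_query, n_ref)
    classes = np.unique(ref_labels)               # sorted unique labels
    one_hot = (np.asarray(ref_labels)[:, None] == classes[None, :]).astype(float)
    # Mean similarity to the members of each cluster, so large clusters
    # do not win purely by size (an assumption of this sketch).
    scores = similarity @ one_hot / one_hot.sum(axis=0)
    return classes[scores.argmax(axis=1)], scores, classes


# Toy 2-D embeddings, made up for the example.
ref_emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
ref_lab = ["A", "A", "B", "B"]
query_emb = [[1.0, 0.05], [0.05, 1.0]]
pred, scores, classes = transfer_labels(ref_emb, ref_lab, query_emb)
```

The `similarity` matrix has one entry per (spatial cell, single-cell) pair, which is where the memory cost of this step comes from; computing it in chunks of query rows would be one way to bound it.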

adkinsrs commented 1 month ago

Some example supplementary graphs after identifying cell type labels. It may also be worth storing the weights per cluster as observation columns, so that we can take advantage of the extra data.

Neighbor enrichment analysis between clusters
Ripley's statistic to determine if clusters have a random, clustered, or dispersed pattern at given scales
Moran's I-score to show random, clustered, or dispersed patterns for a selected gene.
Centrality scores
- closeness centrality - measure of how close the group is to other nodes.
- clustering coefficient - measure of the degree to which nodes cluster together.
- degree centrality - fraction of non-group members connected to group members.
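For reference, the Moran's I statistic mentioned in the list above can be written out directly. The sketch below is a plain NumPy version of the standard global formula, assuming a dense spatial weight matrix (for example, a neighbor adjacency matrix); in practice a spatial library would likely compute this on a sparse neighbor graph instead. The function name `morans_i` and the toy graph are illustrative only.

```python
import numpy as np


def morans_i(values, weights):
    """Global Moran's I for one variable (e.g. one gene's expression).

    values: length-n vector of observations, one per spot/cell
    weights: (n, n) spatial weight matrix, weights[i, j] > 0 if i and j are neighbors
    I = (n / sum(w)) * sum_ij w_ij * z_i * z_j / sum_i z_i**2, with z = x - mean(x)
    """
    x = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    z = x - x.mean()
    return (x.size / w.sum()) * (z @ w @ z) / (z @ z)


# Four spots in a line, each connected to its immediate neighbors.
chain = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]])
clustered = morans_i([1, 1, 0, 0], chain)    # similar values sit next to each other
alternating = morans_i([1, 0, 1, 0], chain)  # neighbors always differ
```

Positive values indicate a clustered pattern, values near zero a random one, and negative values a dispersed one, which matches the three cases listed for the supplementary graphs.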

adkinsrs commented 1 month ago

https://spatialdata.scverse.org/en/stable/tutorials/notebooks/notebooks/examples/squidpy_integration.html

Showcases neighborhood enrichment + spatial visualization of clusters (though we would use the single-cell clusters instead of deriving them from the spatial dataset).

adkinsrs commented 1 month ago

The Fertig lab also requested a similar "projectR->transfer clusters to second dataset" feature in gEAR. This may also loosely overlap with #411.

adkinsrs commented 1 month ago

We can also run Leiden clustering on the spatial data itself and do a cross-modality clustering comparison for some validation.
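One possible way to score that comparison is the adjusted Rand index between the transferred single-cell labels and the Leiden clusters found on the spatial data. The sketch below is a standard-library implementation of the usual formula; choosing ARI as the agreement metric is an assumption here, not something decided in this issue, and the example labels are made up.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same cells.

    1.0 means identical partitions (label names do not matter),
    values near 0 mean agreement is about what chance would give.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same cells")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to compare
    # Count pairs of cells grouped together in both, in a only, and in b only.
    together = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)
    maximum = (pairs_a + pairs_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. every cell in one cluster in both
    return (together - expected) / (maximum - expected)


# Transferred labels vs. Leiden cluster IDs for four spatial cells (toy values).
same_partition = adjusted_rand_index(["HC", "HC", "SC", "SC"], ["1", "1", "0", "0"])
crossed = adjusted_rand_index(["a", "a", "b", "b"], ["x", "y", "x", "y"])
```

Because the index ignores label names, it works even though the Leiden clusters are numbered and the transferred labels are cell types.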