cmzuo11 / stMVC


Elucidating tumor heterogeneity from spatially resolved transcriptomics data by multi-view graph collaborative learning.


Overview of the stMVC model. (a) Given spatially resolved transcriptomics (SRT) data with four-layer profiles, histological images (I), spatial locations (S), gene expression (X), and manual cell segmentation (Y), as input, stMVC integrates them to disentangle tissue heterogeneity, particularly for tumors. (b) stMVC adopts the SimCLR model, with a ResNet-50 feature-extraction backbone, to efficiently learn a visual feature h_i for each spot v_i by maximizing agreement between differently augmented views of the same spot image I_i via a contrastive loss in the latent space l_i, and then constructs the histological similarity graph (HSG) from the learned visual features h_i. (c) stMVC adopts a semi-supervised graph attention autoencoder (SGATE) to learn a view-specific representation (P_i^1 and P_i^2) for each of the two graphs, HSG and the spatial location graph (SLG), using the latent features extracted from the gene expression data by an autoencoder-based framework as the feature matrix. The SGATE for each view is trained under weak supervision from the cell segmentation to capture an efficient low-dimensional manifold structure, and the two view graphs are simultaneously integrated into robust representations R_i by learning the weights of the different views via an attention mechanism. (d) The robust representations R_i can be used to elucidate tumor heterogeneity: detecting spatial domains, visualizing the relative distance between different domains, and further denoising the data.
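For intuition, the attention-based fusion in panel (c) can be sketched in a few lines of PyTorch. This is a minimal illustration with hypothetical module and tensor names, not the author's implementation: the two view-specific representations are scored by a small shared network, softmax-normalized into per-spot view weights, and combined into a robust representation.

import torch
import torch.nn as nn

class ViewAttentionFusion(nn.Module):
    """Minimal sketch: attention-weighted fusion of two view representations."""
    def __init__(self, dim, hidden=16):
        super().__init__()
        self.project = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh())
        self.score = nn.Linear(hidden, 1, bias=False)

    def forward(self, p1, p2):
        # p1, p2: (n_spots, dim) view-specific representations (HSG and SLG views)
        views = torch.stack([p1, p2], dim=1)   # (n_spots, 2, dim)
        w = self.score(self.project(views))    # (n_spots, 2, 1) raw view scores
        w = torch.softmax(w, dim=1)            # per-spot attention weights over views
        return (w * views).sum(dim=1)          # robust representation R

fusion = ViewAttentionFusion(dim=10)
r = fusion(torch.randn(100, 10), torch.randn(100, 10))  # (100, 10)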

Installation

Install stMVC

Installation was tested on Red Hat 7.6 with Python 3.6.12 and torch 1.6.0, on a machine with one 40-core Intel(R) Xeon(R) Gold 5115 CPU, 132 GB RAM, and two NVIDIA TITAN V GPUs (24 GB total). stMVC is implemented in the PyTorch framework. Please run stMVC on CUDA if possible (a quick GPU-visibility check is sketched after the installation steps below).

1. Grab source code of stMVC

git clone https://github.com/cmzuo11/stMVC.git

cd stMVC

2. Install stMVC in the virtual environment by conda

conda create -n stMVC python=3.6.12 pip

source activate

conda activate stMVC

pip install -r used_package.txt
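Once the requirements are installed, the following minimal check (not part of the stMVC scripts) confirms that PyTorch can see your CUDA devices:

import torch

print(torch.__version__)           # tested with 1.6.0
print(torch.cuda.is_available())   # True means stMVC can run on CUDA
print(torch.cuda.device_count())   # number of visible GPUs, e.g., 2 for the setup above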

Install histological image annotation software (labelme)

Installation was tested on Windows 10 with an Intel Core i7-4790 CPU. The labelme software is available on GitHub: https://github.com/wkentaro/labelme

Install R packages

The further-analysis workflow below uses the Seurat and ggplot2 R packages; install them before running the Further analysis section.

Quick start

Input

wget https://zenodo.org/record/7244758/files/stMVC_test_data.zip

unzip stMVC_test_data.zip

Note: The folder named 'DLPFC_151673' contains the raw data of slice 151673.

Run

Step 1. Preprocess raw data

This function automatically (1) learns 50-dimensional features from the 2,000 highly variable genes of the gene expression data, (2) trains the SimCLR model (500 iterations) with data augmentation and contrastive learning and extracts 2048-dimensional visual features from the histological data, and (3) saves the physical location of each spot to a file 'Spot_location.csv' in a folder named 'spatial' in the current directory.

python Preprcessing_stMVC.py --basePath ./stMVC_test_data/DLPFC_151673/ 
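For intuition, sub-step (1) can be approximated with scanpy. This is an assumed sketch, not the actual preprocessing code: PCA stands in for the script's autoencoder-style feature learning, and the reader path reuses the test-data folder from the command above.

import scanpy as sc

adata = sc.read_visium("./stMVC_test_data/DLPFC_151673/")  # expects a 10x Visium folder layout
adata.var_names_make_unique()
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)       # 2,000 highly variable genes
adata = adata[:, adata.var.highly_variable]
sc.pp.pca(adata, n_comps=50)                               # 50-dimensional features (PCA stand-in)
features = adata.obsm["X_pca"]                             # (n_spots, 50)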

The running time depends mainly on the number of SimCLR training iterations; with the defaults it takes ~3.7 h to generate the files described above. You can reduce the iteration count to shorten this, but to reproduce the reported results you should use the default parameters.

Note: To reduce your waiting time, we have uploaded our preprocessed data into the folder ./stMVC_test_data/DLPFC_151673/stMVC/, so you can skip directly to step 3.

Step 2. Manual cell segmentation (for IDC dataset)

This function assigns a class to each spot based on our manual cell segmentation drawn with the labelme software, and saves the cell segmentation file (Image_cell_segmentation_0.5.csv) into the 'image_segmentation' directory. It takes ~35 min.

python Image_cell_segmentation.py --basePath ./stMVC_test_data/IDC/ --jsonFile tissue_hires_image.json
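Conceptually, this step tests each spot's pixel coordinates against the polygons drawn in labelme. A minimal sketch of that point-in-polygon logic follows; the spot coordinates here are stand-ins (in the pipeline they come from Spot_location.csv, scaled to the hires image):

import json
import numpy as np
from matplotlib.path import Path

with open("./stMVC_test_data/IDC/tissue_hires_image.json") as f:
    annot = json.load(f)          # labelme JSON: "shapes" is a list of labeled polygons

# Stand-in spot pixel coordinates for illustration.
spots = np.random.rand(100, 2) * 2000

labels = np.full(len(spots), "unassigned", dtype=object)
for shape in annot["shapes"]:     # each shape: {"label": ..., "points": [[x, y], ...]}
    polygon = Path(shape["points"])
    labels[polygon.contains_points(spots)] = shape["label"]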

Note: To reduce your waiting time, we have uploaded the tissue_hires_image.json file and the processed result from step 1 into the folder named IDC, so you can skip directly to step 3.

Step 3. Run stMVC model

This function automatically learns robust representations by multi-view graph collaborative learning. It takes ~7 min for DLPFC_151673 and ~9 min for IDC.

python stMVC_model.py --basePath ./stMVC_test_data/DLPFC_151673/ --fusion_type Attention
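The two graphs that stMVC collaborates over can be pictured as k-nearest-neighbor graphs: HSG over the SimCLR visual features and SLG over the spot coordinates. A minimal sketch with stand-in data and an assumed k (the script's actual graph construction may differ):

import numpy as np
from sklearn.neighbors import kneighbors_graph

visual = np.random.rand(100, 2048)   # stand-in for the 2048-dim SimCLR visual features
coords = np.random.rand(100, 2)      # stand-in for the Spot_location.csv coordinates

hsg = kneighbors_graph(visual, n_neighbors=6, mode="connectivity")  # histological similarity graph
slg = kneighbors_graph(coords, n_neighbors=6, mode="connectivity")  # spatial location graph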

Several runtime parameters can be adjusted on the command line (e.g., --fusion_type above); to reproduce the reported results, use the default parameters.

Output

The output file GAT_2-view_robust_representation.csv, saved in the 'stMVC' folder, will be used for further analysis.
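If you prefer Python to the R workflow below, the representation file can be loaded and clustered directly. This sketch uses KMeans with an assumed cluster count of 7 (the value passed to the R call below), whereas the R workflow uses Seurat's graph-based clustering:

import pandas as pd
from sklearn.cluster import KMeans

rep = pd.read_csv("./stMVC_test_data/DLPFC_151673/stMVC/GAT_2-view_robust_representation.csv",
                  index_col=0)                  # spots x representation dimensions
domains = KMeans(n_clusters=7, random_state=0).fit_predict(rep.values)  # 7 is an assumed count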

Further analysis

The functions in the R file Postprocessing.R (in the stMVC folder) take the file GAT_2-view_robust_representation.csv as input for further analysis.

# Generate a PDF file with clustering and visualization
library('Seurat')
library('ggplot2')
source("./stMVC/Postprocessing.R")
basePath       = "./stMVC_test_data/DLPFC_151673/"
robust_rep     = read.csv( paste0(basePath, "stMVC/GAT_2-view_robust_representation.csv"), header = T, row.names = 1)
Seurat_obj     = Seurat_processing(basePath, robust_rep, 10, 7, basePath, "stMVC/stMVC_clustering.pdf" )
#data denoising based on 15 nearest neighboring spots
input_features = as.matrix(robust_rep[match(colnames(Seurat_obj), row.names(robust_rep)),])
Seurat_obj     = FindVariableFeatures(Seurat_obj, nfeatures=2000)
hvg            = VariableFeatures(Seurat_obj)
rna_data       = as.matrix(Seurat_obj@assays$Spatial@counts)
hvg_data       = rna_data[match(hvg, row.names(rna_data)), ]

mat_smooth     = knn_smoothing( hvg_data, 15, input_features )
colnames(mat_smooth) = colnames(Seurat_obj)

#find spatially variable genes
Seurat_smooth         = CreateSeuratObject(counts=mat_smooth, assay='Spatial')
Idents(Seurat_smooth) = Idents(Seurat_obj)

Seurat_smooth = SCTransform(Seurat_smooth, assay = "Spatial", verbose = FALSE)
top_markers   = FindAllMarkers(Seurat_smooth, assay='SCT', slot='data', only.pos=TRUE) 
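For reference, the knn_smoothing call above can be mirrored in Python: average each spot's expression over its k = 15 nearest neighbors in the robust-representation space. This is the assumed behavior of the author's R function (a plain mean, with the spot itself among its own neighbors):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_smooth(counts, k, features):
    # counts: (genes, spots) expression matrix; features: (spots, dim) robust representations
    nn = NearestNeighbors(n_neighbors=k).fit(features)
    _, idx = nn.kneighbors(features)   # idx: (spots, k) neighbor indices, self included
    return np.stack([counts[:, i].mean(axis=1) for i in idx], axis=1)  # (genes, spots)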


Citation

Chunman Zuo, Yijian Zhang, Chen Cao, Jinwang Feng, Mingqi Jiao, and Luonan Chen. Elucidating tumor heterogeneity from spatially resolved transcriptomics data by multi-view graph collaborative learning. Nature Communications. 2022.