gevaertlab / sequoia-pub

SEQUOIA: Digital profiling of cancer transcriptomes with linearized attention
https://sequoia.stanford.edu
MIT License

:evergreen_tree: SEQUOIA: Digital profiling of cancer transcriptomes with linearized attention

Abstract

Cancer is a heterogeneous disease requiring costly genetic profiling for better understanding and management. Recent advances in deep learning have enabled cost-effective predictions of genetic alterations from whole slide images (WSIs). While transformers have driven significant progress in non-medical domains, their application to WSIs lags behind due to high model complexity and limited dataset sizes. Here, we introduce SEQUOIA, a linearized transformer model that predicts cancer transcriptomic profiles from WSIs. SEQUOIA is developed using 7,584 tumor samples across 16 cancer types, with its generalization capacity validated on two independent cohorts comprising 1,368 tumors. Accurately predicted genes are associated with key cancer processes, including inflammatory response, cell cycles and metabolism. Further, we demonstrate the value of SEQUOIA in stratifying the risk of breast cancer recurrence and in resolving spatial gene expression at loco-regional levels. SEQUOIA hence deciphers clinically relevant information from WSIs, opening avenues for personalized cancer management.

Overview

Folder structure

System requirements

Software dependencies and versions are listed in requirements.txt

Installation

First, clone this git repository: git clone https://github.com/gevaertlab/sequoia-pub.git

Then, create a conda environment: conda create -n sequoia python=3.9 and activate: conda activate sequoia

Install the openslide library: conda install -c conda-forge openslide==4.0.0

Install the required package dependencies: pip install -r requirements.txt

Finally, make sure that OpenSlide (>v3.4.0) is installed on your system.

Expected installation time in a normal Linux environment: 15 minutes
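As a quick sanity check after installation, the sketch below reports which OpenSlide library version the Python binding can see. It assumes the `openslide` Python package (openslide-python) is the binding pulled in by the steps above; the helper name `openslide_library_version` is hypothetical and not part of this repository.

```python
import importlib.util
from typing import Optional


def openslide_library_version() -> Optional[str]:
    """Return the OpenSlide C library version, or None if it cannot be loaded."""
    # Assumption: the Python binding is importable under the name "openslide".
    if importlib.util.find_spec("openslide") is None:
        return None  # the Python binding is not installed in this environment
    try:
        import openslide
    except (ImportError, OSError):
        return None  # binding found, but the shared library failed to load
    return getattr(openslide, "__library_version__", None)


if __name__ == "__main__":
    version = openslide_library_version()
    if version:
        print(f"OpenSlide library version: {version}")
    else:
        print("OpenSlide not found; re-check the installation steps")
```

If the function returns None, the conda or system installation of OpenSlide is probably not visible from the active environment.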

Pre-processing

Scripts for pre-processing are located in the pre-processing folder. All computational processes require a reference.csv file, which has one row per WSI together with its corresponding gene expression values. The RNA columns are named with the following format: 'rna_{GENENAME}'. An optional 'tcga_project' column indicates the TCGA project that the data belongs to. See examples/ref_file.csv for an example.
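To make the expected layout concrete, here is a minimal sketch that checks a reference file for 'rna_{GENENAME}' columns and the optional 'tcga_project' column. The slide identifier column name (`wsi_file_name`), the helper `summarize_reference`, and all values in the toy table are illustrative assumptions; consult examples/ref_file.csv for the real layout.

```python
import csv
import io

RNA_PREFIX = "rna_"


def summarize_reference(handle):
    """Summarize a reference CSV: slide count, gene names, optional columns."""
    reader = csv.DictReader(handle)
    columns = reader.fieldnames or []
    # Gene columns follow the 'rna_{GENENAME}' naming convention.
    genes = [c[len(RNA_PREFIX):] for c in columns if c.startswith(RNA_PREFIX)]
    if not genes:
        raise ValueError("no 'rna_{GENENAME}' columns found")
    rows = list(reader)
    for row in rows:
        for gene in genes:
            float(row[RNA_PREFIX + gene])  # raises if a value is not numeric
    return {
        "n_slides": len(rows),  # one row per WSI
        "genes": genes,
        "has_tcga_project": "tcga_project" in columns,
    }


# Toy table with made-up values; 'wsi_file_name' is an assumed column name.
example = io.StringIO(
    "wsi_file_name,tcga_project,rna_TP53,rna_BRCA1\n"
    "slide_001,TCGA-BRCA,5.21,3.07\n"
    "slide_002,TCGA-BRCA,4.88,2.95\n"
)
summary = summarize_reference(example)
print(summary)
```

Running a check like this before patch extraction can catch misnamed gene columns early, since every later step reads the same file.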

Step 1: Patch extraction

To extract patches from whole-slide images (WSIs), please use the script patch_gen_hdf5.py. An example script to run the patch extraction: scripts/extract_patch.sh

Note: the --start and --end parameters indicate the range of rows (WSIs) in the reference.csv file to process. This makes it possible to run the script in parallel over disjoint row ranges.
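One way to plan a parallel run is to split the rows of reference.csv into contiguous ranges, one per job. The sketch below assumes 0-based row indices and an exclusive --end value; neither is confirmed here, so check patch_gen_hdf5.py before relying on it. The helper `chunk_ranges` is hypothetical.

```python
def chunk_ranges(n_rows, n_jobs):
    """Split rows [0, n_rows) into at most n_jobs contiguous (start, end) ranges.

    Assumption: `end` is exclusive, so consecutive ranges share a boundary.
    """
    if n_rows < 0 or n_jobs < 1:
        raise ValueError("n_rows must be >= 0 and n_jobs must be >= 1")
    base, extra = divmod(n_rows, n_jobs)
    ranges = []
    start = 0
    for job in range(n_jobs):
        # The first `extra` jobs take one additional row each.
        size = base + (1 if job < extra else 0)
        if size == 0:
            break  # more jobs than rows; the remaining jobs get nothing
        ranges.append((start, start + size))
        start += size
    return ranges


# Example: 10 slides spread over 3 jobs.
for start, end in chunk_ranges(10, 3):
    print(f"--start {start} --end {end}")
```

Each printed pair can then be passed to a separate invocation of the extraction script, for example one per cluster job.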

Step 2: Obtain resnet/uni features

To obtain resnet/uni features from patches, please use the script compute_features_hdf5.py. The script converts each patch into a linear feature vector.

Note: if you use the UNI model, you need to follow the installation procedure in the original UNI GitHub repository and install its required packages.

An example script to run the feature extraction: scripts/extract_resnet_features.sh

Step 3: Obtain k-Means features

Once the resnet/uni features have been obtained, the next step is to compute the 100 clusters used as input for the model. The clusters are computed independently for each slide, so this step is straightforward and fast.

An example script to run the k-means clustering: scripts/extract_kmean_features.sh
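The idea of the per-slide clustering can be illustrated with a small, dependency-free k-means (Lloyd's algorithm). This is a sketch, not the repository's implementation: it assumes each cluster is summarized by the mean of its member feature vectors, and the function name, iteration count, and seed are hypothetical. In practice a library implementation and k = 100 would be used.

```python
import random


def _sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def kmeans_cluster_means(features, k, n_iter=20, seed=0):
    """Cluster one slide's patch features and return up to k cluster means."""
    k = min(k, len(features))  # a slide may have fewer patches than clusters
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(list(features), k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for v in features:
            # Assign each patch feature to its nearest centroid.
            nearest = min(range(k), key=lambda j: _sq_dist(v, centroids[j]))
            clusters[nearest].append(v)
        # Recompute centroids; an empty cluster keeps its previous centroid.
        new_centroids = [
            _mean(members) if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids


# Toy slide: six 2-D "patch features" forming two well-separated groups.
toy_features = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
cluster_means = sorted(kmeans_cluster_means(toy_features, k=2))
print(cluster_means)
```

Because every slide is clustered on its own, slides can be processed in any order or in parallel.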

Expected run time: depends on the hardware (CPU/GPU) and the number of slides

Pre-training and fine-tuning

Step 4 (Optional): pretrain models on the GTEx data

To pretrain the weights of the model on normal tissues, please use the script pretrain_gtex.py. The process requires an input reference.csv file, indicating the gene expression values for each WSI. See examples/ref_file.csv for an example.

Step 5: Train or fine-tune SEQUOIA on the TCGA data

Now we can train the model from scratch or fine-tune it on the TCGA data. Here is an example bash script to run the process: scripts/run_train.sh

The parameters are explained within the main.py file.

Some points that we want to emphasize:

Benchmarking

For running the benchmarked variations of the architecture:

Evaluation

Pearson correlation and RMSE values are calculated to compare the predicted gene expression values to the ground truth. The significantly well-predicted genes are selected using the correlation coefficient, p-value, and RMSE, and by statistical comparison to an untrained model with the same architecture.
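For reference, the two per-gene metrics can be written out in a few lines of plain Python. This is a sketch of the standard definitions only; it does not reproduce the p-value computation or the comparison against an untrained model, and the function names and toy values are illustrative rather than taken from evaluation/evaluate_model.py.

```python
import math


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(y_true)
    if n != len(y_pred) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    if var_t == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_t * var_p)


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty sequences of equal length")
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


# Toy values for one gene across four samples (made up for illustration).
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]
r_value = pearson_r(measured, predicted)
error = rmse(measured, predicted)
print(f"r = {r_value:.3f}, RMSE = {error:.3f}")
```

The toy example shows why both metrics are reported: a constant offset leaves the correlation perfect while the RMSE still registers the error.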

Evaluation script: evaluation/evaluate_model.py. Output: three dataframes. all_genes.csv contains evaluation metrics for all genes, sig_genes.csv contains metrics for only the significant genes, and num_sig_genes.csv contains the number of significant genes per cancer type with this model.

Spatial gene expression predictions

Scripts for predicting spatial gene expression levels within the same tissue slide are wrapped in: spatial_vis

License

© Gevaert's Lab MIT License