GlastonburyC / RNAPath

Self-supervised representation learning combining GTEx histology, RNA-seq and WGS
GNU General Public License v3.0
21 stars 4 forks source link

Self-supervised learning for characterising histomorphological diversity and spatial RNA expression prediction across 23 human tissue types

Francesco Cisternino, Sara Ometto, Soumick Chatterjee, Edoardo Giacopuzzi, Adam P. Levine, Craig A. Glastonbury

Published in Nature Communications.

DOI

WSI Preprocessing

1. Segmentation

Segmentation allows us to separate the tissue from background in WSIs. The output are binary masks (stored as .jpeg).

python preprocessing/segmentation_patching/segmentation.py

image

2. Tiling

The tissue region of WSI, identified by segmentation, is divided into small squared tiles (or patches) (e.g. 128x128); this allows us to process the WSI using GPUs and to obtain local (tile-level) results.

cd ./preprocessing/segmentation_patching
python tiling.py

image

3. Feature extraction

Tile images are turned into features vectors capturing their morphological content. To do this, we use a vision transformer (ViT-S) trained on 1.7 M histology patches using a self-supervised approach.

cd ./preprocessing/features_extraction
python extract_features.py

RNAPath

image

1. Training

RNAPath training requires patch features to represent WSIs, train/validation/test splits, a .txt indicating the list of genes to be profiled (example in ./resources/gene_set_example.txt) and a csv file with the genes TPMs. In this study, 4 different partially overlapping representations of the same slide are computed: to the original patch set, 3 other sets have been added by shifting the original one by 32x32, 64x64 and 96x96 pixels. During each training interation, a single representation (patch set) is randomly selected. In case of a different representation, the script ./datasets/dataset_generic.py should be modified. The RNASeq dataframe (rnaseq_complete.csv), available in the supplementary data (at the bottom of this README.md), should be placed in ./resources.

The training script requires some arguments to be set:

python train.py --exp_code test_0 --tissue_code HEA --data_root_dir /path/to/features/dir/

During training, training and validation loss values will be logged and a results folder will be created (inside results_dir) and named as the experiment code; in this folder, the gene-level r-scores for both validation and test set and the weights checkpoint file will be stored.

2. Inference and visualization

2.1 Inference

At inference, trained models are used to infer patch-level expression. Patch logits are stored as .pt files and can be used to plot heatmaps of the genes of interest. The inference scripts requires the following arguments:

2.2 Visualisation

The predicted localization of gene activity can be visually represented by plotting patch logits over the histology sample. Heatmaps are stored in the .jpeg format. The following arguments are needed:

python heatmaps.py --gene_name CD19 --slide_id SLIDE_ID --tissue_name EsophagusMucosa --tissue_code EMUC --save_dir /path/to/save/dir --features_dir /path/to/features/dir --patch_logits_dir /path/to/patch_logits/dir/ --results_dir /path/to/results/dir/ --slides_dir /path/to/wsi/dir/ --multiple_patch_sets

image

Tissue multiclass segmentation by tiles clustering

The following scripts expect the following file organization for the WSI; please, apply changes to the code in case of a different structure.

SLIDES_DIRECTORY/
    ├── Tissue1
    │   ├── slide_1.svs
    │   ├── slide_2.svs
    │   └── ...
    ├── Tissue2
    │   ├── slide_3.svs
    │   ├── slide_4.svs
    │   └── ...

To segment tissues into substructures or localised pathologies by patch-level classification using a k-Nearest Neighbors model, two steps are required:

  1. Definition of instances and labels for the k-NN; instances are patch-level features, while classes are defined in a yaml file. The instances used to fit the k-NN have been hand-labelled. The following script creates a .h5 file containing patch features and their corresponding labels; this file will be then used to fit the k-NN model for tiles classification.
    python tiles_classification/define_clusters_kNN.py --tissue_name Heart --checkpoint_path /path/to/features_extraction/checkpoint.pth
  2. Multi-class segmentation of H&E tissue samples by tiles classification. The script loads the previously defined h5 file, fits a k-NN model using the features and labels stored in the .h5 and classifies all the patches of the WSI. Segmentation masks and csv files containing the class of each patch (identified by the upper left corner coordinates) are output. As arguments, the tissue name, the output directory, the patch features main directory and the slides directory are required.
    cd ./tiles_classification
    python multiclass_tissue_segmentation.py --tissue_name Heart --output_dir /path/to/output/dir/ --features_dir /path/to/features/dir/ --slides_dir /path/to/slides/dir/

    A script for fine-grained segmentation is also provided; in this case, the 4 partially overlapping patch sets are used for the same slide and, in the regions where multiple patch sets overlap, classes are assigned by majority voting.

cd ./tiles_classification
python fine_grained_multiclass_tissue_segmentation.py --tissue_name Heart --output_dir /path/to/output/dir/ --features_dir /path/to/features/dir/ --slide_name /slide/name

image

Image derived phenotypes

Image phenotypes (e.g. amount of mucosa in colon samples, amount of adipose, etc.) are derived using the patch classes output by the multiclass tissue segmentation script; indeed, these phenotypes reflect the relative amount of each target class in a sample. The following script can be used to compute such phenotypes as proportions (with respect to the sample size). This will make the compositional phenotypes comparable across samples.

cd ./image_derived_phenotypes
python compute_IDPs.py --tissue_name EsophagusMucosa --segmentation_dir /path/to/segmentation/dir/ --output_dir /path/to/idps/dir/

The script outputs a csv file for each tissue, in the following format:

Slide ID IDP_0 IDP_1 IDP_2 IDP_3 IDP_4 IDP_5 IDP_6
SLIDE_001 20.9% 25.1% 44.3% 1.0% 2.5% 1.2% 4.9%
SLIDE_002 39.6% 23.1% 33.0% 1.5% 1.2% 2.8% 3.5%

Compositional phenotypes are easy to interpretate, but they are not the appropriate choice for statistical analysis, given their compositional nature. For this reason, they can be turned into pivot coordinates (a special case of a isometric logratio transformation):

cd ./image_derived_phenotypes
python compute_pivot_coordinates.py --tissue_name EsophagusMucosa --idps_dir /path/to/idps/dir/

SSES - Substructure-Specific Enrichment Analysis

SSES combines results from RNAPath (local RNASeq prediction) with tileclassification into tissue substructures or localised pathologies. The output of this analyis is a set of enrichment scores, one per each couple (gene, substructure) indicating the ratio between the predicted expression inside the substructure/pathology and the bulk (sample-level) predicted expression; the bigger this value, the more the expression of the gene will be focused in that substructure/pathology. The following arguments are required

python enrichment.py --tissue_name EsophagusMucosa --tissue_code EMUC --patch_logits_dir /path/to/patch/logits --segmentation_dir /path/to/segmentation/results/ --features_dir /path/to/features

The script outputs a csv file with as many row as the number of genes and as many columns as the number of identified classes (substructures/pathologies).

Differential expression analysis

Differential expression analysis of image derived phenotypes has been performed by fitting linear models. To run the analysis, some parameters are required:

cd ./differential_expression_analysis
python differential_expression_analysis_IDPs.py --tissue_name EsophagusMucosa --gtex_normalized_expression_bed_file /path/to/gtex/expression/bed --gtex_subject_phenotypes_file /path/to/subject/phenotypes --gtex_covariates_file /path/to/gtex/covariates --idps_format pivot

GWAS

The genome-wide association analysis was conducted using nf-pipeline-regenie (v1.8.1). A config file defining the computational environment and a config file for the project (example in ./gwas/gtex.conf) are needed to launch the pipeline. A complete description on how to use it and template files can be found in the linked repository.

Regional plots, Manhattan plots and quantile-quantile plots were generated with GWASLab (v3.4.21).

image

Interaction eQTLs

Interaction eQTLs have been analyzed using tensorqtl, a gpu-enabled QTL mapper; the required input are the genoptypes, the gene expression, the interaction term (in this case, the image derived phenotypes, one per donor) and the coviariates.

image

Supplementary Material and Data

Data Link
Features Extraction DINO checkpoint (.pt)
RNAPath RNASeq dataframe
RNAPath Results and checkpoints
RNAPath Tissue code mapping
Image Derived Phenotypes Derived substructures
Image Derived Phenotypes Differential expression analysis summary stats
Image Derived Phenotypes Differential expression analysis - Genes enrichment
Image Derived Phenotypes GWAS summary stats
Image Derived Phenotypes ieQTLs summary stats
RNAPath - Image Derived Phenotypes SSES enrichments
Labelled tiles (annotations) annotations

raw eQTL summary statistics can be provided on request (craig.glastonbury[at]fht.org) (300GB).