Francesco Cisternino, Sara Ometto, Soumick Chatterjee, Edoardo Giacopuzzi, Adam P. Levine, Craig A. Glastonbury
Published in Nature Communications.
Segmentation allows us to separate the tissue from background in WSIs. The output are binary masks (stored as .jpeg
).
python preprocessing/segmentation_patching/segmentation.py
preprocessing/segmentation_patching/config.yaml
The tissue region of WSI, identified by segmentation, is divided into small squared tiles (or patches) (e.g. 128x128); this allows us to process the WSI using GPUs and to obtain local (tile-level) results.
cd ./preprocessing/segmentation_patching
python tiling.py
Tile images are turned into features vectors capturing their morphological content. To do this, we use a vision transformer (ViT-S) trained on 1.7 M histology patches using a self-supervised approach.
cd ./preprocessing/features_extraction
python extract_features.py
preprocessing/features_extraction/config.yaml
.pt
file containing a 2D tensor of shape (K, 384)
, where K is the number of tiles and 384 the number of features..h5
file for each slide containing both the features and the final list of coordinates.
RNAPath training requires patch features to represent WSIs, train/validation/test splits, a .txt
indicating the list of genes to be profiled (example in ./resources/gene_set_example.txt
) and a csv file with the genes TPMs.
In this study, 4 different partially overlapping representations of the same slide are computed: to the original patch set, 3 other sets have been added by shifting the original one by 32x32, 64x64 and 96x96 pixels. During each training interation, a single representation (patch set) is randomly selected. In case of a different representation, the script ./datasets/dataset_generic.py
should be modified.
The RNASeq dataframe (rnaseq_complete.csv
), available in the supplementary data (at the bottom of this README.md
), should be placed in ./resources
.
The training script requires some arguments to be set:
python train.py --exp_code test_0 --tissue_code HEA --data_root_dir /path/to/features/dir/
During training, training and validation loss values will be logged and a results folder will be created (inside results_dir
) and named as the experiment code; in this folder, the gene-level r-scores for both validation and test set and the weights checkpoint file will be stored.
At inference, trained models are used to infer patch-level expression. Patch logits are stored as .pt
files and can be used to plot heatmaps of the genes of interest.
The inference scripts requires the following arguments:
python inference.py --tissue_name Heart --tissue_code HEA --features_dir /path/to/features/dir/ --output_dir /path/to/patch_logits/dir/ --results_dir /path/to/results/dir/ --ckpt_path /path/to/rnapath/checkpoint.pt --multiple_patch_sets
The predicted localization of gene activity can be visually represented by plotting patch logits over the histology sample. Heatmaps are stored in the .jpeg
format.
The following arguments are needed:
python heatmaps.py --gene_name CD19 --slide_id SLIDE_ID --tissue_name EsophagusMucosa --tissue_code EMUC --save_dir /path/to/save/dir --features_dir /path/to/features/dir --patch_logits_dir /path/to/patch_logits/dir/ --results_dir /path/to/results/dir/ --slides_dir /path/to/wsi/dir/ --multiple_patch_sets
The following scripts expect the following file organization for the WSI; please, apply changes to the code in case of a different structure.
SLIDES_DIRECTORY/
├── Tissue1
│ ├── slide_1.svs
│ ├── slide_2.svs
│ └── ...
├── Tissue2
│ ├── slide_3.svs
│ ├── slide_4.svs
│ └── ...
To segment tissues into substructures or localised pathologies by patch-level classification using a k-Nearest Neighbors model, two steps are required:
.h5
file containing patch features and their corresponding labels; this file will be then used to fit the k-NN model for tiles classification.
python tiles_classification/define_clusters_kNN.py --tissue_name Heart --checkpoint_path /path/to/features_extraction/checkpoint.pth
.h5
and classifies all the patches of the WSI. Segmentation masks and csv files containing the class of each patch (identified by the upper left corner coordinates) are output.
As arguments, the tissue name, the output directory, the patch features main directory and the slides directory are required.
cd ./tiles_classification
python multiclass_tissue_segmentation.py --tissue_name Heart --output_dir /path/to/output/dir/ --features_dir /path/to/features/dir/ --slides_dir /path/to/slides/dir/
A script for fine-grained segmentation is also provided; in this case, the 4 partially overlapping patch sets are used for the same slide and, in the regions where multiple patch sets overlap, classes are assigned by majority voting.
cd ./tiles_classification
python fine_grained_multiclass_tissue_segmentation.py --tissue_name Heart --output_dir /path/to/output/dir/ --features_dir /path/to/features/dir/ --slide_name /slide/name
Image phenotypes (e.g. amount of mucosa in colon samples, amount of adipose, etc.) are derived using the patch classes output by the multiclass tissue segmentation script; indeed, these phenotypes reflect the relative amount of each target class in a sample. The following script can be used to compute such phenotypes as proportions (with respect to the sample size). This will make the compositional phenotypes comparable across samples.
cd ./image_derived_phenotypes
python compute_IDPs.py --tissue_name EsophagusMucosa --segmentation_dir /path/to/segmentation/dir/ --output_dir /path/to/idps/dir/
The script outputs a csv file for each tissue, in the following format:
Slide ID | IDP_0 | IDP_1 | IDP_2 | IDP_3 | IDP_4 | IDP_5 | IDP_6 |
---|---|---|---|---|---|---|---|
SLIDE_001 | 20.9% | 25.1% | 44.3% | 1.0% | 2.5% | 1.2% | 4.9% |
SLIDE_002 | 39.6% | 23.1% | 33.0% | 1.5% | 1.2% | 2.8% | 3.5% |
Compositional phenotypes are easy to interpretate, but they are not the appropriate choice for statistical analysis, given their compositional nature. For this reason, they can be turned into pivot coordinates (a special case of a isometric logratio transformation):
cd ./image_derived_phenotypes
python compute_pivot_coordinates.py --tissue_name EsophagusMucosa --idps_dir /path/to/idps/dir/
SSES combines results from RNAPath (local RNASeq prediction) with tileclassification into tissue substructures or localised pathologies. The output of this analyis is a set of enrichment scores, one per each couple (gene, substructure) indicating the ratio between the predicted expression inside the substructure/pathology and the bulk (sample-level) predicted expression; the bigger this value, the more the expression of the gene will be focused in that substructure/pathology. The following arguments are required
python enrichment.py --tissue_name EsophagusMucosa --tissue_code EMUC --patch_logits_dir /path/to/patch/logits --segmentation_dir /path/to/segmentation/results/ --features_dir /path/to/features
The script outputs a csv file with as many row as the number of genes and as many columns as the number of identified classes (substructures/pathologies).
Differential expression analysis of image derived phenotypes has been performed by fitting linear models. To run the analysis, some parameters are required:
cd ./differential_expression_analysis
python differential_expression_analysis_IDPs.py --tissue_name EsophagusMucosa --gtex_normalized_expression_bed_file /path/to/gtex/expression/bed --gtex_subject_phenotypes_file /path/to/subject/phenotypes --gtex_covariates_file /path/to/gtex/covariates --idps_format pivot
The genome-wide association analysis was conducted using nf-pipeline-regenie (v1.8.1). A config file defining the computational environment and a config file for the project (example in ./gwas/gtex.conf) are needed to launch the pipeline. A complete description on how to use it and template files can be found in the linked repository.
Regional plots, Manhattan plots and quantile-quantile plots were generated with GWASLab (v3.4.21).
Interaction eQTLs have been analyzed using tensorqtl, a gpu-enabled QTL mapper; the required input are the genoptypes, the gene expression, the interaction term (in this case, the image derived phenotypes, one per donor) and the coviariates.
Data | Link |
---|---|
Features Extraction | DINO checkpoint (.pt) |
RNAPath | RNASeq dataframe |
RNAPath | Results and checkpoints |
RNAPath | Tissue code mapping |
Image Derived Phenotypes | Derived substructures |
Image Derived Phenotypes | Differential expression analysis summary stats |
Image Derived Phenotypes | Differential expression analysis - Genes enrichment |
Image Derived Phenotypes | GWAS summary stats |
Image Derived Phenotypes | ieQTLs summary stats |
RNAPath - Image Derived Phenotypes | SSES enrichments |
Labelled tiles (annotations) | annotations |
raw eQTL summary statistics can be provided on request (craig.glastonbury[at]fht.org) (300GB).