P. Thangarajah, G. Hothi, K. Korolek*, A. Singh
Digitized Hematoxylin & Eosin (H&E) images are used to predict tumour genotypes, mutations, gene expression and clinical phenotypes. However, H&E staining can only identify larger structures (e.g., cell nuclei), and the preparation and digitization of histological samples can introduce errors such as colour variation, which produce artifacts and inconsistencies in histological image analysis systems. To address this, recent technology such as the GeoMx Digital Spatial Profiler (DSP) allows up to four morphological markers to be stained concurrently, followed by the measurement of up to 18,000 genes in selected regions of interest (ROIs). The GeoMx DSP does this by imaging or staining whole tissue sections for RNA or protein, then counting gene expression levels using an nCounter analysis system. However, GeoMx DSP requires the user to select ROIs manually, which makes it difficult to discriminate disease conditions and impossible to spatially resolve gene expression across entire tissue slides. To resolve these difficulties, we aim to train convolutional neural networks (CNNs), a form of deep learning, on tissue imaging data to predict diabetic kidney disease (DKD) and spatial gene expression, using data from the GeoMx DSP Spatial Organ Atlas (a web resource from Nanostring Technologies that provides whole-transcriptome data sets of organ tissues with spatial context).
Whole-slide images of kidney samples from 3 diabetic and 3 healthy patients were obtained from the Nanostring database. Each image was split into thousands of smaller images (patches) of 256x256 pixels. These patches were then transformed into 1024-dimensional vectors using ResNet50, a pre-trained computer vision model that classifies images into 1000 categories. After the patches were created, the data and corresponding labels were split into training, validation, and test sets. ResNet50 will be fine-tuned to our problem by training an additional neural network layer to distinguish DKD from healthy controls (ResNet50-binary). The ResNet50-binary model will be applied to the test data and evaluated using the F1 score, the harmonic mean of precision (the fraction of correctly classified positive samples among all samples predicted positive) and recall (the fraction of correctly classified positive samples among all truly positive samples, i.e., correctly classified positives plus positives incorrectly classified as negative). To predict spatial gene expression, we will use images of the 231 pre-selected ROIs and their corresponding whole-genome expression. As with ResNet50-binary, we will fine-tune the ResNet50 model by adding a layer that predicts gene expression (ResNet50-continuous). The ResNet50-continuous model will be applied to the test data and evaluated using the mean squared error.
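The patch grid and the two evaluation metrics described above can be sketched in plain Python. This is a minimal illustration, not the project's code: the function names are hypothetical, labels are assumed to be 1 for DKD and 0 for healthy, and partial patches at the right and bottom edges are assumed to be dropped.

```python
from typing import List, Sequence, Tuple


def patch_origins(width: int, height: int, size: int = 256) -> List[Tuple[int, int]]:
    """Top-left (x, y) corners of non-overlapping size x size patches.

    Assumption: partial patches at the right/bottom edges are discarded.
    """
    return [
        (x, y)
        for y in range(0, height - size + 1, size)
        for x in range(0, width - size + 1, size)
    ]


def precision_recall_f1(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float, float]:
    """Precision, recall and F1 for binary labels (assumed 1 = DKD, 0 = healthy)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average squared difference between measured and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
```

For example, `patch_origins(1024, 768)` yields a 4 x 3 grid of patch corners, and `precision_recall_f1` can be applied to per-patch predictions on the test set.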
Results: We expect our trained CNN to discriminate between healthy kidney and DKD tissue patches, as structural changes such as thickening of tubular basement membranes and interstitial widening are visible in the DKD samples. Studies using H&E images have successfully predicted gene expression; we therefore expect similar outcomes using the higher-resolution images from GeoMx. For the prediction of spatial gene expression, we expect predictions to correlate more strongly with measured expression for cell-specific genes than for other genes, because the imaged slides are based on morphology markers (tissue sections are stained with dyes that highlight different structures). For example, we expect our models to predict the expression of CD45 better than that of other markers if the imaged tissue slides were stained for CD45.
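One way to compare genes in this sense is to compute, for each gene, the Pearson correlation between measured and predicted expression across ROIs. The sketch below assumes Pearson correlation is the intended measure (the text does not specify one); the function names and the dictionary layout (gene name mapped to one value per ROI) are hypothetical.

```python
import math
from typing import Dict, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant sequence
    return cov / math.sqrt(var_x * var_y)


def per_gene_correlation(
    measured: Dict[str, Sequence[float]],
    predicted: Dict[str, Sequence[float]],
) -> Dict[str, float]:
    """Correlation per gene, for genes present in both inputs.

    Assumed layout: gene name -> expression values, one per ROI, same ROI order.
    """
    return {
        gene: pearson(values, predicted[gene])
        for gene, values in measured.items()
        if gene in predicted
    }
```

Sorting the resulting dictionary by value would show whether marker-related genes such as CD45 rank above other genes.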
Open-source deep learning models effectively classify DKD from healthy tissue patches and predict spatial gene expression data. The modified models could be used alongside the GeoMx DSP to help pathologists make clinical diagnoses and prognoses.
module load git
git --version
git clone https://github.com/CompBio-Lab/geomx2rna.git
cd geomx2rna/
module load git
export ALLOC=st-allocation-code
mkdir /arc/project/$ALLOC/$USER/
cd /arc/project/$ALLOC/$USER/
git clone https://github.com/CompBio-Lab/geomx2rna.git
cd geomx2rna/
mkdir /scratch/$ALLOC/$USER
cd /scratch/$ALLOC/$USER
git clone https://github.com/CompBio-Lab/geomx2rna.git
cd geomx2rna/
* $ALLOC: Sockeye allocation code
* $USER: UBC Campus wide login (should be already set)
## Jupyter notebook singularity image setup on HPC (UBC ARC Sockeye)
Convert the following set of steps into a reproducible workflow:
1. move to project location
cd /arc/project/$ALLOC/$USER/geomx2rna/
2. pull jupyter notebook by running jupyter_singularity.sh
sh jupyter_singularity.sh
3. add packages to singularity image (run the following line by line)
module load gcc singularity
singularity shell --home /scratch/$ALLOC/$USER/geomx2rna/ --env XDG_CACHE_HOME=/scratch/$ALLOC/$USER/geomx2rna/ /arc/project/$ALLOC/$USER/geomx2rna/jupyter-datascience.sif
conda create --prefix /arc/project/$ALLOC/$USER/geomx2rna/myenv python=3.7
conda install -y ipykernel --prefix /arc/project/$ALLOC/$USER/geomx2rna/myenv
conda install -y pytorch torchvision torchaudio captum cudatoolkit=10.2 -c pytorch --prefix /arc/project/$ALLOC/$USER/geomx2rna/myenv
conda run --prefix /arc/project/$ALLOC/$USER/geomx2rna/myenv python -m ipykernel install --user --name myenv
exit
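As a small step toward a reproducible workflow, the conda commands above can be generated from the allocation code and user name instead of being typed by hand. The sketch below only builds and prints the command strings (a dry run); it does not start the Singularity shell or run conda. The function name `setup_commands` and the fallback values in the usage section are hypothetical.

```python
import os
from typing import List


def setup_commands(alloc: str, user: str, env_name: str = "myenv") -> List[str]:
    """Return the conda commands from the steps above with paths filled in."""
    project = f"/arc/project/{alloc}/{user}/geomx2rna"
    prefix = f"{project}/{env_name}"
    return [
        f"conda create --prefix {prefix} python=3.7",
        f"conda install -y ipykernel --prefix {prefix}",
        "conda install -y pytorch torchvision torchaudio captum "
        f"cudatoolkit=10.2 -c pytorch --prefix {prefix}",
        f"conda run --prefix {prefix} python -m ipykernel install "
        f"--user --name {env_name}",
    ]


if __name__ == "__main__":
    # Fallback values are placeholders, not real credentials.
    alloc = os.environ.get("ALLOC", "st-allocation-code")
    user = os.environ.get("USER", "your-cwl")
    for command in setup_commands(alloc, user):
        print(command)
```

The printed lines can be pasted into the Singularity shell one at a time, which keeps the paths consistent between runs.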
## Install additional packages to environment (e.g. matplotlib)
cd /arc/project/$ALLOC/$USER/geomx2rna/
source activate myenv/
conda install -c conda-forge matplotlib
pip install torch==1.9.0 torchvision==0.10.0 --extra-index-url https://download.pytorch.org/whl/cu102
conda install captum -c pytorch
conda deactivate
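After installing, it can help to confirm that the packages are importable from the environment's Python. The following is a minimal check using only the standard library; the function name `installed` is hypothetical, and it tests top-level module names only.

```python
import importlib.util
from typing import Dict, Iterable


def installed(packages: Iterable[str]) -> Dict[str, bool]:
    """Map each top-level module name to whether Python can locate it."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in packages
    }


if __name__ == "__main__":
    # Module names taken from the install commands above.
    for name, ok in installed(["torch", "torchvision", "captum", "matplotlib"]).items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

Run it with the environment activated (before `conda deactivate`); a `MISSING` entry suggests the corresponding install step did not complete in this environment.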
## Run jupyter notebook
1. Create a job directory in /scratch for your personal Jupyter Notebooks job(s)
cd /scratch/$ALLOC/$USER/geomx2rna/
2. create job script using template: jupyter-datascience.pbs
cp jupyter-datascience.pbs geomx_job.pbs
vi geomx_job.pbs
3. hit 'i' then enter the following:
* update allocation code (st-allocation-code) and email (your_email@domain) in PBS header
* check if data path exists
4. Submit the job script
qsub geomx_job.pbs
- this creates a connection file
5. open another terminal window (since the other one is logged into sockeye)
- copy the ssh instructions from the freshly produced connection file
example:
ssh -N -L 8888:${HOSTNAME}:${PORT} ${USER}@sockeye.arc.ubc.ca
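In the example above, `-N` tells ssh not to run a remote command and `-L` forwards a local port to the compute node. If the host name and port are read from the connection file by hand, a small helper can assemble the command consistently. This is a sketch: the function name `tunnel_command` is hypothetical, and the local port 8888 and login host are taken from the example line.

```python
import shlex
from typing import List


def tunnel_command(
    hostname: str,
    port: int,
    user: str,
    local_port: int = 8888,
    login_host: str = "sockeye.arc.ubc.ca",
) -> List[str]:
    """Build the ssh port-forwarding command as an argument list."""
    for value in (port, local_port):
        if not 0 < value < 65536:
            raise ValueError(f"invalid TCP port: {value}")
    return [
        "ssh",
        "-N",  # no remote command, forwarding only
        "-L",  # local_port -> hostname:port via the login host
        f"{local_port}:{hostname}:{port}",
        f"{user}@{login_host}",
    ]


if __name__ == "__main__":
    # "se123", 5001 and "jdoe" are made-up values for illustration.
    command = tunnel_command("se123", 5001, "jdoe")
    print(" ".join(shlex.quote(part) for part in command))
```

The argument list can also be passed directly to `subprocess.run`, which avoids shell quoting issues.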
## Bugs and feature requests
Have a bug or a feature request? Please add your request here: https://github.com/CompBio-Lab/geomx2rna/issues
## Contributing
Please feel free to make a pull request if you would like to modify anything.
## Copyright and license
Copyright 2021 CompBio-Lab Inc.
Code released under the [MIT license](https://github.com/CompBio-Lab/geomx2rna/blob/main/LICENSE).