This repository contains our tools & research for running DeepCell segmentation and QuPath measurements on Google Cloud Batch.
Our results show an overall improvement from ~13 hours to ~10 minutes to segment & measure a cell image. The starting point was running on a laptop or colocated machine; our work runs on GCP Batch with some cloud-focused enhancements.
The workflow operates on one or more input images, each converted to a NumPy pixel array. DeepCell then preprocesses the data (denoising & normalization), runs the segmentation prediction, and postprocesses the predictions into a cell mask. Finally, we load the image and mask into QuPath to compute quantitative metrics (size, channel intensity, etc.) for further analysis. For an example downstream usage, see SpaFlow (cell clustering & quantification).
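As a concrete sketch of the segmentation step, here is roughly what a DeepCell prediction looks like using the `Mesmer` application from the `deepcell` package. The zero-filled input and the `image_mpp` value are illustrative only, not our pipeline's actual values:

```python
# Hedged sketch of a DeepCell prediction; input and parameters are
# illustrative, not our pipeline's actual values.
import numpy as np
from deepcell.applications import Mesmer

app = Mesmer()  # downloads the model weights on first use

# Mesmer expects (batch, H, W, 2): a nuclear channel plus a membrane channel.
image = np.zeros((1, 512, 512, 2), dtype=np.float32)
mask = app.predict(image, image_mpp=0.5, compartment="whole-cell")
print(mask.shape)  # (1, 512, 512, 1): labeled whole-cell mask
```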
Here is the workflow diagram:
You'll need a JSON file available in a cloud bucket that configures the application environment. Create a file like this:
```json
{
  "segment_container_image": "$REPOSITORY/benchmarking:latest",
  "quantify_container_image": "$REPOSITORY/qupath-project-initializer:latest",
  "bigquery_benchmarking_table": "$PROJECT.$DATASET.$TABLE",
  "region": "$REGION",
  "networking_interface": {
    "network": "the_network",
    "subnetwork": "the_subnetwork",
    "no_external_ip_address": true
  },
  "service_account": {
    "email": "the_service@account.com"
  }
}
```
You'll need to replace the variables with values for your environment. The `networking_interface` and `service_account` sections are optional; omit them to use the default settings. For example, using the Docker Hub containers, skipping benchmarking, and using the default networking + service account:
```json
{
  "segment_container_image": "dchaley/deepcell-imaging:latest",
  "quantify_container_image": "dchaley/qupath-project-initializer:latest",
  "region": "us-central1"
}
```
Upload this file somewhere in GCP storage. We put ours in the root of our working bucket. You'll pass this `gs://` URI as a parameter to the scripts.
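For example, with the `google-cloud-storage` Python client (the bucket and object names below are placeholders; `gsutil cp` works just as well):

```python
# Upload the env config to a bucket. Bucket/object names are placeholders.
from google.cloud import storage

client = storage.Client()
blob = client.bucket("my-working-bucket").blob("env-config.json")
blob.upload_from_filename("env-config.json")
print(f"Uploaded to gs://my-working-bucket/{blob.name}")
```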
To run DeepCell on input images and then compute QuPath measurements, use the helper script `scripts/segment-and-measure.py`. There are two ways to run it: (1) on a QuPath workspace, or (2) on explicit paths.
QuPath workspace:
Many QuPath projects are organized something like this:
```
📁 Dataset
 ↳ 📁 OMETIFF
    ↳ 🖼️ SomeTissueSample.ome.tiff
    ↳ 🖼️ AnotherTissueSample.ome.tiff
 ↳ 📁 NPZ_INTERMEDIATE
    ↳ 🔢 SomeTissueSample.npz
    ↳ 🔢 AnotherTissueSample.npz
 ↳ 📁 SEGMASK
    ↳ 🔢 SomeTissueSample_WholeCellMask.tiff
    ↳ 🔢 SomeTissueSample_NucleusMask.tiff
    ↳ 🔢 AnotherTissueSample_WholeCellMask.tiff
    ↳ 🔢 AnotherTissueSample_NucleusMask.tiff
 ↳ 📁 REPORTS
    ↳ 📄 SomeTissueSample_QUANT.tsv
    ↳ 📄 AnotherTissueSample_QUANT.tsv
 ↳ 📁 PROJ
    ↳ 📁 data
       ↳ ...
    ↳ 📄 project.qpproj
```
To generate segmentation masks & quantification reports, run the following command:
```
scripts/segment-and-measure.py \
  --env_config_uri gs://bucket/path/to/env-config.json \
  workspace gs://bucket/path/to/dataset
```
This will enumerate all files in the `OMETIFF` directory that have matching files in `NPZ_INTERMEDIATE`, and run DeepCell segmentation to generate the `SEGMASK` numpy files. Then it will run QuPath measurements to generate the `REPORTS` files.
If your folder structure is different (for example `OME-TIFF` instead of `OMETIFF`), you can use these parameters to specify the workspace subdirectories: `--images_subdir`, `--npz_subdir`, `--segmasks_subdir`, `--project_subdir`, `--reports_subdir`. Put these parameters after the `workspace` command.
Explicit paths.
You can also specify all paths explicitly (the files don't have to be organized in a dataset). To do so, run this command:
```
scripts/segment-and-measure.py \
  --env_config_uri gs://bucket/path/to/env-config.json \
  paths \
  --images_path gs://bucket/path/to/ometiffs \
  --numpy_path gs://bucket/path/to/npzs \
  --segmasks_path gs://bucket/path/to/segmasks \
  --project_path gs://bucket/path/to/project \
  --reports_path gs://bucket/path/to/reports
```
In either case, when you download the QuPath project, you'll need to download the OME-TIFF files as well. When you open the project, QuPath will prompt you to select the base directory containing the OME-TIFFs, and from there it should automatically remap the image paths.
You can use the `--image_filter` parameter to operate on only a subset of the OME-TIFFs. For example:
```
scripts/segment-and-measure.py \
  --env_config_uri gs://.../config.json \
  --image_filter SomeTissue \
  workspace gs://path/to/workspace
```
This will operate on every file whose name begins with the string `SomeTissue`. This would match `SomeTissueSample`, `SomeTissueImage`, etc. Note that this parameter has to come before the `workspace` or `paths` command.
DeepCell does not process TIFF files. The TIFF channels must be extracted into Numpy arrays first.
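A minimal sketch of that conversion, assuming `tifffile` and a channels-first OME-TIFF; the channel indices and the `.npz` key name are assumptions, not our pipeline's actual layout:

```python
# Hypothetical TIFF -> NumPy extraction; channel order and the npz key
# name depend on your data & pipeline.
import numpy as np
import tifffile

pixels = tifffile.imread("SomeTissueSample.ome.tiff")  # e.g. shape (C, H, W)
nuclear = pixels[0]   # assumed nuclear channel (e.g. DAPI)
membrane = pixels[1]  # assumed membrane channel

# Whole-cell models like Mesmer take a (batch, H, W, 2) nuclear+membrane stack.
image = np.stack([nuclear, membrane], axis=-1)[np.newaxis, ...]
np.savez_compressed("SomeTissueSample.npz", image=image)
```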
DeepCell divides the preprocessed input into 512x512 tiles, predicts them in batches, then recombines the predictions into a single image for postprocessing. This makes the prediction step very resource-efficient; note, however, that pre- and post-processing still operate on the entire image. This is particularly problematic for post-processing, which is very resource-intensive.
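Here's a rough sketch of the tiling idea, assuming a plain 2D image (DeepCell's actual tiler also overlaps and blends tiles, which is omitted here):

```python
# Illustrative only: pad an image to a multiple of 512, then cut it into
# 512x512 tiles ready for batched prediction. Overlap/blending omitted.
import numpy as np

TILE = 512

def to_tiles(image: np.ndarray) -> np.ndarray:
    """Cut a 2D (H, W) image into (N, 512, 512) tiles, zero-padding edges."""
    h, w = image.shape
    padded = np.pad(image, ((0, -h % TILE), (0, -w % TILE)))
    rows, cols = padded.shape[0] // TILE, padded.shape[1] // TILE
    tiles = padded.reshape(rows, TILE, cols, TILE).swapaxes(1, 2)
    return tiles.reshape(rows * cols, TILE, TILE)

tiles = to_tiles(np.zeros((1000, 1300), dtype=np.float32))
print(tiles.shape)  # (6, 512, 512): 2 rows x 3 columns of tiles
```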
The prediction step outputs, for each pixel, how likely it is to be the center of its cell. The post-processing step runs image-analysis algorithms to create the final cell masks; it operates a bit like a "flood fill" expanding each center outward. This uses the h_maxima grayscale reconstruction algorithm, which is (counterintuitively) far slower than prediction itself for large images.
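For reference, a minimal sketch of that step using scikit-image's `h_maxima` implementation; the probability map and `h` value below are made up for illustration:

```python
# h_maxima via grayscale reconstruction: keeps only maxima standing at
# least `h` above their surroundings. Inputs here are made up.
import numpy as np
from skimage.morphology import h_maxima

rng = np.random.default_rng(42)
center_prediction = rng.random((1024, 1024)).astype(np.float32)
seeds = h_maxima(center_prediction, h=0.1)  # binary mask of seed maxima
```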
Once we have cell predictions, we need to generate quantified metrics for the cells: location, size, channel intensities, and so on. This is crucial for downstream processing & analysis, including in a QuPath desktop environment. For example, a researcher might provide an analyzed & packaged QuPath project to a principal investigator for review.
QuPath is distributed as JAR files. Bioinformaticians typically run Groovy scripts in the embedded QuPath environment; however, we don't have a desktop or VM environment for that. Instead, we compile Kotlin code against the JARs to run on Google Batch.
The source code for quantifying the metrics plus building the container is located in a different repository: qupath-project-initializer.
QuPath measurements are computed one cell at a time, and the algorithm re-fetches the image region containing the cell for each cell. This is prohibitively expensive for bulk measurement. Adding code to prefetch the image into memory, then retrieve subregions from memory, provided a dramatic ~99% speed-up.
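The real change lives in the Kotlin code in qupath-project-initializer; this self-contained Python sketch (all names hypothetical) just illustrates the access-pattern shift:

```python
# Illustration of the optimization, not the actual QuPath/Kotlin code:
# replace one region fetch per cell with a single full-image fetch plus
# in-memory slicing.
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((2048, 2048)).astype(np.float32)  # stand-in channel
boxes = [(y, x, y + 32, x + 32)                      # hypothetical cell boxes
         for y in range(0, 2048, 32) for x in range(0, 2048, 32)]

def fetch_region(y0, x0, y1, x1):
    # Stand-in for an image-server read: imagine disk I/O + decode here.
    return image[y0:y1, x0:x1].copy()

# Before: one (expensive) fetch per cell.
slow = [fetch_region(*b).mean() for b in boxes]

# After: fetch once, slice subregions from memory.
pixels = fetch_region(0, 0, 2048, 2048)
fast = [pixels[y0:y1, x0:x1].mean() for (y0, x0, y1, x1) in boxes]

assert np.allclose(slow, fast)
```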
GPU makes a dramatic difference in model inference time.
Memory usage increases linearly with the number of pixels.
Here are some areas we've identified:
This repo uses git-lfs (Git Large File Storage) to keep large files (like sample numpy data) out of the source history. This process is automatic & transparent, but requires `git-lfs` to be installed beforehand. Please see these instructions.
TL;DR:

- macOS: `brew install git-lfs`
- Linux: `sudo [apt-get | yum] install git-lfs`
- Windows: `git-lfs` is included in the Git distribution.

Set these repository variables:

- `DOCKERHUB_REPOSITORY`, e.g. `dchaley`. If you set this, you also need to set the following variables, and the corresponding secrets need to exist in the GCP project:
  - `_DOCKERHUB_USERNAME_SECRET_NAME`, e.g. `dockerhub-username/versions/1`
  - `_DOCKERHUB_PASSWORD_SECRET_NAME`, e.g. `dockerhub-password/versions/1`
- `GCP_ARTIFACT_REPOSITORY`, e.g. `my-repository`
- `GCP_PROJECT_ID`, e.g. `my-gcp-project-4321`
- `GCP_REGION`, e.g. `us-central1`
Nothing special. You just need Python 3.10 at the latest.
```
python3.10 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
Some incantations are needed to work on Apple silicon computers. You also need Python 3.9. DeepCell depends on `tensorflow`, not `tensorflow-macos`. Unfortunately we need `tensorflow-macos` specifically to provide TF 2.8 on arm64 chips. The solution is to install the packages one at a time so that the DeepCell failure doesn't impact the other packages.
```
python3.9 -m venv venv
source venv/bin/activate
pip install -r requirements-mac-arm64.txt
cat requirements.txt | xargs -n 1 pip install
# Let it fail to install DeepCell, then:
pip install -r requirements.txt --no-deps
# Lastly install our own library. Note --no-deps
pip install --editable . --no-deps
```
I think, but am not sure, that the first `--no-deps` invocation is unnecessary, as `pip install` installs dependencies.