Apache License 2.0

AlphaFold batch inference with Vertex AI Pipelines

This repository compiles prescriptive guidance and code samples demonstrating how to operationalize AlphaFold batch inference using Vertex AI Pipelines.

The code samples are based on v2.3.2 of AlphaFold.

For the protein viewer in the AlphaFold Portal, we use the 3Dmol binary from a CDN, which is licensed under the BSD 3-Clause License.

Note: the AlphaFold Portal README will be available in the same repository here.

Solutions architecture overview

The following diagram depicts the architecture of the solution.

Architecture

Key design patterns:

The core of the solution is a set of parameterized KFP components that encapsulate key tasks in the AlphaFold inference workflow. The AlphaFold KFP components can be composed to implement optimized inference workflows. Currently, the repo contains two example pipelines.

The repository also includes a set of Jupyter notebooks that demonstrate how to configure, submit, and analyze pipeline runs.

Repository structure

src/components - KFP components encapsulating AlphaFold inference tasks

src/pipelines - Example inference pipelines

env-setup - Terraform for setting up a sandbox environment

*.ipynb - Jupyter notebooks demonstrating how to configure and run the inference pipeline.

Managing genetic databases

AlphaFold inference utilizes a set of genetic databases. To maximize database search performance when running multiple inference pipelines concurrently, the databases are hosted on a high performance NFS file share managed by Cloud Filestore.

Before running the pipelines, you need to configure a Cloud Filestore instance and populate it with the genetic databases.
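For reference, the Filestore share can also be mounted manually from a Compute Engine VM in the same VPC to inspect the databases. This is only a sketch: the IP address, share name, and mount point below are placeholder assumptions, not values from this repo.

```shell
# Placeholders -- substitute the values of your Filestore instance.
FILESTORE_IP="10.0.0.2"
FILESTORE_SHARE="datasets"
MOUNT_POINT="/mnt/nfs/alphafold"

# Mount only if an NFS client is available (no-op otherwise).
if command -v mount.nfs >/dev/null 2>&1; then
  sudo mkdir -p "${MOUNT_POINT}"
  sudo mount -t nfs "${FILESTORE_IP}:/${FILESTORE_SHARE}" "${MOUNT_POINT}"
  ls "${MOUNT_POINT}"
fi
```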

The Environment requirements section describes how to configure the GCP environment required to run the pipelines, including Cloud Filestore configuration.

The repo also includes an example Terraform configuration that builds a sandbox environment meeting the requirements. If you intend to use the provided Terraform configuration you need to pre-stage the genetic databases and model parameters in a Google Cloud Storage bucket. When the Terraform configuration is applied, the databases will be copied from the GCS bucket to the provisioned Filestore instance and the model parameters will be copied to the provisioned regional GCS bucket.
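Pre-staging can be done with the Google Cloud CLI. A minimal sketch, assuming the hypothetical bucket and path below (replace them with your own values):

```shell
# Example values only -- use your own bucket and destination path.
# The gcs_dbs_path Terraform variable later references this location
# without the gs:// prefix.
GCS_BUCKET="gs://my-project-1-alphaf-bucket"
GCS_DBS_PATH="af-dataset"
DOWNLOAD_DIR="${DOWNLOAD_DIR:-$HOME/alphafold-data}"  # where the databases were downloaded

# Copy the databases and model parameters to the bucket
# (skipped if the Google Cloud CLI is not installed).
if command -v gcloud >/dev/null 2>&1; then
  gcloud storage cp -r "${DOWNLOAD_DIR}"/* "${GCS_BUCKET}/${GCS_DBS_PATH}/"
fi
```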

Follow the instructions on the AlphaFold repo to download the genetic databases and model parameters.

Notes:

These are the current minimum commands required:

sudo apt install aria2
scripts/download_all_data.sh <DOWNLOAD_DIR>
scripts/download_small_bfd.sh <DOWNLOAD_DIR>
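Before downloading, it is worth checking free disk space: the full databases occupy roughly 2.6 TB unpacked, while the reduced (small BFD) set needs on the order of hundreds of GB. The threshold in the sketch below is an approximate assumption, not a figure from this repo.

```shell
# Approximate space requirement in GB for the reduced (small BFD) databases.
REQUIRED_GB=600
DOWNLOAD_DIR="${DOWNLOAD_DIR:-.}"

# Free space on the filesystem holding DOWNLOAD_DIR (GNU df).
AVAIL_GB="$(df --output=avail -BG "${DOWNLOAD_DIR}" | tail -n 1 | tr -dc '0-9')"
if [ "${AVAIL_GB:-0}" -lt "${REQUIRED_GB}" ]; then
  echo "Warning: only ${AVAIL_GB:-0} GB free; the download may not fit." >&2
fi
```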

Environment requirements

The diagram below summarizes the Google Cloud environment configuration required to run AlphaFold inference pipelines.

Infrastructure

Quick-start Guide

The repo includes a Terraform configuration that can be used to provision a sandbox environment that complies with the requirements detailed in the previous section. The configuration builds the sandbox environment as follows:

You need to have "Owner" privileges to set up the sandbox environment.

You will be using Cloud Shell to deploy the infrastructure by applying the Terraform configuration.

Step 1 - Select a Google Cloud project and open Cloud Shell

In the Google Cloud Console, navigate to your project and open Cloud Shell. Make sure you have Owner privileges.

Step 2 - Set up environment variables

Run the following commands to set up environment variables.

export PROJECT_ID=<YOUR PROJECT ID>
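A small guard helps catch an unset project ID before continuing. This is a sketch, not part of the repo; the fallback shown is the sample project ID used later in this guide.

```shell
# Fall back to the sample project ID from this guide; replace with yours.
PROJECT_ID="${PROJECT_ID:-my-project-1}"
export PROJECT_ID

# Point gcloud at the selected project
# (skipped if the Google Cloud CLI is not installed).
if command -v gcloud >/dev/null 2>&1; then
  gcloud config set project "${PROJECT_ID}"
fi
```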

Step 3 - Apply the Terraform configuration

First, clone the repo and prepare environment variables.

cd ${HOME}
git clone https://github.com/GoogleCloudPlatform/vertex-ai-alphafold-inference-pipeline.git

REPO="vertex-ai-alphafold-inference-pipeline"
SOURCE_ROOT=${HOME}/${REPO}
TERRAFORM_RUN_DIR=${SOURCE_ROOT}/env-setup
cd ${TERRAFORM_RUN_DIR}

Create the Terraform variables file by copying the provided template, then set the variables to reflect your environment. The sample file lists all of the required variables.

cp ${TERRAFORM_RUN_DIR}/terraform-sample.tfvars ${TERRAFORM_RUN_DIR}/terraform.tfvars

Edit the Terraform variables file. If using Vim:

 vim ${TERRAFORM_RUN_DIR}/terraform.tfvars

The following are sample values only, to illustrate an actual .tfvars file; modify the values to match your environment:

project_id              = "my-project-1"
region                  = "us-central1"
zone                    = "us-central1-b"
network_name            = "alphaf-network"
subnet_name             = "alphaf-subnetwork"
workbench_instance_name = "alphaf-wb"
filestore_instance_id   = "alphaf-nfs"
gcs_bucket_name         = "my-project-1-alphaf-bucket"
gcs_dbs_path            = "alphafold-uscentral-dbcopy/new/af-dataset"
ar_repo_name            = "alphaf-kfp"

Note: gcs_dbs_path should not include the gs:// prefix.
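A quick sanity check for the gs:// rule (a sketch, not part of the repo's tooling), using the sample value from the .tfvars file above:

```shell
# Value copied from the sample .tfvars above.
GCS_DBS_PATH="alphafold-uscentral-dbcopy/new/af-dataset"

# Warn if the path was mistakenly given a gs:// prefix.
case "${GCS_DBS_PATH}" in
  gs://*) echo "Error: remove the gs:// prefix from gcs_dbs_path" >&2 ;;
esac
```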

Notes:

Apply the Terraform configuration. This step may take a few hours, so be patient: the builds of the two CUDA images are now part of the Terraform configuration (pipeline_images.tf). You can monitor the 1.5-2 hour image build job from Logs Explorer or from the URL printed in the terraform apply output. The build runs in the background, so you may leave Cloud Shell; this improves on the previous version, where you had to keep the terminal session open until the image build completed.

terraform -chdir="${TERRAFORM_RUN_DIR}" init
terraform -chdir="${TERRAFORM_RUN_DIR}" apply

In addition to provisioning and configuring the required services, the Terraform configuration starts a Vertex Training job that copies the reference databases from the GCS location to the provisioned Filestore instance. You can monitor the job using the links printed out by Terraform. The job may take a couple of hours to complete.
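Besides the links printed by Terraform, the copy job can be located with the Google Cloud CLI. A sketch assuming the us-central1 region from the sample .tfvars:

```shell
REGION="${REGION:-us-central1}"  # region from the sample .tfvars

# List recent Vertex AI custom jobs to find the database-copy job
# (skipped if the Google Cloud CLI is not installed or not authenticated).
if command -v gcloud >/dev/null 2>&1; then
  JOBS="$(gcloud ai custom-jobs list --region="${REGION}" --limit=5 2>/dev/null)"
else
  JOBS=""
fi
echo "${JOBS}"
```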


Preparing Vertex Workbench

In the GCP project, a Vertex AI Workbench user-managed notebook instance serves as the development and experimentation environment for customizing, submitting, and analyzing inference pipeline runs. A few setup steps are required before you can use the example notebooks.

Open Vertex AI Workbench and connect to the notebook instance by clicking the "OPEN JUPYTERLAB" link beside the notebook name.

On the JupyterLab interface, launch a new Terminal tab and execute the following command:

git clone https://github.com/GoogleCloudPlatform/vertex-ai-alphafold-inference-pipeline.git

(Optional) Set up the AlphaFold Portal

Follow the README.md under the $SOURCE_ROOT/env-setup-portal directory, or view the README directly on GitHub:

https://github.com/GoogleCloudPlatform/vertex-ai-alphafold-inference-pipeline/blob/alphafold-portal/env-setup-portal/README.md

Congratulations!

Now you're ready to follow the instructions in the notebook "1-alphafold-quick-start.ipynb".

Clean up

If you want to destroy all the deployed infrastructure, follow these instructions.

Note that all infrastructure will be destroyed, including any changes you have made to the code inside the Vertex Workbench notebook instance. Make sure you commit all your changes before executing this step.

Back in the Google Cloud Console, open Cloud Shell and execute the following commands.

REPO="vertex-ai-alphafold-inference-pipeline"
SOURCE_ROOT=${HOME}/${REPO}
TERRAFORM_RUN_DIR=${SOURCE_ROOT}/env-setup
cd ${TERRAFORM_RUN_DIR}

terraform -chdir="${TERRAFORM_RUN_DIR}" destroy

What Next?

Check out the AlphaFold Portal (user interface) installation guide for the Vertex AI AlphaFold inference pipeline here.