This repository compiles prescriptive guidance and code samples demonstrating how to operationalize AlphaFold batch inference using Vertex AI Pipelines.
Code samples are based on v2.3.2 of AlphaFold.
For the protein viewer in the Alphafold Portal, we use the 3Dmol CDN binary, which is licensed under the BSD 3-Clause License.
Note: the Alphafold Portal README is available in this same repository.
The following diagram depicts the architecture of the solution.
Key design patterns:
The core of the solution is a set of parameterized KFP components that encapsulate key tasks in the AlphaFold inference workflow. The AlphaFold KFP components can be composed to implement optimized inference workflows. Currently, the repo contains two example pipelines:
The repository also includes a set of Jupyter notebooks that demonstrate how to configure, submit, and analyze pipeline runs.
src/components - KFP components encapsulating AlphaFold inference tasks
src/pipelines - Example inference pipelines
env-setup - Terraform configuration for setting up a sandbox environment
*.ipynb - Jupyter notebooks demonstrating how to configure and run the inference pipelines
AlphaFold inference utilizes a set of genetic databases. To maximize database search performance when running multiple inference pipelines concurrently, the databases are hosted on a high-performance NFS file share managed by Cloud Filestore.
Before running the pipelines, you need to configure a Cloud Filestore instance and populate it with the genetic databases.
The Environment requirements section describes how to configure the GCP environment required to run the pipelines, including Cloud Filestore configuration.
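For context, the pipeline components access the databases over NFS. If you want to inspect the share manually, the following is a minimal sketch of mounting it from a Compute Engine VM on the same VPC; the IP address, share name, and mount point are placeholder assumptions:

# Placeholders: 10.0.0.2 is your Filestore instance IP, vol1 is your file share name.
sudo apt-get install -y nfs-common
sudo mkdir -p /mnt/nfs
sudo mount -t nfs 10.0.0.2:/vol1 /mnt/nfs
ls /mnt/nfs  # the genetic databases should be visible under the mount point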
The repo also includes an example Terraform configuration that builds a sandbox environment meeting the requirements. If you intend to use the provided Terraform configuration you need to pre-stage the genetic databases and model parameters in a Google Cloud Storage bucket. When the Terraform configuration is applied, the databases will be copied from the GCS bucket to the provisioned Filestore instance and the model parameters will be copied to the provisioned regional GCS bucket.
Follow the instructions on the AlphaFold repo to download the genetic databases and model parameters.
Notes:
At a minimum, the following commands are required:
sudo apt install aria2
scripts/download_all_data.sh <DOWNLOAD_DIR>
scripts/download_small_bfd.sh <DOWNLOAD_DIR>
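If you plan to use the provided Terraform configuration (next section), stage the downloaded data in GCS with gsutil. A minimal sketch, where <DOWNLOAD_DIR> and the bucket path are placeholders for your own values:

# Stage the downloaded databases and model parameters in GCS;
# the destination becomes your <GCS_DBS_PATH> (without the gs:// prefix).
gsutil -m cp -r <DOWNLOAD_DIR>/* gs://<BUCKET_NAME>/af-dataset/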
The diagram below summarizes the Google Cloud environment configuration required to run the AlphaFold inference pipelines.
- A Filestore instance created with the connect-mode setting set to PRIVATE_SERVICE_ACCESS
- A service account with the storage.admin and aiplatform.user roles
- Model parameters staged at gs://<BUCKET_NAME>/params
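If you configure the environment manually rather than with the provided Terraform, the following gcloud sketch illustrates the Filestore and IAM requirements above; all names, sizes, and the service account email are placeholder assumptions:

# Create a Filestore instance with connect-mode set to PRIVATE_SERVICE_ACCESS.
gcloud filestore instances create alphaf-nfs \
  --zone=us-central1-b \
  --tier=BASIC_SSD \
  --file-share=name=datasets,capacity=3TB \
  --network=name=alphaf-network,connect-mode=PRIVATE_SERVICE_ACCESS

# Grant the roles required by the pipelines to the service account.
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --member="serviceAccount:<SERVICE_ACCOUNT_EMAIL>" \
  --role="roles/storage.admin"
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --member="serviceAccount:<SERVICE_ACCOUNT_EMAIL>" \
  --role="roles/aiplatform.user"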
The repo includes a Terraform configuration that can be used to provision a sandbox environment that complies with the requirements detailed in the previous section. The following steps apply the configuration to build the sandbox environment.
You need to have "Owner" privileges to set up the sandbox environment.
You will be using Cloud Shell to deploy the infrastructure by applying the Terraform configuration.
In the Google Cloud Console, navigate to your project and open Cloud Shell.
Run the following commands to set up environment variables.
export PROJECT_ID=<YOUR PROJECT ID>
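Optionally, point gcloud at the project as well so subsequent commands use it by default:

gcloud config set project ${PROJECT_ID}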
First, clone the repo and prepare environment variables.
cd ${HOME}
git clone https://github.com/GoogleCloudPlatform/vertex-ai-alphafold-inference-pipeline.git
REPO="vertex-ai-alphafold-inference-pipeline"
SOURCE_ROOT=${HOME}/${REPO}
TERRAFORM_RUN_DIR=${SOURCE_ROOT}/env-setup
cd ${TERRAFORM_RUN_DIR}
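To confirm the clone succeeded and the Terraform files are in place, list the directory; you should see the configuration files, including terraform-sample.tfvars:

ls ${TERRAFORM_RUN_DIR}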
Create the Terraform variables file by copying the template, then set the variables to reflect your environment. The sample file lists all the required variables, which are defined as follows.
<PROJECT_ID> - your GCP project ID
<PROJECT_NUMBER> - your GCP project number
<REGION> - your compute region for the Filestore instance and the Vertex Workbench instance
<ZONE> - your compute zone
<NETWORK_NAME> - the name for the VPC network
<SUBNET_NAME> - the name for the VPC subnetwork
<WORKBENCH_INSTANCE_NAME> - the name for the Vertex Workbench instance. Example: alpha-wb
<FILESTORE_INSTANCE_ID> - the instance ID of the Filestore instance. See Naming your instance. Example: alphaf-nfs
<GCS_BUCKET_NAME> - the name of the regional GCS bucket. See Bucket naming guidelines
<GCS_DBS_PATH> - the path to the GCS location of the genetic databases and model parameters
<ARTIFACT_REGISTRY_REPO_NAME> - the Artifact Registry repository name for uploading pipeline images. Example: alphaf-kfp
CLIENT_ID - from the OAuth Consent Screen. Populate this value later when setting up the Alphafold Portal
CLIENT_SECRET - from the OAuth Consent Screen. Populate this value later when setting up the Alphafold Portal
FLASK_SECRET - a generated random string. See Generate Random UUID. Populate this value later when setting up the Alphafold Portal
IS_GCR_IO_REPO - set to "true" if you have used gcr.io before or have an existing Alphafold Pipeline setup and want to stick with it; set to "false" if this is a new Alphafold Pipeline setup, skipping gcr.io for now. Populate this value later when setting up the Alphafold Portal

cp ${TERRAFORM_RUN_DIR}/terraform-sample.tfvars ${TERRAFORM_RUN_DIR}/terraform.tfvars
Edit the Terraform variables file. If using Vim:
vim ${TERRAFORM_RUN_DIR}/terraform.tfvars
The following are sample values that illustrate the .tfvars file format; modify the values to match your environment:
project_id = "my-project-1"
region = "us-central1"
zone = "us-central1-b"
network_name = "alphaf-network"
subnet_name = "alphaf-subnetwork"
workbench_instance_name = "alphaf-wb"
filestore_instance_id = "alphaf-nfs"
gcs_bucket_name = "my-project-1-alphaf-bucket"
gcs_dbs_path = "alphafold-uscentral-dbcopy/new/af-dataset"
ar_repo_name = "alphaf-kfp"
Note: gcs_dbs_path shouldn't include the gs:// prefix (e.g. "my-project-1-alphaf-bucket/af-dataset", not "gs://my-project-1-alphaf-bucket/af-dataset").
Note: the model parameters are expected under <GCS_DBS_PATH>/params.
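Before applying Terraform, you can check that the parameters are staged where expected; a minimal check, remembering that <GCS_DBS_PATH> carries no gs:// prefix:

gsutil ls gs://<GCS_DBS_PATH>/params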
Apply the Terraform configuration. This step may take a few hours, so be patient: the builds of the two CUDA images are now part of the Terraform configuration (pipeline_images.tf). You can monitor the 1.5-2 hour image build job either from Logs Explorer or from the URL printed in the terraform apply output. You may close Cloud Shell, since the build runs as a background process. This is an improvement over the previous version, where you had to keep the terminal session open for the image build to complete.
terraform -chdir="${TERRAFORM_RUN_DIR}" init
terraform -chdir="${TERRAFORM_RUN_DIR}" apply
In addition to provisioning and configuring the required services, the Terraform configuration starts a Vertex Training job that copies the reference databases from the GCS location to the provisioned Filestore instance. You can monitor the job using the links printed out by Terraform. The job may take a couple of hours to complete.
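Besides the links printed by Terraform, you can also watch the copy job from the command line; a sketch, assuming the region from your tfvars file:

gcloud ai custom-jobs list --region=us-central1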
Note: Terraform stores the environment state in terraform.tfstate; read the official Terraform documentation before modifying or moving this file.

In the GCP project, a Vertex Workbench user-managed notebook instance is used as a development/experimentation environment to customize, submit, and analyze inference pipeline runs. A couple of setup steps are required before you can use the example notebooks.
Open Vertex AI Workbench and connect to the notebook instance by clicking the "OPEN JUPYTERLAB" link beside the notebook name.
On the JupyterLab interface, launch a new Terminal tab and execute the following command:
git clone https://github.com/GoogleCloudPlatform/vertex-ai-alphafold-inference-pipeline.git
Follow the README.md under the $SOURCE_ROOT/env-setup-portal directory, or view the README directly on GitHub: https://github.com/GoogleCloudPlatform/vertex-ai-alphafold-inference-pipeline/blob/alphafold-portal/env-setup-portal/README.md
Now you're ready to follow the instructions in the "1-alphafold-quick-start.ipynb" notebook.
If you want to destroy all the deployed infrastructure, follow these instructions.
Note that all infrastructure will be destroyed, including any changes you have made to the code inside the Vertex Workbench notebook instance. Make sure you commit all your changes before executing this step.
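For example, from a JupyterLab terminal on the Workbench instance; this assumes you have a writable remote configured:

cd ${HOME}/vertex-ai-alphafold-inference-pipeline
git add -A
git commit -m "save work before destroying the sandbox"
git push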
Back in the Google Cloud Console, open Cloud Shell and execute the following commands.
REPO="vertex-ai-alphafold-inference-pipeline"
SOURCE_ROOT=${HOME}/${REPO}
TERRAFORM_RUN_DIR=${SOURCE_ROOT}/env-setup
cd ${TERRAFORM_RUN_DIR}
terraform -chdir="${TERRAFORM_RUN_DIR}" destroy
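After the destroy completes, you can confirm that no resources remain in the Terraform state:

terraform -chdir="${TERRAFORM_RUN_DIR}" show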
Check out the Alphafold Portal (user interface) installation guide for the Vertex AI Alphafold Inference Pipeline here.