STRIDES / NIHCloudLabGCP

Documentation and tutorials on using GCP for biomedical research
https://cloud.nih.gov/resources/cloudlab/

GCP Tutorial Resources

We have pulled together a variety of tutorials here from disparate sources. Some use Compute Engine, others use Vertex AI notebooks, and others use only managed services. Tutorials are organized by research method, but we try to designate what GCP services are used to help you navigate.

Overview of Page Contents

Biomedical Workflows on GCP

There are a lot of ways to run workflows on GCP. Here we list a few possibilities, each of which may work for different research aims. As you walk through the various tutorials below, think about how you might run that workflow more efficiently using one of the other methods listed here.

Artificial Intelligence and Machine Learning

Machine learning is a subfield of artificial intelligence that focuses on the development of algorithms and models that enable computers to learn from data and make predictions or decisions without being explicitly programmed. Machine learning on GCP generally occurs within Vertex AI. You can learn more about machine learning on GCP at this Google Crash Course. For hands-on examples, try out this module developed by San Francisco State University or this one from the University of Arkansas developed for the NIGMS Sandbox Project.
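If you prefer to work from the Vertex AI Python SDK rather than the console, the sketch below shows roughly what launching a custom training job looks like. This is a minimal illustration, not part of the tutorials above: the project ID, region, bucket, training script, and container image are all placeholders you would replace with your own values (check the Vertex AI documentation for current pre-built training image URIs).

```python
# Minimal sketch of launching a custom training job with the Vertex AI Python SDK.
# All names below (project, region, bucket, script, image) are placeholders.
from google.cloud import aiplatform

aiplatform.init(
    project="your-project-id",          # hypothetical project ID
    location="us-central1",             # region where your Vertex AI resources live
    staging_bucket="gs://your-bucket",  # bucket used to stage training artifacts
)

job = aiplatform.CustomTrainingJob(
    display_name="cloudlab-training-demo",
    script_path="train.py",             # your local training script
    container_uri="us-docker.pkg.dev/vertex-ai/training/sklearn-cpu.1-0:latest",  # example pre-built image; verify in the docs
    requirements=["pandas"],            # extra pip packages installed into the training container
)

# Run the training job on a single CPU machine and wait for it to finish
job.run(
    replica_count=1,
    machine_type="n1-standard-4",
    sync=True,
)
```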

Now that the age of generative AI (Gen AI) has arrived, Google has released a host of Gen AI offerings within the Vertex AI suite. Generative AI models can extract information from text, transform speech into text, generate images from descriptions (and vice versa), and much more. The Vertex AI Studio console allows you to rapidly create, test, and train generative AI models on the cloud in a safe and secure setting; see our overview in this tutorial. The studio also provides ready-to-use models in the Model Garden, ranging from foundation models to fine-tunable models and task-specific solutions.
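As a taste of what calling a foundation model from code looks like, here is a minimal sketch using the Vertex AI SDK. The project ID, region, and model name are placeholders; check the Model Garden for the models currently available to your project.

```python
# Minimal sketch of calling a foundation model through the Vertex AI SDK.
# Project ID, region, and model name are placeholders.
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-project-id", location="us-central1")

model = GenerativeModel("gemini-1.0-pro")  # example foundation model name
response = model.generate_content(
    "Summarize the role of DNA methylation in gene regulation in two sentences."
)
print(response.text)
```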

Medical Image Segmentation

Medical image analysis is the application of computational algorithms and techniques to extract meaningful information from medical images for diagnosis, treatment planning, and research purposes. Medical image analysis requires large image files and often elastic storage and accelerated computing.

Download Data From the Sequence Read Archive (SRA)

Next-generation sequence data are housed in the NCBI Sequence Read Archive (SRA). You can access these data using the SRA Toolkit. We walk you through this in this notebook, including how to use BigQuery to generate your list of accessions. You can also use BigQuery to create a list of accessions for download using this setup guide and this query guide. Additional example notebooks can be found in this NCBI repo. In particular, we recommend this notebook (https://github.com/ncbi/ASHG-Workshop-2021/blob/main/1_Basic_BigQuery_Examples.ipynb), which goes into more detail on using BigQuery to access the results of the SRA Taxonomic Analysis Tool, which often differ from the user-supplied species name due to contamination, sequencing error, or samples being metagenomic in nature. Further, this notebook does a deep dive on parsing the BigQuery results and may give you some good ideas on how to search for samples in SRA. The SRA metadata and taxonomy analyses are in separate BigQuery tables; you can learn how to join those two tables using SQL from this PowerPoint or from our tutorial here. Finally, NCBI released this workshop, which walks through a wide variety of BigQuery applications with NCBI datasets.
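To give a sense of how these pieces fit together, the sketch below builds a small accession list from the public SRA metadata table in BigQuery and then fetches the runs with the SRA Toolkit. It assumes the SRA Toolkit is installed on your machine; the project ID and the organism filter are examples you would replace, and column names should be checked against the table schema in the notebooks linked above.

```python
# Minimal sketch: query the public SRA metadata table for a few accessions,
# then download them with the SRA Toolkit (prefetch + fasterq-dump).
# The project ID and organism filter are placeholders.
import subprocess
from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")

query = """
    SELECT acc
    FROM `nih-sra-datastore.sra.metadata`
    WHERE organism = 'Mycobacterium tuberculosis'
      AND consent = 'public'
    LIMIT 5
"""
accessions = [row.acc for row in client.query(query).result()]

for acc in accessions:
    subprocess.run(["prefetch", acc], check=True)       # pull the .sra file
    subprocess.run(["fasterq-dump", acc], check=True)    # convert to FASTQ
```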

Variant Calling

Genomic variant calling is the process of identifying and characterizing genetic variations from DNA sequencing data to understand differences in an individual's genetic makeup.

Query a VCF file in BigQuery

The output of genomic variant calling workflows is a file in the variant call format (VCF). These are often large, structured data files that can be searched using database query tools such as BigQuery.
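Once variant records have been loaded into a BigQuery table (for example with a VCF-to-BigQuery import tool), they can be explored with ordinary SQL. The sketch below is a hypothetical example: the project, dataset, table, and column names are placeholders you would replace with the schema produced by your own import.

```python
# Minimal sketch of querying variant records stored in a BigQuery table.
# Project, dataset, table, and column names are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")

query = """
    SELECT reference_name, COUNT(*) AS n_variants
    FROM `your-project-id.your_dataset.variants`
    WHERE reference_name IN ('chr1', 'chr2')
    GROUP BY reference_name
    ORDER BY n_variants DESC
"""
for row in client.query(query).result():
    print(row.reference_name, row.n_variants)
```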

Genome Wide Association Studies

Genome-wide association studies (GWAS) are large-scale investigations that analyze the genomes of many individuals to identify common genetic variants associated with traits, diseases, or other phenotypes.

Proteomics

Proteomics is the study of the entire set of proteins in a cell, tissue, or organism, aiming to understand their structure, function, and interactions to uncover insights into biological processes and diseases. Although most primary proteomic analyses occur in proprietary software platforms, a lot of secondary analysis happens in Jupyter or R notebooks; we give several examples here:

RNAseq and Transcriptome Assembly

RNA-seq analysis is a high-throughput sequencing method that allows the measurement and characterization of gene expression levels and transcriptome dynamics. Workflows are typically run using workflow managers, and final results can often be visualized in notebooks.

Transcriptome assembly is the process of reconstructing the complete set of RNA transcripts in a cell or tissue from fragmented sequencing data, providing valuable insights into gene expression and functional analysis.

Single Cell RNAseq

Single-cell RNA sequencing (scRNA-seq) is a technique that enables the analysis of gene expression at the individual cell level, providing insights into cellular heterogeneity, identifying rare cell types, and revealing cellular dynamics and functional states within complex biological systems.

ATACseq and Single Cell ATACseq

ATAC-seq is a technique that allows scientists to understand how DNA is packaged in cells by identifying the regions of DNA that are accessible and potentially involved in gene regulation. This module walks you through an ATACseq and single-cell ATACseq workflow on Google Cloud. The module was developed by the University of Nebraska Medical Center for the NIGMS Sandbox Project.

Methylseq

As one of the most abundant and well-studied epigenetic modifications, DNA methylation plays an essential role in normal cell development and has various effects on transcription, genome stability, and DNA packaging within cells. Methylseq is a technique to identify methylated regions of the genome.

Metagenomics

Metagenomics is the study of genetic material collected directly from environmental samples, enabling the exploration of microbial communities, their diversity, and their functional potential, without the need for laboratory culturing. This module walks you through conducting a metagenomic analysis using the command line and Nextflow. The module was developed by the University of South Dakota as part of the NIGMS Sandbox Project.

Multiomic Analysis and Biomarker Discovery

Multiomic analysis involves integrating data across modalities (e.g., genomic, transcriptomic, phenotypic) to generate additive insights.

Biomarker discovery is the process of identifying specific molecules or characteristics that can serve as indicators of biological processes, diseases, or treatment responses, aiding in diagnosis, prognosis, and personalized medicine. It is typically conducted through comprehensive analysis of various types of data, such as genomic, proteomic, metabolomic, and clinical data. Advanced techniques, including high-throughput screening, bioinformatics, and statistical analysis, are used to identify patterns or signatures that differentiate between healthy and diseased individuals, or between responders and non-responders to specific treatments.

BLAST+

NCBI BLAST (Basic Local Alignment Search Tool) is a widely used bioinformatics program provided by the National Center for Biotechnology Information (NCBI) that compares nucleotide or protein sequences against a large database to identify similar sequences and infer evolutionary relationships, functional annotations, and structural information.
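For a sense of how BLAST+ is typically invoked from a notebook or script, here is a minimal sketch using the blastn command-line tool. It assumes BLAST+ is installed and that you have already built or downloaded a BLAST database; the file and database names are placeholders.

```python
# Minimal sketch of running a nucleotide BLAST search with the BLAST+ tools.
# Assumes blastn is on the PATH; file and database names are placeholders.
import subprocess

subprocess.run(
    [
        "blastn",
        "-query", "my_sequences.fasta",   # input sequences (placeholder path)
        "-db", "my_blast_db",             # BLAST database name (placeholder)
        "-out", "blast_results.tsv",      # where to write results
        "-outfmt", "6",                   # tabular output format
        "-max_target_seqs", "5",          # report up to 5 hits per query
    ],
    check=True,
)
```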

Long Read Sequence Analysis

Long-read DNA sequence analysis involves analyzing sequencing reads typically longer than 10,000 base pairs (bp), compared with short-read sequencing where reads are about 150 bp. Oxford Nanopore offers a fairly complete set of notebook tutorials for handling long-read data, covering variant calling, RNAseq, SARS-CoV-2 analysis, and much more. You can find a list and description of the notebooks here, or clone the GitHub repo. Note that these notebooks assume you are running locally and accessing the epi2me notebook server. To run them in Cloud Lab, skip the first cell that connects to the server; the rest of the notebook should then run correctly, with a few tweaks.

Drug Discovery

The Accelerating Therapeutics for Opportunities in Medicine (ATOM) Consortium created a series of Jupyter notebooks that walk you through the ATOM approach to Drug Discovery.

These notebooks were created to run in Google Colab, so if you run them in Google Cloud, you will need to make a few modifications. First, we recommend you use a Google-managed notebook rather than a user-managed notebook, simply because the Google-managed notebooks already have TensorFlow and other dependencies installed. Be sure to attach a GPU to your instance (a T4 is fine). Also, you will need to comment out %tensorflow_version 2.x since that is a Colab-specific command. You will also need to pip install a few packages as needed. If you get errors with deepchem, try running pip install --pre deepchem[tensorflow] and/or pip install --pre deepchem[torch]. Also, some notebooks require a TensorFlow kernel, while others require PyTorch. You may also run into a Pandas error; if so, reach out to the ATOM GitHub developers for the best solution to this issue.
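A setup cell along the following lines, added at the top of a notebook, covers the adjustments described above. Which deepchem extra you install depends on whether the notebook expects a TensorFlow or PyTorch kernel.

```python
# Example setup cell for running the ATOM notebooks on a Vertex AI managed
# notebook instead of Colab, following the adjustments described above.
%pip install --pre deepchem[tensorflow]
# %pip install --pre deepchem[torch]   # use this instead for PyTorch-based notebooks

# Colab-specific magics such as the following should be commented out:
# %tensorflow_version 2.x
```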

Using Google Batch

You can interact with Google Batch directly to submit commands, or, more commonly, you can interact with it through orchestration engines like Nextflow and Cromwell. We have tutorials that utilize Google Batch using Nextflow, where we run the nf-core Methylseq pipeline, as well as several from the NIGMS Sandbox, including transcriptome assembly, multiomics, methylseq, and metagenomics.
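For a sense of what a direct submission (without a workflow manager) looks like, here is a minimal sketch using the Google Batch Python client library (google-cloud-batch). The project ID, region, container image, and command are placeholders.

```python
# Minimal sketch of submitting a job directly to Google Batch with the
# Python client library. Project ID, region, image, and command are placeholders.
from google.cloud import batch_v1

client = batch_v1.BatchServiceClient()

# One runnable that executes a simple command in a container
runnable = batch_v1.Runnable()
runnable.container = batch_v1.Runnable.Container()
runnable.container.image_uri = "ubuntu:22.04"
runnable.container.entrypoint = "/bin/bash"
runnable.container.commands = ["-c", "echo Hello from Google Batch"]

task = batch_v1.TaskSpec()
task.runnables = [runnable]

group = batch_v1.TaskGroup()
group.task_count = 1
group.task_spec = task

job = batch_v1.Job()
job.task_groups = [group]
job.logs_policy = batch_v1.LogsPolicy(
    destination=batch_v1.LogsPolicy.Destination.CLOUD_LOGGING
)

request = batch_v1.CreateJobRequest(
    parent="projects/your-project-id/locations/us-central1",  # placeholders
    job_id="cloudlab-hello-batch",
    job=job,
)
print(client.create_job(request=request).name)
```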

Using the Life Sciences API (deprecated)

The Life Sciences API is deprecated on GCP and will no longer be available on the platform after July 8, 2025; we recommend using Google Batch instead. For now, you can still interact with the Life Sciences API directly to submit commands or, more commonly, through orchestration engines like Snakemake, which at the time of writing only supports the Life Sciences API.

Public Data Sets

Google has a lot of public datasets available that you can use for your testing. These can be viewed here and can be accessed via BigQuery or directly from the cloud bucket. For example, to list the Phase 3 1000 Genomes data at the command line, type gsutil ls gs://genomics-public-data/1000-genomes-phase-3.
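You can browse the same public bucket from Python instead of the gsutil command line; a minimal sketch using an anonymous Cloud Storage client is shown below (no project or credentials are needed to read this public data).

```python
# Minimal sketch of listing objects in the public 1000 Genomes bucket
# with an anonymous Cloud Storage client.
from google.cloud import storage

client = storage.Client.create_anonymous_client()
blobs = client.list_blobs(
    "genomics-public-data", prefix="1000-genomes-phase-3/", max_results=10
)
for blob in blobs:
    print(blob.name)
```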