UC 12 - (Xihong) Whole Genome Sequencing Association Analysis pipeline

NoopDog commented 3 years ago

Interop Contact: Active in 2021: Active Researchers: Xihong Lin (Harvard T.H. Chan School of Public Health)

Analysis Question:

Large-scale Whole Genome Sequencing (WGS) studies and biobanks have been rapidly generating up to millions of whole genomes. Examples of large-scale WGS studies include the NHGRI Genome Sequencing Program (GSP), which has sequenced 140,000+ multi-ethnic whole genomes and 220,000 whole exomes, and the NHLBI Trans-Omics for Precision Medicine (TOPMed) Program, which has sequenced 190,000+ multi-ethnic whole genomes.

Analysis of WGS data is challenged by the massive number of coding and non-coding rare variants (RVs) and by the need to functionally annotate these variants. We recently developed a whole-genome variant functional annotation database and portal, FAVOR, that assembles rich functional annotations from a variety of data sources to describe the functional landscape and regulatory characteristics of variants from large-scale WGS data. We also developed a novel RV association test, STAAR, that empowers RV association analysis by effectively incorporating the multi-faceted functional annotations provided by FAVOR.

This project aims to develop a comprehensive cloud-based open-source rare variant analysis toolset to perform powerful, scalable, and resource-efficient functional annotations and phenotype-genotype rare variant association studies.

First, we will develop an open-source pipeline, FAVORannotator, for functionally annotating and efficiently storing the genotype and variant functional annotation data of a WGS/biobank study in an all-in-one file format to facilitate downstream RV association analysis.

Second, we will provide an all-in-one, open-source, cloud-based pipeline, STAARpipeline, for comprehensive and scalable rare variant association analysis of large-scale WGS and biobank data. It will use STAAR to integrate the variant functional annotations provided by FAVORannotator, and it will summarize and visualize the RV association results.

Analysis Plan:

We have obtained IRB approval for the TOPMed dataset and GSP dataset.
We have obtained dbGaP access to these studies.
Develop the functional annotation pipeline, FAVORannotator, in BioData Catalyst and AnVIL using the Terra platform.
Develop the RV association analysis pipeline, STAARpipeline, in BioData Catalyst and AnVIL using the Terra platform.
Functionally annotate TOPMed Freeze 8 and GSP Freeze 2 data using FAVORannotator.
Perform association analysis of TOPMed Freeze 8 and GSP Freeze 2 CAD data using STAARpipeline.
Store WGS common and rare variant summary statistics of TOPMed Freeze 8 lipids and GSP Freeze 2 CAD in STAARsummary.

linikujp commented 3 years ago

Updates: Met with Xihong and Michael S on July 1, 2021. Identified a potential resource to cover cloud costs for the project. However, the interoperability use case within this research project still needs to be identified.

jackDiGi commented 3 years ago

Meeting scheduled for 20 July to resolve remaining issues and finalize.

linikujp commented 3 years ago

The PI is currently working on securing funds to support the implementation of FAVORannotator and STAARpipeline in AnVIL. One possibility is to use the GCP $300 credits for a trial run.

linikujp commented 1 year ago

Decided to mark this use case as inactive, as there is no funding to support continued development.

NIH-NCPI / NCPI_use_case_tracker
