Create 1000 genomes copy with instructions on how to run a GWAS analysis

kozbo commented 4 years ago

From Beth:

I have been thinking about how to tailor the current GWAS tutorial for AnVIL. Here are my thoughts:

Currently, the biggest difference between the GWAS tutorials we are creating is how to navigate the data model, not the tools or workflows. I have already listed the GWAS workflows in the AnVIL Dockstore org: https://dockstore.org/organizations/anvil

Alisa Manning’s lab has published a GWAS tutorial (this is what I used to make the BDCat one) that uses something more like a Terra data model. You can try it with your free credits: https://anvil.terra.bio/#workspaces/amp-t2d-op/2019_ASHG_Reproducible_GWAS-V2

To showcase the AnVIL system (and not just Terra), I have been thinking we should ingest the training dataset Alisa’s lab made (we already ingested it into the Gen3 BDC data model). This data is based on 1000 Genomes: 1) downsampled VCF data for chromosomes 10 and 11, and 2) synthetic phenotypic data modeled after real traits (bmi, ldl, hdl, etc.).
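To make concrete what a per-variant test on genotype plus phenotype data looks like, here is a minimal toy sketch. It is not the workflow used in the tutorials (those are the ones listed on Dockstore). It assumes genotypes have already been reduced to alternate-allele dosages (0/1/2) per sample, ignores covariates and relatedness, and uses a normal approximation for the p-value. The function name `association_test` and the example values are hypothetical.

```python
import math

def association_test(dosages, trait):
    """Regress a quantitative trait on genotype dosage for one variant.

    Hypothetical helper for illustration only: no covariates, no kinship
    adjustment, and the p-value uses a normal approximation to the t-test.
    Returns (beta, t_stat, p_approx).
    """
    n = len(dosages)
    if n != len(trait) or n < 3:
        raise ValueError("need matched lists with at least 3 samples")
    mean_g = sum(dosages) / n
    mean_y = sum(trait) / n
    sxx = sum((g - mean_g) ** 2 for g in dosages)
    if sxx == 0:
        raise ValueError("monomorphic variant: no genotype variation")
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(dosages, trait))
    beta = sxy / sxx                      # effect size per alternate allele
    intercept = mean_y - beta * mean_g
    rss = sum((y - (intercept + beta * g)) ** 2 for g, y in zip(dosages, trait))
    se = math.sqrt(rss / (n - 2) / sxx)   # standard error of beta
    if se == 0:
        return beta, math.inf, 0.0        # perfect fit
    t_stat = beta / se
    p_approx = math.erfc(abs(t_stat) / math.sqrt(2))  # two-sided, normal approx.
    return beta, t_stat, p_approx

# Made-up dosages and trait values for six samples at a single variant.
beta, t_stat, p_approx = association_test(
    [0, 1, 2, 0, 1, 2], [1.0, 2.1, 2.9, 1.1, 1.9, 3.0]
)
```

A real analysis would loop this kind of test over every variant in the VCF and adjust for covariates and population structure, which is what the dedicated workflows handle.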

Do you work in the data model working group? How should we set up this data model to be close to what studies are like in the AnVIL right now?

Once the data is in the AnVIL, I can easily help write instructions on how to adjust this data model for use in the current GWAS tutorial.

Here is a workspace I made that uses the training data in BDCat. It includes a lot of instructions and notebooks about manipulating the data model: https://terra.biodatacatalyst.nhlbi.nih.gov/#workspaces/biodata-catalyst/BioData%20Catalyst%20GWAS%201000%20Genomes%20Tutorial

┆Issue is synchronized with this Jira Story ┆containerName: AnVIL ┆Issue Number: ANVIL-521 ┆Sprint: Backlog ┆Issue Type: Story

kozbo commented 4 years ago

➤ David Rogers commented:

Let's see if Beth can resource this one.

Let's start with a link to the current GWAS tutorial in BioData Catalyst.

kozbo commented 4 years ago

This is waiting for the phenotypic data to be added to Gen3, along with the sub-sampled VCF files (for running a quick GWAS).

Alessandro Culotti

kozbo commented 4 years ago

➤ Alessandro Culotti commented:

Kevin Osborn, Beth Sheets: The proposed release timeline for the “new” 1000 Genomes data and tutorial_dataset_1 is the week of Oct 12. The reasons for the delay are 1) problems encountered with the manifest generation and data migration with NHLBI, and 2) very recent developments to ingest the “mini” multi-sample VCFs. We are targeting a release together with the Freeze 8 Batch 3 studies.

kozbo commented 3 years ago

This is blocked by some confusing data structures in the data tables.

anvilproject / AnVIL-JIRA

Create 1000 genomes copy with instructions on how to run a GWAS analysis #494