data-yaml / auto-analyze

Benchling to Omics to NextFlow to Quilt

run omics #7

Closed drernie closed 10 months ago

drernie commented 10 months ago

Run the existing pipeline, node-style.

https://github.com/aws-samples/aws-healthomics-eventbridge-integration

cdk bootstrap aws://<ACCOUNTID>/<AWS-REGION>   # do this if your account hasn't been bootstrapped
cdk synth
cdk deploy --all

Before you test the solution, you need to subscribe to the Amazon SNS topic (its name should match *_workflow_status_topic) with your email address, so you receive email notifications if a HealthOmics workflow run fails. Follow the instructions here on how to subscribe: https://docs.aws.amazon.com/sns/latest/dg/sns-create-subscribe-endpoint-to-topic.html
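For reference, the subscription can also be created from the CLI. The topic ARN below is a placeholder for the *_workflow_status_topic ARN created by the stack, not an actual value from this solution:

```shell
# Subscribe an email address to the workflow status topic.
# The ARN is a placeholder; look up the real one with `aws sns list-topics`.
aws sns subscribe \
  --topic-arn arn:aws:sns:<AWS-REGION>:<ACCOUNTID>:<PREFIX>_workflow_status_topic \
  --protocol email \
  --notification-endpoint you@example.com
# SNS sends a confirmation email; the subscription stays inactive until confirmed.
```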

Below is an example CSV used for testing in this solution:

sample_name,read_group,fastq_1,fastq_2,platform
NA12878,Sample_U0a,s3://aws-genomics-static-{aws-region}/omics-tutorials/data/fastq/NA12878/Sample_U0a/U0a_CGATGT_L001_R1_001.fastq.gz,s3://aws-genomics-static-{aws-region}/omics-tutorials/data/fastq/NA12878/Sample_U0a/U0a_CGATGT_L001_R2_001.fastq.gz,illumina

We will be using publicly available test FASTQ files hosted in public AWS test data buckets. You can use your own FASTQ files in your S3 buckets as well.

Use the provided test file in the solution code: workflows/vep/test_data/sample_manifest_with_test_data.csv

Replace the {aws-region} string in the file contents with the AWS region in which you have deployed the solution. The publicly available FASTQ data referenced in the CSV is available in all regions where AWS HealthOmics is available. Upload this file to the input bucket created by the solution under the "fastqs" prefix:
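One way to do the substitution in place (us-east-1 here is just an example region):

```shell
# Replace every {aws-region} placeholder in the manifest with the deployment region.
# us-east-1 is an example; use the region where you deployed the solution.
sed -i 's/{aws-region}/us-east-1/g' sample_manifest_with_test_data.csv
```

Note that on macOS/BSD sed the in-place flag takes an argument, so use `sed -i ''` instead of `sed -i`.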

aws s3 cp sample_manifest_with_test_data.csv s3://<INPUTBUCKET>/fastqs/

drernie commented 10 months ago

AWS HealthOmics is integrated with Amazon EventBridge, which enables downstream event-driven automation. The solution sets up two rules within EventBridge.
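As a sketch, the failure rule presumably matches HealthOmics run status-change events. The field values below follow the documented `aws.omics` event shape, but the rule name is hypothetical and the deployed stack defines its own rules:

```shell
# Hypothetical recreation of the failure rule; the CDK stack creates the real one.
aws events put-rule \
  --name omics-run-failed \
  --event-pattern '{"source":["aws.omics"],"detail-type":["Run Status Change"],"detail":{"status":["FAILED"]}}'
```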

On successful completion of the "GATK-BP Germline fq2vcf for 30x genome" workflow, a Lambda function (post-initial) is triggered to launch the next HealthOmics workflow, VEP, using the output (i.e., the gVCF file) of the previous workflow. The outputs of that workflow run are BAM and gVCF files, which can be verified by inspecting the run's output S3 bucket and prefix.
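For illustration, the launch step maps to a HealthOmics StartRun call like the one below; the workflow ID, role ARN, run name, and parameter names are placeholders, not the solution's actual values:

```shell
# Hypothetical: start the VEP workflow, passing the gVCF produced by the previous run.
aws omics start-run \
  --workflow-id <VEP_WORKFLOW_ID> \
  --role-arn arn:aws:iam::<ACCOUNTID>:role/<OMICS_RUN_ROLE> \
  --name NA12878-vep \
  --output-uri s3://<OUTPUTBUCKET>/vep/ \
  --parameters '{"vcf":"s3://<OUTPUTBUCKET>/<RUN_ID>/out/NA12878.g.vcf.gz"}'
```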

On workflow failure, an Amazon SNS topic is the rule target. If you have subscribed to the SNS topic, you should receive a failure notification at the email address you used.

The successful completion of the "GATK-BP Germline fq2vcf for 30x genome" workflow triggers the post-initial Lambda function, which:

1. Verifies the outputs of this workflow;
2. Prepares the input payload for the next workflow, VEP, based on the event and pre-configured data; and
3. Launches the VEP workflow using the HealthOmics API.

Post VEP workflow

Upon successful completion of the VEP workflow run, its outputs are uploaded to the output S3 location. As with the GATK-BP Germline fq2vcf workflow, if the run fails or times out, the configured EventBridge rule triggers an SNS notification to the subscribed email addresses so that users can take appropriate action.
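The uploaded outputs can be spot-checked from the CLI; the bucket name below is a placeholder for the output bucket created by the solution:

```shell
# List everything the workflow runs wrote to the output bucket.
aws s3 ls s3://<OUTPUTBUCKET>/ --recursive --human-readable
```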