This GitHub repository contains the code and artifacts described in the blog post: Part 2 – Automated End to End Genomics Data Processing and Analysis using Amazon Omics and AWS Step Functions.
You will need an S3 bucket <my-artifact-bucket> within this account to upload all assets needed for deployment, and the AWS CLI installed (installation instructions here: https://github.com/aws/aws-cli/tree/v2#installation). Note that cross-region imports are not supported in Amazon Omics today. If you choose to deploy in another supported region outside of us-east-1, copy the example data used in the solution to a bucket in that region and update the permissions in the CloudFormation templates accordingly.
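For example, a minimal boto3 sketch of such a cross-region copy (the destination bucket name is illustrative, and this assumes the example-data prefix is publicly listable and readable):

```python
# Sketch: copy the public example data into a bucket you own in the target region.
import boto3

SRC_BUCKET = "aws-genomics-static-us-east-1"
SRC_PREFIX = "omics-e2e/"                  # prefix used by the example data in this solution
DEST_BUCKET = "my-regional-data-bucket"    # assumption: a bucket you created in your target region

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=SRC_BUCKET, Prefix=SRC_PREFIX):
    for obj in page.get("Contents", []):
        # Server-side copy; data lands in whichever region DEST_BUCKET lives in
        s3.copy({"Bucket": SRC_BUCKET, "Key": obj["Key"]}, DEST_BUCKET, obj["Key"])
        print("copied", obj["Key"])
```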
Navigate to the deploy/ directory within the repository and run:
sh upload_artifacts my-artifact-bucket <aws-profile-name>
NOTE
The 2nd argument <aws-profile-name> is optional; provide it if you choose to use a specific AWS profile.
Navigate to the AWS S3 Console. In the list of buckets, click on <my-artifact-bucket> and navigate to the templates prefix. Find the file named solution-cfn.yml. Copy the Object URL (begins with https://) for this object (not the S3 URI).
Navigate to the AWS CloudFormation Console. Click on Create Stack, select Template is ready, paste the https:// Object URL copied above into the Amazon S3 URL field, and click Next.
Fill in the Stack name with a name of your choice, ArtifactBucketName with <my-artifact-bucket>, and WorkflowInputsBucketName and WorkflowOutputsBucketName with new bucket names of your choice; these buckets will be created.
For the CurrentReferenceStoreId parameter: if the account that you plan to use has an existing reference store and you want to repurpose it, provide the reference store ID as the value (only one reference store is allowed per account per region). If you don't have one and want to create a new one, provide the value NONE.
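If you're not sure whether the account already has a reference store in the region, a quick check with boto3 (assuming a version recent enough to include the Amazon Omics client):

```python
# Sketch: find an existing reference store ID, or fall back to NONE
import boto3

omics = boto3.client("omics")
stores = omics.list_reference_stores().get("referenceStores", [])
if stores:
    # At most one reference store exists per account per region
    print("Use as CurrentReferenceStoreId:", stores[0]["id"])
else:
    print("No reference store found; set CurrentReferenceStoreId to NONE")
```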
Click Next on the subsequent two pages, then click Submit on the Review page.
CloudFormation will now create multiple stacks with all the necessary resources, including the Omics resources. It's recommended that users update the Omics permissions to least-privilege access when leveraging this sample code as a starting point for future production needs.
The CloudFormation stack should complete deployment in less than an hour.
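If you'd rather watch the deployment from code than from the console, a minimal sketch using a boto3 CloudFormation waiter (the stack name is a placeholder for whatever you chose above):

```python
# Sketch: block until the stack finishes creating (or raise if it fails)
import boto3

cfn = boto3.client("cloudformation")
waiter = cfn.get_waiter("stack_create_complete")
waiter.wait(
    StackName="my-omics-e2e-stack",                 # placeholder: use your stack name
    WaiterConfig={"Delay": 60, "MaxAttempts": 60},  # poll every minute, up to an hour
)
print("Stack creation complete")
```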
Note that in this solution, the FASTQ files need to be named in the following manner:
<sample_name>_R1.fastq.gz
<sample_name>_R2.fastq.gz
This can be adapted to your needs by updating the Python regex in start_workflow_lambda.py.
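For reference, a pattern along these lines would capture the sample name and read number; this is a sketch, and the actual regex in start_workflow_lambda.py may differ:

```python
# Sketch: parse <sample_name>_R1.fastq.gz / <sample_name>_R2.fastq.gz
import re

FASTQ_RE = re.compile(r"^(?P<sample_name>.+)_R(?P<read>[12])\.fastq\.gz$")

m = FASTQ_RE.match("NA1287820K_R1.fastq.gz")
if m:
    print(m.group("sample_name"), m.group("read"))  # -> NA1287820K 1
```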
You can also use example FASTQs provided here to test:
s3://aws-genomics-static-us-east-1/omics-e2e/test_fastqs/NA1287820K_R1.fastq.gz
s3://aws-genomics-static-us-east-1/omics-e2e/test_fastqs/NA1287820K_R2.fastq.gz
To start the pipeline, upload the FASTQ files to the bucket you created for WorkflowInputsBucketName, in a prefix named inputs. This bucket is configured such that FASTQ files uploaded to this prefix trigger, via S3 notifications, a Lambda function that evaluates the inputs and launches the Step Functions workflow. You can monitor the workflow in the AWS Console for Step Functions by navigating to State Machines -> AmazonOmicsEndToEndStepFunction. You should see a running Execution with a Name that begins with "GENOMICS".
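For example, to kick off a test execution you could copy the test FASTQ pair above into the inputs prefix; a sketch with boto3 (the destination bucket name is a placeholder for your WorkflowInputsBucketName value):

```python
# Sketch: server-side copy of the test FASTQ pair into the inputs/ prefix
import boto3

s3 = boto3.client("s3")
for key in ("omics-e2e/test_fastqs/NA1287820K_R1.fastq.gz",
            "omics-e2e/test_fastqs/NA1287820K_R2.fastq.gz"):
    s3.copy(
        {"Bucket": "aws-genomics-static-us-east-1", "Key": key},
        "my-workflow-inputs-bucket",            # placeholder: your WorkflowInputsBucketName
        "inputs/" + key.rsplit("/", 1)[-1],
    )
```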
NOTE
Currently, if both FASTQs are uploaded simultaneously, the Step Functions trigger Lambda has a best-effort mechanism to avoid race conditions: it adds a random delay and checks for a running execution with the same sample name. It's still recommended to check for a duplicate execution as a precaution.
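One way to perform such a duplicate check, sketched with boto3 (the state machine ARN is illustrative, and this assumes the execution name embeds the sample name):

```python
# Sketch: look for a RUNNING execution whose name contains the sample name
import boto3

sfn = boto3.client("stepfunctions")
STATE_MACHINE_ARN = (
    "arn:aws:states:us-east-1:111122223333:"
    "stateMachine:AmazonOmicsEndToEndStepFunction"  # illustrative account ID
)

def has_running_execution(sample_name: str) -> bool:
    running = sfn.list_executions(
        stateMachineArn=STATE_MACHINE_ARN, statusFilter="RUNNING"
    )["executions"]
    return any(sample_name in ex["name"] for ex in running)

print(has_running_execution("NA1287820K"))
```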
Since these steps are asynchronous API calls, the Step Functions workflow uses tasks to poll for completion and move on to the next step on success. The workflow takes about 3 hours to complete with the test FASTQs provided above; run time will vary with the size of the inputs chosen.
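The same polling idea, expressed as a standalone boto3 sketch against an Omics workflow run (in the deployed solution this logic lives in Step Functions task and wait states, not in a Python loop):

```python
# Sketch: poll an Omics workflow run until it reaches a terminal state
import time
import boto3

omics = boto3.client("omics")

def wait_for_run(run_id: str, poll_seconds: int = 300) -> str:
    while True:
        status = omics.get_run(id=run_id)["status"]
        if status in ("COMPLETED", "FAILED", "CANCELLED", "DELETED"):
            return status
        time.sleep(poll_seconds)
```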
Note that if a Step Functions workflow execution fails, you can refer to this blog post for instructions on how to resume a Step Functions workflow from any state: https://aws.amazon.com/blogs/compute/resume-aws-step-functions-from-any-state/
Now that the variants are available in the Omics variant store and the pre-loaded annotations in the Omics annotation store, you can create resource links for them in AWS Lake Formation, grant permissions to the desired users, and query the resulting tables in Amazon Athena to derive insights (see the blog post for instructions on how to grant Lake Formation permissions). Note that for the example notebook, we used genomic data from the example Ovation Dx NAFLD Whole Genome dataset from AWS Data Exchange.
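Once the Lake Formation resource links and permissions are in place, the tables can also be queried programmatically; a minimal Athena sketch (database, table, and output location are all placeholders for the names you created):

```python
# Sketch: run an Athena query against a resource-linked variant store table
import boto3

athena = boto3.client("athena")
resp = athena.start_query_execution(
    QueryString="SELECT * FROM my_variant_table LIMIT 10",       # placeholder table
    QueryExecutionContext={"Database": "my_omics_database"},     # placeholder database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder bucket
)
print("QueryExecutionId:", resp["QueryExecutionId"])
```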
The above solution has deployed several AWS resources as part of the CloudFormation stack. If you choose to clean up the resources created by this solution, you can take the following steps:
This library is licensed under the MIT-0 License. See the LICENSE file.
Nadeem Bulsara | Sr. Solutions Architect - Genomics, BDSI | AWS
Sujaya Srinivasan | Genomics Solutions Architect, WWPS | AWS
David Obenshain | Cloud Application Architect, WWPS Professional Services | AWS
Gargi Singh Chhatwal | Sr. Solutions Architect - Federal Civilian, WWPS | AWS
Joshua Broyde | Sr. AI/ML Solutions Architect, BDSI | AWS