aws-samples / amazon-omics-tutorials

Apache License 2.0

Several potential issues with analytics tutorial related to IAM, region, and Athena setup #6

Closed · danpeck161 closed this 1 year ago

danpeck161 commented 1 year ago

I noticed several potential issues with 200-omics_analytics.ipynb as I was working through it in a SageMaker notebook. Some might be solved by updating the code, while others are just general observations that, I feel, create a poor first impression for customers using these tutorials to explore the Omics service for the first time.

1) I needed to add RAM permissions to the SageMaker role to create the variant store:

response = omics.create_variant_store(
    name=var_store_name, 
    reference={"referenceArn": get_reference_arn(ref_name, omics)}
)

var_store = response
response

The Omics role created at the beginning of the notebook contains the needed RAM permissions, but this command was executed by the SageMaker role. The variant store simply failed to create, and the only way I was able to determine the reason was via CloudTrail or the get-variant-store CLI command; there was no error in the notebook or the Omics console.
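Since create_variant_store returns before the store actually finishes creating, a small polling helper can surface the failure reason directly in the notebook instead of requiring a trip to CloudTrail. This is just a sketch, not part of the tutorial: wait_for_variant_store is a hypothetical name, and it assumes the omics boto3 client from the earlier cells.

```python
import time

def wait_for_variant_store(omics, name, poll_seconds=30):
    """Poll a variant store until it leaves CREATING; raise with the
    statusMessage (e.g. a RAM permission error) if creation failed."""
    while True:
        store = omics.get_variant_store(name=name)
        if store["status"] == "FAILED":
            raise RuntimeError(
                f"Variant store '{name}' failed to create: {store.get('statusMessage')}"
            )
        if store["status"] != "CREATING":
            return store  # e.g. ACTIVE
        time.sleep(poll_seconds)
```

Called as `var_store = wait_for_variant_store(omics, var_store_name)` right after the create call, this would have shown the permission error immediately.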

In a similar vein, I ended up needing to add several IAM permissions for RAM, Glue, Athena, and Lake Formation to my SageMaker role to complete the tutorial within the notebook. Here is the list of permissions I ended up needing to add:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "omics:*",
                "ram:AcceptResourceShareInvitation",
                "ram:GetResourceShareInvitations",
                "ram:GetResourceShares",
                "ram:ListResources",
                "glue:CreateTable",
                "glue:DeleteTable",
                "glue:GetTable",
                "glue:UpdateTable",
                "athena:StartQueryExecution",
                "athena:GetQueryResults",
                "athena:ListWorkGroups",
                "athena:CreateWorkGroup",
                "athena:GetQueryExecution",
                "athena:GetWorkGroup",
                "lakeformation:GetDataAccess"
            ],
            "Resource": "*"
        }
    ]
}
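If you'd rather not edit the role in the IAM console, the policy above can be attached as an inline policy from the notebook. A rough sketch, assuming an iam boto3 client and your SageMaker execution role's name; attach_tutorial_policy and the policy name are hypothetical:

```python
import json

def attach_tutorial_policy(iam, role_name, policy_document,
                           policy_name="OmicsAnalyticsTutorial"):
    """Attach the inline policy above to the SageMaker execution role."""
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        PolicyDocument=json.dumps(policy_document),
    )
```

You would pass the JSON document above as `policy_document`. Note the role needs iam:PutRolePolicy on itself for this to work, which many SageMaker roles won't have; in that case the console route is the fallback.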

2) It's a little confusing that the other two tutorials bring in data from us-east-1 while this one pulls from us-west-2, especially since the storage tutorial's setup process is effectively a prerequisite. It'd be nice if the three tutorials were more in sync in this regard, but as a workaround I performed the following to copy the data into my own S3 bucket in us-east-1:

aws s3 cp s3://1000genomes-dragen/data/precisionFDA/hg38-graph-based/HG002/HG002.hard-filtered.vcf.gz s3://<MY_US-EAST-1-BUCKET>/<MY_FOLDER>/ --acl bucket-owner-full-control
aws s3 cp s3://1000genomes-dragen/data/precisionFDA/hg38-graph-based/HG003/HG003.hard-filtered.vcf.gz s3://<MY_US-EAST-1-BUCKET>/<MY_FOLDER>/ --acl bucket-owner-full-control
aws s3 cp s3://1000genomes-dragen/data/precisionFDA/hg38-graph-based/HG004/HG004.hard-filtered.vcf.gz s3://<MY_US-EAST-1-BUCKET>/<MY_FOLDER>/ --acl bucket-owner-full-control

and once more for the annotation store:

aws s3 cp s3://aws-genomics-datasets/omics-e2e/clinvar.vcf.gz s3://<MY_US-EAST-1-BUCKET>/<MY_FOLDER>/ --acl bucket-owner-full-control

Then I updated the SOURCE_VARIANT_URI and SOURCE_ANNOTATION_URI variables to the new S3 paths:

SOURCE_VARIANT_URI = "s3://<MY_US-EAST-1-BUCKET>/<MY_FOLDER>/"
SOURCE_ANNOTATION_URI = "s3://<MY_US-EAST-1-BUCKET>/<MY_FOLDER>/clinvar.vcf.gz"

3) To create an Athena workgroup with engine version 3, I had to run that cell twice for ['Name'] to be correctly populated in the athena_workgroup variable used later.
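One way to make that cell idempotent is to swallow the already-exists error and always describe the workgroup afterwards, so the variable is populated on the first run. A sketch, assuming an athena boto3 client; get_or_create_workgroup is a hypothetical name:

```python
def get_or_create_workgroup(athena, name="omics"):
    """Create an Athena engine v3 workgroup if missing, then return its
    description so athena_workgroup['Name'] is populated on the first run."""
    try:
        athena.create_work_group(
            Name=name,
            Configuration={
                "EngineVersion": {"SelectedEngineVersion": "Athena engine version 3"}
            },
        )
    except athena.exceptions.InvalidRequestException:
        pass  # workgroup already exists; fall through to describe it
    return athena.get_work_group(WorkGroup=name)["WorkGroup"]
```

Used as `athena_workgroup = get_or_create_workgroup(athena)`, the cell then behaves the same whether the workgroup exists or not.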

4) Once a version 3 workgroup named "omics" is created, I had to manually add a query output location in the console as follows: Athena -> Workgroups (left sidebar) -> omics -> Edit -> Query result configuration -> add an S3 path under Location of query result.

Otherwise, I encountered the following error when trying to run the query in the notebook later: WaiterError: Waiter BucketExists failed
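Those console steps can also be done with a single UpdateWorkGroup call, which keeps the fix in the notebook. A sketch with a hypothetical set_query_output_location helper; the S3 URI would be any bucket you own in the working region:

```python
def set_query_output_location(athena, workgroup, output_s3_uri):
    """Point the workgroup's query results at an S3 location you own,
    avoiding the 'Waiter BucketExists failed' error later on."""
    athena.update_work_group(
        WorkGroup=workgroup,
        ConfigurationUpdates={
            "ResultConfigurationUpdates": {"OutputLocation": output_s3_uri}
        },
    )
```

For example: `set_query_output_location(athena, "omics", "s3://<MY_BUCKET>/athena-results/")`.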

AndersVO commented 1 year ago

Hi Dan

Thanks for raising this issue; it definitely helped me get through the tutorial!

For anyone else who might be stuck creating a reference store that fits with this tutorial, you can use the following:

ref_name = 'GRCh38'

ref_import_job = omics.start_reference_import_job(
    referenceStoreId=get_ref_store_id(omics), 
    roleArn=get_role_arn(omics_iam_name),
    sources=[{
        'sourceFile': "s3://1000genomes-dragen/references/fasta/GRCh38_full_analysis_set_plus_decoy_hla.fa",
        'name': ref_name,
        'tags': {'SourceLocation': '1kg'}
    }])

It uses a sourceFile from the same dataset used in the tutorial, which means you can skip all the S3 copying in Dan's original issue.
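Reference imports also run asynchronously, so the same polling pattern applies before moving on to the variant store. A sketch with a hypothetical wait_for_import_job helper, assuming the omics client plus the store and job ids from the snippet above:

```python
import time

def wait_for_import_job(omics, store_id, job_id, poll_seconds=30):
    """Poll a reference import job until it reaches a terminal state."""
    terminal = {"COMPLETED", "COMPLETED_WITH_FAILURES", "FAILED", "CANCELLED"}
    while True:
        job = omics.get_reference_import_job(
            referenceStoreId=store_id, id=job_id
        )
        if job["status"] in terminal:
            return job
        time.sleep(poll_seconds)
```

For example: `wait_for_import_job(omics, get_ref_store_id(omics), ref_import_job["id"])`.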

Hope it helps.

wleepang commented 1 year ago

Should be resolved in #19