aws-quickstart / quickstart-illumina-dragen

AWS Quick Start Team
Apache License 2.0

Unable to run batch jobs - S3 permission denied #32

Closed: ury closed this issue 3 years ago

ury commented 4 years ago

Hi, I'm posting this issue after a very long session with AWS support. I've recently deployed the Dragen QuickStart stack and failed to execute any batch job. I was following this quick start guide.

The first issue was that jobs were stuck in RUNNABLE state, and instances were not launching at all. AWS support identified the issue in the CloudTrail entry:

{
    "eventVersion": "1.05",
    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "AROA3HVVBYXWLY5UAUGNE:aws-batch",
        "arn": "arn:aws:sts::772400072172:assumed-role/Dragen-Test-DragenStack-OW5XIS9SE-BatchServiceRole-17EM0YXZ7OKY1/aws-batch",
        "accountId": "772400072172",
        "sessionContext": {
            "sessionIssuer": {
                "type": "Role",
                "principalId": "AROA3HVVBYXWLY5UAUGNE",
                "arn": "arn:aws:iam::772400072172:role/Dragen-Test-DragenStack-OW5XIS9SE-BatchServiceRole-17EM0YXZ7OKY1",
                "accountId": "772400072172",
                "userName": "Dragen-Test-DragenStack-OW5XIS9SE-BatchServiceRole-17EM0YXZ7OKY1"
            },
            "webIdFederationData": {},
            "attributes": {
                "mfaAuthenticated": "false",
                "creationDate": "2020-05-29T05:56:26Z"
            }
        },
        "invokedBy": "batch.amazonaws.com"
    },
    "eventTime": "2020-05-29T05:56:26Z",
    "eventSource": "ec2.amazonaws.com",
    "eventName": "RunInstances",
    "awsRegion": "us-east-1",
    "sourceIPAddress": "batch.amazonaws.com",
    "userAgent": "batch.amazonaws.com",
    "errorCode": "Client.InvalidSubnetID.NotFound",
    "errorMessage": "The subnet ID 'subnet-08faea21795a51127,subnet-0e96f8eb226426d35' does not exist",
        "subnetId": "subnet-08faea21795a51127,subnet-0e96f8eb226426d35",
        "disableApiTermination": false,
        "clientToken": "11a2f3c8-bc56-4923-8242-478d96a0127b",
        "iamInstanceProfile": {
            "name": "Dragen-Test-DragenStack-OW5XIS9-DragenInstanceRole-1GQ45N169GNDY"
        }
    },
    "responseElements": null,
    "requestID": "a261ac83-c96c-4b69-b9e6-ff27a1acd598",
    "eventID": "2954e63c-34b4-41ad-961b-616b2a2bb64d",
    "eventType": "AwsApiCall",
    "recipientAccountId": "772400072172"
}

It looks like 2 subnets are being passed for the creation of the instance, though only one subnet should be passed. The support personnel opened a ticket with you, but I'm unaware of its status.

We have deployed a workaround, in which only the first subnet ID is passed, which resolved this issue - instances are launched and jobs are running.
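
The CloudTrail errorMessage above shows both subnet IDs arriving as one comma-joined string in a field that expects a single ID. A minimal Python sketch of the two ways to handle such a value follows: split it into a list (AWS Batch compute environments accept a list of subnet IDs), or keep only the first entry, as in the workaround described here. The helper name split_subnet_ids is hypothetical, and where the joined string originates in the templates is not established in this thread.

```python
def split_subnet_ids(value):
    """Split a comma-joined subnet string into a list of individual IDs."""
    return [part.strip() for part in value.split(",") if part.strip()]


# The exact string reported in the CloudTrail errorMessage.
joined = "subnet-08faea21795a51127,subnet-0e96f8eb226426d35"

subnets = split_subnet_ids(joined)  # pass this where a list of IDs is accepted
first_only = subnets[:1]            # the workaround: keep only the first subnet

print(subnets)
print(first_only)
```

Keeping only the first subnet confines instances to a single Availability Zone, so passing the full list is usually preferable wherever the target field accepts one.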

The next step for me was to create the reference hash table (HT) files. I made multiple attempts to execute this, but always got S3 errors (either "access denied" or "not found", depending on the type of URL I used for the reference file: s3:// or https://). Together with AWS support personnel, I double-checked the permissions of the stack roles, and they seem to be OK. I can attach the policies here, but I'm pretty sure the problem isn't there. I also tried public S3 bucket URLs for the reference files.

Following are the error messages received from the Dragen log files, depending on the command used (btw, I didn't find any example of an actual dragen --build-hash-table true command that would let me tell when an S3 URL is expected/supported and when I should use https://):

ERROR: Cannot read reference FASTA file s3://dragen-test-bucket/hg38.fa: Permission denied

ERROR: Reference file https://dragen-test-bucket.s3.amazonaws.com/hg38.fa does not exist

ERROR: Cannot read reference FASTA file https://broad-references.s3.amazonaws.com/hg38/v0/GRCh38.primary_assembly.genome.fa: Permission denied

ERROR: Assertion failed in /data/jenkins/workspace/dragen_release_3.5/src/host/infra/storage/infra_filesystem_utils.cpp line 596 -- Bucket broad-references/hg38/v0 error retrieving location
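
The assertion above names the bucket as "broad-references/hg38/v0", which suggests that part of the object path was treated as the bucket name. As an illustration only, and not DRAGEN's actual parsing, here is a minimal Python sketch that reduces both URL styles used in these attempts to a (bucket, key) pair. The function name split_s3_url is hypothetical, and the sketch assumes virtual-hosted-style https URLs and bucket names that do not themselves contain ".s3".

```python
from urllib.parse import urlparse


def split_s3_url(url):
    """Return (bucket, key) for s3:// and virtual-hosted-style https:// URLs."""
    parsed = urlparse(url)
    if parsed.scheme == "s3":
        # s3://bucket/key: the bucket is the host part.
        return parsed.netloc, parsed.path.lstrip("/")
    host = parsed.netloc
    if parsed.scheme == "https" and ".s3" in host and host.endswith(".amazonaws.com"):
        # https://bucket.s3[.-region].amazonaws.com/key
        return host.split(".s3", 1)[0], parsed.path.lstrip("/")
    raise ValueError("not a recognized S3 URL: " + url)


for url in (
    "s3://dragen-test-bucket/hg38.fa",
    "https://dragen-test-bucket.s3.amazonaws.com/hg38.fa",
    "https://broad-references.s3.amazonaws.com/hg38/v0/GRCh38.primary_assembly.genome.fa",
):
    print(split_s3_url(url))
```

Printing the pair for each URL makes it easy to confirm that the bucket never includes a path segment before the URL is handed to another tool.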

It is worth mentioning that the Dragen log files are written to the bucket specified in the Dragen stack, so write access is certainly working.

AWS support reproduced the issue in their environment, as the log file shows:

2020-06-01T11:05:07.501+05:30 DRAGEN finished normally
2020-06-01T11:05:07.504+05:30 Completed Partial Reconfig for FPGA
2020-06-01T11:05:07.504+05:30 Executing /opt/edico/bin/dragen_reset -cv
2020-06-01T11:05:07.891+05:30 Output directory does not exist - creating /ephemeral/423ca407-93be-4b2f-bea9-4ef56796bf93
2020-06-01T11:05:07.891+05:30 Executing /opt/edico/bin/dragen --output_status_file /ephemeral/423ca407-93be-4b2f-bea9-4ef56796bf93/job-speedometer.log --intermediate-results-dir /ephemeral/ --lic-no-print > /ephemeral/423ca407-93be-4b2f-bea9-4ef56796bf93/dragen_log_1590989708.txt 2>&1
2020-06-01T11:05:08.012+05:30 Error: Output S3 location not specified!
2020-06-01T11:05:08.012+05:30 Removing Output dir /ephemeral/423ca407-93be-4b2f-bea9-4ef56796bf93
2020-06-01T11:05:08.014+05:30 Job is exiting with code 3
2020-06-01T11:05:08.014+05:30 Caught SystemExit: Exiting with status 3 ==========

Since we modified the stack templates to make it work, we might have caused this error to be generated, though I don't see how.

I would be more than happy to provide any additional information you require in order to investigate this.

vsnyc commented 4 years ago

@ury - Hi Ury, thanks for the investigation on the subnet ID, I'll work with Illumina to get that added in.

The S3 issue needs a deeper dive. There was a similar issue reported in https://github.com/aws-quickstart/quickstart-illumina-dragen/issues/28#issuecomment-614314312, but the customer was able to resolve the S3 access issue by themselves.

In your snippet, I see two S3 buckets mentioned: dragen-test-bucket, broad-references - what did you use for the GenomicsS3Bucket parameter?

Also, can you attach the job definition JSON file that you were passing to the AWS Batch submit-job API?

vsnyc commented 4 years ago

Regarding the usage of dragen --build-hash-table true: as far as I recall, having the hash table reference is a prerequisite to running the batch jobs. The build-hash-table command itself doesn't take files from S3.

See the User guide for an example.

dragen --build-hash-table true --ht-reference /staging/human/reference/hg19/hg19.fa \
--output-dir /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
--ht-alt-liftover /opt/edico/liftover/hg19_alt_liftover.sam

@partha-edico - could you please confirm?

ury commented 4 years ago

Thanks @vsnyc

  1. dragen-test-bucket is my GenomicsS3Bucket. broad-references is a public S3 bucket from Broad Institute, providing hg38 and other reference files.
  2. Regarding the dragen --build-hash-table true command, I was referring to it in the AWS Quick Start context, with a reference file residing in an S3 bucket. In the example you provided, the reference file is local.
  3. Following is the job definition created by the Dragen Quick Start stack:
{
    "jobDefinitionName": "dragen",
    "jobDefinitionArn": "arn:aws:batch:us-east-1:772400072172:job-definition/dragen:2",
    "revision": 2,
    "status": "ACTIVE",
    "type": "container",
    "parameters": {},
    "retryStrategy": {
        "attempts": 1
    },
    "containerProperties": {
        "image": "772400072172.dkr.ecr.us-east-1.amazonaws.com/drage-drage-hgsehqlagf6s:dragen",
        "vcpus": 8,
        "memory": 120000,
        "command": [],
        "jobRoleArn": "arn:aws:iam::772400072172:role/Dragen-Test-2-DragenStack-QK610ETVYO-DragenJobRole-135PARIDCHT3F",
        "volumes": [
            {
                "host": {
                    "sourcePath": "/scratch"
                },
                "name": "docker_scratch"
            },
            {
                "host": {
                    "sourcePath": "/ephemeral"
                },
                "name": "docker_ephemeral"
            },
            {
                "host": {
                    "sourcePath": "/opt/edico"
                },
                "name": "docker_opt_edico"
            },
            {
                "host": {
                    "sourcePath": "/var/lib/edico"
                },
                "name": "docker_var_lib_edico"
            }
        ],
        "environment": [],
        "mountPoints": [
            {
                "containerPath": "/scratch",
                "readOnly": false,
                "sourceVolume": "docker_scratch"
            },
            {
                "containerPath": "/ephemeral",
                "readOnly": false,
                "sourceVolume": "docker_ephemeral"
            },
            {
                "containerPath": "/opt/edico",
                "readOnly": false,
                "sourceVolume": "docker_opt_edico"
            },
            {
                "containerPath": "/var/lib/edico",
                "readOnly": false,
                "sourceVolume": "docker_var_lib_edico"
            }
        ],
        "ulimits": [],
        "resourceRequirements": []
    }
}

ajfriedman18 commented 4 years ago

@ury what is your submit-job command? Can you also describe your compute environment?

ury commented 4 years ago

@ajfriedman18 I tried many variations of the build-hash-table command, mainly trying various s3 and https URLs for the hg38.fa reference file (the --ht-reference parameter value). I was using the on-demand compute environment generated by the quick start stack, without any modifications.

ajfriedman18 commented 4 years ago

@ury feel free to email me at ajfriedm [at] amazon [.] com

There are several things I'll want to step through, but that's probably better suited to email than to this issue. I can summarize the root cause here once we have determined it.

partha-edico commented 4 years ago

Hi folks! Unfortunately, hash-table generation is not currently supported in the Dragen Quickstart batch scripts. The reason is that this is a rather infrequent process that does not benefit from running as a batch job. The current recommendation is to run it on a manually created EC2 F1 instance using the Dragen AMI, and then recursively upload the resulting output directory to S3. It can then be used by subsequent Dragen jobs launched with Quickstart. Hope that helps, but if you have any questions, let me know. Thanks -partha
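
The recommended two-step flow (build locally on the F1 instance, then upload the output directory recursively) can be sketched in Python as a dry run that only assembles and prints the commands. The dragen flags are the ones from the user guide example quoted earlier in this thread, and aws s3 cp with --recursive is the standard AWS CLI way to copy a directory. The helper names and the example-genomics-bucket destination are hypothetical; nothing is executed.

```python
import shlex


def build_hash_table_cmd(reference_fasta, output_dir):
    """argv for a local hash-table build, using flags quoted in this thread."""
    return [
        "dragen", "--build-hash-table", "true",
        "--ht-reference", reference_fasta,
        "--output-dir", output_dir,
    ]


def upload_cmd(local_dir, s3_prefix):
    """argv for recursively copying a local directory to S3 with the AWS CLI."""
    return ["aws", "s3", "cp", local_dir, s3_prefix, "--recursive"]


# Local paths follow the user guide example; the bucket name is a placeholder.
out_dir = "/staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149"
steps = [
    build_hash_table_cmd("/staging/human/reference/hg19/hg19.fa", out_dir),
    upload_cmd(out_dir, "s3://example-genomics-bucket/references/hg19/"),
]

for argv in steps:
    print(shlex.join(argv))  # dry run: print the command instead of running it
```

To actually run the steps on the instance, each argv list could be passed to subprocess.run(argv, check=True) in order.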

vsnyc commented 4 years ago

I'd like to revisit the point raised about the subnets:

It looks like 2 subnets are being passed for the creation of the instance, though only one subnet should be passed. The support personnel opened a ticket with you, but I'm unaware of its status. We have deployed a workaround, in which only the first subnet ID is passed, which resolved this issue - instances are launched and jobs are running.

I tested the default templates and the subnet configuration is correct. I don't know how you ran into the error you described, but I tested the Quick Start in a new VPC and the jobs started as expected. AWS Batch does take a list of subnet IDs.

I tested with the following job submission request:

{
    "jobName": "build-hash-table1",
    "jobQueue": "dragen-queue",
    "jobDefinition": "dragen",
    "containerOverrides": {
        "vcpus": 8,
        "memory": 120000,
        "command": [
            "--build-hash-table true", 
            "--ht-reference",
            "s3://vsnyc-dragen-test-us-west-2/staging/reference/upstream1000.fa",
            "--output-dir",
            "s3://vsnyc-dragen-test-us-west-2/staging/reference/hg19/hg19.fa",
            "–ht-alt-aware-validate=false"
        ]
    },
    "retryStrategy": {
        "attempts": 1
    }
}

Since the --build-hash-table command is not supported, I did get an expected failure, but the job did run.


| timestamp     | message |
|---------------|---------|
| 1591302591103 | [DEBUG] Dragen input commands: --build-hash-table true --ht-reference https://vsnyc-dragen-test-us-west-2.s3-us-west-2.amazonaws.com/staging/reference/upstream1000.fa --output-dir s3://vsnyc-dragen-test-us-west-2/staging/reference/hg19/hg19.fa –ht-alt-aware-validate=false |
| 1591302591103 | Setting resource 3 to 10485760 |
| 1591302591103 | Setting resource 6 to 16384 |
| 1591302591103 | Setting resource 7 to 65535 |
| 1591302591103 | Downloading reference files |
| 1591302591103 | Warning: No reference HT directory URL specified! |
| 1591302591103 | Downloading misc inputs (csv, bed) |
| 1591302591103 | Run Analysis job |
| 1591302591103 | Executing /opt/edico/bin/dragen --partial-reconfig DNA-MAPPER --ignore-version-check true -Z 0 |
| 1591302591190 | Command Line: /opt/edico/bin/dragen --partial-reconfig DNA-MAPPER --ignore-version-check true -Z 0 |
| 1591302591198 | DRAGEN Host Software Version 05.021.510.3.5.7 and Bio-IT Processor Version 0x04261818 |
| 1591302591198 | Generating run log at /var/log/dragen/dragen_run_1591302591189_11.log |
| 1591302591202 | AutoDetected reference: UNKNOWN |
| 1591302591212 | INFO: AGFI currently loaded agfi-03b3cf29b824918ee |
| 1591302591212 | ================================================================== |
| 1591302591212 | Downloading DNA Map/Align (public) HW bitstream (agfi-03eaf3cf5c9811bcc) - do not interrupt |
| 1591302591212 | ================================================================== |
| 1591302591212 | WARNING: Bypassing bitstream version check! Currently loaded version: 0x05021507 |
| 1591302593891 | .. |
| 1591302593891 | AGFI: Downloaded HW bitstream agfi-03eaf3cf5c9811bcc |
| 1591302593893 | RUN TIME   Time partial reconfiguration   00:00:02.679   2.68 |
| 1591302593893 | RUN TIME   Total runtime   00:00:02.706   2.71 |
| 1591302593893 | ================================================================== |
| 1591302593894 | DRAGEN finished normally |
| 1591302593897 | Completed Partial Reconfig for FPGA |
| 1591302593897 | Executing /opt/edico/bin/dragen_reset -cv |
| 1591302594534 | Output directory does not exist - creating /ephemeral/b8e38abe-935a-4564-a963-e44f76cdeb9d |
| 1591302594534 | Unhandled exception in dragen_qs: <type 'exceptions.UnicodeDecodeError'> |
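
In the logged command, the final argument begins with an en dash (U+2013) rather than two ASCII hyphens. Whether that character is what triggers the UnicodeDecodeError is an assumption and is not confirmed anywhere in this thread, but checking job arguments for non-ASCII characters before submitting is cheap. A minimal Python sketch follows; the helper name find_non_ascii_args and the example-bucket URLs are hypothetical.

```python
def find_non_ascii_args(args):
    """Return (index, argument, offending characters) for non-ASCII arguments."""
    problems = []
    for index, arg in enumerate(args):
        bad = sorted({ch for ch in arg if ord(ch) > 127})
        if bad:
            problems.append((index, arg, bad))
    return problems


# Same shape as the command in the log, one token per list element.
command = [
    "--build-hash-table", "true",
    "--ht-reference", "s3://example-bucket/reference/upstream1000.fa",
    "--output-dir", "s3://example-bucket/reference/out",
    "\u2013ht-alt-aware-validate=false",  # starts with an en dash, as in the log
]

for index, arg, chars in find_non_ascii_args(command):
    print(index, ascii(arg), [hex(ord(ch)) for ch in chars])
```

Characters like this often come from copying a command out of a word processor or a rendered web page, where two hyphens are silently replaced by a single dash.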