Closed: ury closed this issue 3 years ago.
@ury - Hi Ury, thanks for the investigation on the subnet ID. I'll work with Illumina to get that added in.
The S3 issue needs a deeper dive. There was a similar issue reported in https://github.com/aws-quickstart/quickstart-illumina-dragen/issues/28#issuecomment-614314312, but the customer was able to resolve the S3 access issue by themselves.
In your snippet, I see two S3 buckets mentioned: dragen-test-bucket and broad-references. What did you use for the GenomicsS3Bucket parameter?
Also, can you attach the job definition JSON file that you were passing to the submit-job API of AWS Batch?
Regarding the usage of dragen --build-hash-table true: as far as I recall, having the hash table reference is a prerequisite to running the batch jobs. The build-hash-table command itself doesn't take files from S3. See the User Guide for an example:
dragen --build-hash-table true --ht-reference /staging/human/reference/hg19/hg19.fa \
--output-dir /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
--ht-alt-liftover /opt/edico/liftover/hg19_alt_liftover.sam
@partha-edico - could you please confirm?
Thanks @vsnyc
Regarding the dragen --build-hash-table true command, I was referring to it in the AWS Quick Start context, with a reference file residing on an S3 bucket. In the example you provided, the reference file is local.

Here is the job definition:

{
"jobDefinitionName": "dragen",
"jobDefinitionArn": "arn:aws:batch:us-east-1:772400072172:job-definition/dragen:2",
"revision": 2,
"status": "ACTIVE",
"type": "container",
"parameters": {},
"retryStrategy": {
"attempts": 1
},
"containerProperties": {
"image": "772400072172.dkr.ecr.us-east-1.amazonaws.com/drage-drage-hgsehqlagf6s:dragen",
"vcpus": 8,
"memory": 120000,
"command": [],
"jobRoleArn": "arn:aws:iam::772400072172:role/Dragen-Test-2-DragenStack-QK610ETVYO-DragenJobRole-135PARIDCHT3F",
"volumes": [
{
"host": {
"sourcePath": "/scratch"
},
"name": "docker_scratch"
},
{
"host": {
"sourcePath": "/ephemeral"
},
"name": "docker_ephemeral"
},
{
"host": {
"sourcePath": "/opt/edico"
},
"name": "docker_opt_edico"
},
{
"host": {
"sourcePath": "/var/lib/edico"
},
"name": "docker_var_lib_edico"
}
],
"environment": [],
"mountPoints": [
{
"containerPath": "/scratch",
"readOnly": false,
"sourceVolume": "docker_scratch"
},
{
"containerPath": "/ephemeral",
"readOnly": false,
"sourceVolume": "docker_ephemeral"
},
{
"containerPath": "/opt/edico",
"readOnly": false,
"sourceVolume": "docker_opt_edico"
},
{
"containerPath": "/var/lib/edico",
"readOnly": false,
"sourceVolume": "docker_var_lib_edico"
}
],
"ulimits": [],
"resourceRequirements": []
}
}
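As an aside, one quick sanity check that can be run on a job definition like the one above is to confirm that every mount point references a declared volume. This is only a sketch; it embeds just the volumes/mountPoints fragment shown above, not the full definition:

```python
# Minimal check: every mountPoints.sourceVolume in an AWS Batch job
# definition should match the name of a declared volume.
container_properties = {
    "volumes": [
        {"host": {"sourcePath": "/scratch"}, "name": "docker_scratch"},
        {"host": {"sourcePath": "/ephemeral"}, "name": "docker_ephemeral"},
        {"host": {"sourcePath": "/opt/edico"}, "name": "docker_opt_edico"},
        {"host": {"sourcePath": "/var/lib/edico"}, "name": "docker_var_lib_edico"},
    ],
    "mountPoints": [
        {"containerPath": "/scratch", "readOnly": False, "sourceVolume": "docker_scratch"},
        {"containerPath": "/ephemeral", "readOnly": False, "sourceVolume": "docker_ephemeral"},
        {"containerPath": "/opt/edico", "readOnly": False, "sourceVolume": "docker_opt_edico"},
        {"containerPath": "/var/lib/edico", "readOnly": False, "sourceVolume": "docker_var_lib_edico"},
    ],
}

declared = {v["name"] for v in container_properties["volumes"]}
referenced = {m["sourceVolume"] for m in container_properties["mountPoints"]}
missing = referenced - declared
print("missing volumes:", sorted(missing))  # → missing volumes: []
```

The definition above passes this check, so the mount configuration itself looks consistent.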
@ury What is your submit-job command? Can you also describe your compute environment?
@ajfriedman18 I tried many variations of the build-hash-table command, mainly trying various s3:// and https:// URLs for the hg38.fa reference file (the --ht-reference parameter value). I was using the on-demand compute environment generated by the Quick Start stack, without any modifications.
@ury feel free to email me at ajfriedm [at] amazon [.] com
There are several things I'll want to step through, but that's probably better served over email than in this issue. I can summarize the root cause here once we've determined it.
Hi folks! Unfortunately, hash-table generation is not currently supported in the Dragen Quick Start batch scripts. The reason is that this is a rather infrequent process that does not benefit from running as a batch job. The current recommendation is to run it on a manually created EC2 F1 instance using the Dragen AMI, and then recursively upload the resulting output directory to S3. It can then be used by subsequent Dragen jobs launched with the Quick Start. Hope that helps, but if you have any questions let me know. Thanks -partha
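To make the "recursively upload" step concrete: `aws s3 cp --recursive <dir> s3://<bucket>/<prefix>` does this from the shell. The sketch below shows the equivalent logic in Python, assuming hypothetical bucket and path names; only the path-mapping helper is pure, while the upload function requires AWS credentials and boto3:

```python
import os

def s3_keys_for_directory(local_dir, prefix):
    """Map every file under local_dir to its destination S3 key,
    preserving the directory structure (like aws s3 cp --recursive)."""
    mapping = {}
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            path = os.path.join(root, name)
            rel = os.path.relpath(path, local_dir)
            mapping[path] = prefix.rstrip("/") + "/" + rel.replace(os.sep, "/")
    return mapping

def upload_directory(local_dir, bucket, prefix):
    """Upload a hash-table output directory to S3 (needs credentials)."""
    import boto3  # not exercised in the dry run below
    s3 = boto3.client("s3")
    for path, key in s3_keys_for_directory(local_dir, prefix).items():
        s3.upload_file(path, bucket, key)

# Hypothetical usage after building the hash table on the F1 instance:
# upload_directory("/staging/human/reference/hg38/hg38.fa.k_21.f_16.m_149",
#                  "my-genomics-bucket",
#                  "reference/hg38.fa.k_21.f_16.m_149")
```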
I'd like to revisit the point raised about the subnets:
It looks like 2 subnets are being passed for the creation of the instance, though only one subnet should be passed. The support personnel opened a ticket with you, but I'm unaware of its status. We have deployed a workaround, in which only the first subnet ID is passed, which resolved this issue - instances are launched and jobs are running.
I tested the default templates and the subnet configuration is correct. I don't know how you ran into the error you described, but I tested the Quick Start in a new VPC and the jobs started as expected. AWS Batch does take a list of subnet IDs.
I tested with the following submit-job request:
{
"jobName": "build-hash-table1",
"jobQueue": "dragen-queue",
"jobDefinition": "dragen",
"containerOverrides": {
"vcpus": 8,
"memory": 120000,
"command": [
"--build-hash-table true",
"--ht-reference",
"s3://vsnyc-dragen-test-us-west-2/staging/reference/upstream1000.fa",
"--output-dir",
"s3://vsnyc-dragen-test-us-west-2/staging/reference/hg19/hg19.fa",
"–ht-alt-aware-validate=false"
]
},
"retryStrategy": {
"attempts": 1
}
}
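Two tokens in that command array are worth calling out (an observation, not a confirmed root cause). First, "--build-hash-table true" is a single array element, and Batch passes each element through to the container as one argument, so the binary receives a single token containing a space rather than a flag plus a value. Second, the ht-alt-aware-validate element begins with a Unicode en dash rather than two ASCII hyphens, which a copy-paste from rendered text can easily introduce. A quick check:

```python
# The command array from the submit-job payload above.
command = [
    "--build-hash-table true",          # flag and value fused into one element
    "--ht-reference",
    "s3://vsnyc-dragen-test-us-west-2/staging/reference/upstream1000.fa",
    "--output-dir",
    "s3://vsnyc-dragen-test-us-west-2/staging/reference/hg19/hg19.fa",
    "\u2013ht-alt-aware-validate=false",  # starts with U+2013 (en dash), not "--"
]

# Each list element reaches the container as ONE argv entry, so the
# first element is a single argument containing a space:
assert " " in command[0]

# The last element does not start with ASCII hyphens at all:
assert not command[-1].startswith("--")
print("last token first char:", hex(ord(command[-1][0])))  # → 0x2013
```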
Since the --build-hash-table command is not supported, I did get an expected failure, but the job did run.
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| timestamp | message |
|---------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 1591302591103 | [DEBUG] Dragen input commands: --build-hash-table true --ht-reference https://vsnyc-dragen-test-us-west-2.s3-us-west-2.amazonaws.com/staging/reference/upstream1000.fa --output-dir s3://vsnyc-dragen-test-us-west-2/staging/reference/hg19/hg19.fa –ht-alt-aware-validate=false |
| 1591302591103 | Setting resource 3 to 10485760 |
| 1591302591103 | Setting resource 6 to 16384 |
| 1591302591103 | Setting resource 7 to 65535 |
| 1591302591103 | Downloading reference files |
| 1591302591103 | Warning: No reference HT directory URL specified! |
| 1591302591103 | Downloading misc inputs (csv, bed) |
| 1591302591103 | Run Analysis job |
| 1591302591103 | Executing /opt/edico/bin/dragen --partial-reconfig DNA-MAPPER --ignore-version-check true -Z 0 |
| 1591302591190 | Command Line: /opt/edico/bin/dragen --partial-reconfig DNA-MAPPER --ignore-version-check true -Z 0 |
| 1591302591198 | DRAGEN Host Software Version 05.021.510.3.5.7 and Bio-IT Processor Version 0x04261818 |
| 1591302591198 | Generating run log at /var/log/dragen/dragen_run_1591302591189_11.log |
| 1591302591202 | AutoDetected reference: UNKNOWN |
| 1591302591212 | INFO: AGFI currently loaded agfi-03b3cf29b824918ee |
| 1591302591212 | ================================================================== |
| 1591302591212 | Downloading DNA Map/Align (public) HW bitstream (agfi-03eaf3cf5c9811bcc) - do not interrupt |
| 1591302591212 | ================================================================== |
| 1591302591212 | WARNING: Bypassing bitstream version check! Currently loaded version: 0x05021507 |
| 1591302593891 | .. |
| 1591302593891 | AGFI: Downloaded HW bitstream agfi-03eaf3cf5c9811bcc |
| 1591302593893 | RUN TIME Time partial reconfiguration 00:00:02.679 2.68 |
| 1591302593893 | RUN TIME Total runtime 00:00:02.706 2.71 |
| 1591302593893 | ================================================================== |
| 1591302593894 | DRAGEN finished normally |
| 1591302593897 | Completed Partial Reconfig for FPGA |
| 1591302593897 | Executing /opt/edico/bin/dragen_reset -cv |
| 1591302594534 | Output directory does not exist - creating /ephemeral/b8e38abe-935a-4564-a963-e44f76cdeb9d |
| 1591302594534 | Unhandled exception in dragen_qs: <type 'exceptions.UnicodeDecodeError'> |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
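One plausible explanation for the final log line, offered purely as a guess: the `<type 'exceptions.UnicodeDecodeError'>` spelling indicates the dragen_qs wrapper runs under Python 2, and the en dash in the `–ht-alt-aware-validate=false` token (visible in the DEBUG line above) is a three-byte UTF-8 sequence. In Python 2, implicitly decoding such bytes as ASCII raises exactly this exception. The Python 3 equivalent of the failing decode:

```python
# Speculative reproduction: the en dash encodes to three UTF-8 bytes,
# which an ASCII decode (implicit in Python 2 str/unicode mixing) rejects.
en_dash_bytes = "\u2013ht-alt-aware-validate=false".encode("utf-8")
print(en_dash_bytes[:3])  # → b'\xe2\x80\x93'

try:
    en_dash_bytes.decode("ascii")
except UnicodeDecodeError as exc:
    print("decode fails:", exc.reason)  # → decode fails: ordinal not in range(128)
```

If that is the trigger, resubmitting the job with plain ASCII `--ht-alt-aware-validate=false` would be a cheap way to rule it in or out.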
Hi, I'm posting this issue after a very long session with AWS support. I've recently deployed the Dragen QuickStart stack and failed to execute any batch job. I was following this quick start guide.
The first issue was that jobs were stuck in RUNNABLE state, and instances were not launching at all. AWS support identified the issue in the CloudTrail entry:
It looks like 2 subnets are being passed for the creation of the instance, though only one subnet should be passed. The support personnel opened a ticket with you, but I'm unaware of its status.
We have deployed a workaround, in which only the first subnet ID is passed, which resolved this issue - instances are launched and jobs are running.
The next step for me was to create the reference HT files. I have made multiple attempts to execute this, but always got S3 errors (either "access denied" or "not found", depending on the type of URL I used for the reference file - s3:// or https://). I double-checked the stack's role permissions together with AWS support personnel, and they seem to be OK. I can attach the policies here, but I'm pretty sure the problem isn't there. I also tried public S3 bucket URLs for the reference files.
Following are error messages received from the Dragen log files, depending on the command used (by the way, I didn't find any example of an actual dragen --build-hash-table true command, so I can't tell where an S3 URL is expected/supported and where I should use https://).
It is worth mentioning that the Dragen log files are written to the bucket specified in the Dragen stack, so write access is certainly working.
AWS support reproduced the issue in their environment, as the log file shows:
Since we modified the stack templates to make it work, we might have caused this error to be generated, though I don't see how.
I would be more than happy to provide any additional information you require in order to investigate this.