billingross / genomics-england-challenge

Technical challenge for Genomics England bioinformatics engineer role
MIT License

Create sample WDL to run on AWS Health Omics #7

Open billingross opened 1 month ago

billingross commented 1 month ago

Sample workflow to list all the samples in a VCF:

version 1.0

workflow list_vcf_samples {
    input {
        File input_vcf
    }

    call ListSamples {
        input:
            input_vcf = input_vcf
    }

    output {
        File output_file = ListSamples.output_file
    }
}

task ListSamples {
    input {
        File input_vcf
    }

    # Name the output after the VCF's file name, not its full localized path.
    String output_file_name = basename(input_vcf) + ".samples.txt"

    command {
        bcftools query -l ${input_vcf} > ${output_file_name}
    }
    output {
        File output_file = "${output_file_name}"
    }
    runtime {
        docker: "public.ecr.aws/biocontainers/bcftools:1.20--h8b25389_0"
        memory: "2 GiB"
        cpu: 2
    }
}
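
For checking the task output by hand: the sample names that bcftools query -l prints are the columns after FORMAT on the #CHROM header line. A minimal sketch that lists them without bcftools, assuming GNU gzip and awk are available (the function name is my own, not part of the workflow):

```shell
# List sample names from a VCF (plain text or gzip/bgzip-compressed).
# Sample names are the tab-separated columns after FORMAT (column 9)
# on the "#CHROM" header line.
list_vcf_samples() {
    # gzip -cdf decompresses gzip input and copies plain text through unchanged.
    gzip -cdf < "$1" |
        awk -F'\t' '/^#CHROM/ { for (i = 10; i <= NF; i++) print $i; exit }'
}
```

Running list_vcf_samples on the input VCF and diffing the result against the workflow's .samples.txt output is a quick way to confirm the run did what was expected.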
billingross commented 1 month ago

Locally creating zipped workflow definition:

zip list-vcf-samples.zip workflow.wdl 
billingross commented 1 month ago

Creating parameter template file:

{
  "input_vcf": {
    "description": "Path to the input VCF stored on S3.",
    "optional": false
  },
  "aws_region": {
    "description": "AWS Region (e.g. us-east-1). Must match where the source data is located and where the workflow is executed.",
    "optional": true
  }
}
billingross commented 1 month ago

Sample VCF FTP address:

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20201028_3202_raw_GT_with_annot/20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chrY.recalibrated_variants.vcf.gz
billingross commented 1 month ago

Writing VCF to S3
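
One way to do this without a local copy is to stream the download straight into S3. A sketch only, assuming curl and a configured AWS CLI; the function names, bucket, and prefix are placeholders of mine:

```shell
# Build the destination URI for a source URL: s3://<bucket>/<prefix>/<file name>.
# Usage: s3_uri_for <source url> <bucket> <prefix>
s3_uri_for() {
    printf 's3://%s/%s/%s\n' "$2" "$3" "${1##*/}"
}

# Stream the file from the source URL into S3.
# "aws s3 cp - <uri>" reads the object body from standard input.
# Usage: stage_to_s3 <source url> <bucket> <prefix>
stage_to_s3() {
    curl -fsS "$1" | aws s3 cp - "$(s3_uri_for "$1" "$2" "$3")"
}
```

For example, stage_to_s3 with the FTP address above, a bucket such as my-bucket, and a prefix such as vcf would write the file to s3://my-bucket/vcf/ under its original file name.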

billingross commented 1 month ago

Workflow failed after about 11 minutes. Engine log:

<Error>
<Code>AccessDenied</Code>
<Message>Access Denied</Message>
<RequestId>24ADHSRBQFXJBBP0</RequestId>
<HostId>7DYIoZze4HhxUjeyCGiZHRl5fmizfPO2jsEZYl8na+B0qozYR7URWb6gVf0qUCgzpYPxU6t+mhQ=</HostId>
</Error>

From this doc:

HealthOmics does not support access to public containers.

That could be the issue. I will try copying the image to my own private repository.

Update: the AccessDenied error came from my attempt to read the logs object in S3; it was not the content of the logs.

billingross commented 1 month ago

In order to add a (bcftools) image to my private repository I need to:

Instructions for doing so are here.
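
For reference, the usual Docker and AWS CLI sequence for copying a public image into a private ECR repository looks roughly like this. It is a sketch under my own assumptions (Docker and a configured AWS CLI installed; function names, repository name, and tag are placeholders), not a transcript of the linked instructions:

```shell
# Registry host of a private ECR registry: <account>.dkr.ecr.<region>.amazonaws.com
# Usage: ecr_registry <account id> <region>
ecr_registry() {
    printf '%s.dkr.ecr.%s.amazonaws.com\n' "$1" "$2"
}

# Copy a public image into a private ECR repository.
# Usage: mirror_image <source image> <account id> <region> <repository> <tag>
mirror_image() {
    src="$1"
    registry="$(ecr_registry "$2" "$3")"
    dest="$registry/$4:$5"

    # Authenticate Docker against the private registry.
    aws ecr get-login-password --region "$3" |
        docker login --username AWS --password-stdin "$registry" || return 1

    # Create the repository only if it does not exist yet.
    aws ecr describe-repositories --region "$3" --repository-names "$4" >/dev/null 2>&1 ||
        aws ecr create-repository --region "$3" --repository-name "$4" >/dev/null || return 1

    # Pull the public image, retag it for the private registry, and push it.
    docker pull "$src" && docker tag "$src" "$dest" && docker push "$dest"
}
```

Example call, with a placeholder account ID: mirror_image public.ecr.aws/biocontainers/bcftools:1.20--h8b25389_0 123456789012 us-east-1 bcftools 1.20. The docker value in the WDL runtime block would then point at the resulting private image URI.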

billingross commented 1 month ago

Need to set up the AWS CLI in order to push images from my local machine to ECR. Found a Stack Overflow article.

billingross commented 1 month ago

Need to grant HealthOmics permissions to ECR: https://docs.aws.amazon.com/omics/latest/dev/workflows-ecr.html#permissions-ecr
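
A sketch of what that grant can look like: a repository policy that allows the omics.amazonaws.com service principal to pull images, attached with aws ecr set-repository-policy. The policy reflects my reading of the linked page and should be checked against it; the function names and Sid are my own:

```shell
# Write an ECR repository policy that lets the HealthOmics service pull images.
# Usage: write_omics_ecr_policy <output path>
write_omics_ecr_policy() {
    cat > "$1" <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "OmicsWorkflowAccess",
      "Effect": "Allow",
      "Principal": { "Service": "omics.amazonaws.com" },
      "Action": [
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage",
        "ecr:BatchCheckLayerAvailability"
      ]
    }
  ]
}
EOF
}

# Attach the policy to a repository.
# Usage: grant_omics_pull <repository> <region>
grant_omics_pull() {
    policy="$(mktemp)"
    write_omics_ecr_policy "$policy"
    aws ecr set-repository-policy --region "$2" --repository-name "$1" \
        --policy-text "file://$policy"
    status=$?
    rm -f "$policy"
    return "$status"
}
```

For example, grant_omics_pull bcftools us-east-1 would apply the policy to a repository named bcftools (a placeholder name).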

NOTE: If you provide the wrong image path (for instance, the repository path instead of the image path), you will still get a permissions error.
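
Since the error message does not distinguish the two cases, a quick local check before submitting a run can help. A minimal sketch (the helper name is hypothetical) that accepts a reference only when its last path component carries a tag or digest:

```shell
# Succeeds when the reference names an image (has a :tag or @digest after the
# repository path) and fails for a bare repository URI.
# Usage: is_image_uri <reference>
is_image_uri() {
    last="${1##*/}"    # last path component, e.g. "bcftools:1.20"
    case "$last" in
        *:*|*@*) return 0 ;;
        *)       return 1 ;;
    esac
}
```

With a placeholder account ID, 123456789012.dkr.ecr.us-east-1.amazonaws.com/bcftools:1.20 passes the check, while 123456789012.dkr.ecr.us-east-1.amazonaws.com/bcftools (the repository path) does not.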