cornhundred closed this issue 6 years ago
Hi @cornhundred,
In the YAML, you can see it here in Isaac's job definition:

IsaacJobDefinition:
  Type: AWS::Batch::JobDefinition
  Properties:
    JobDefinitionName: !Join ["-", ["isaac", !Ref Env]]
    Type: container
    RetryStrategy:
      Attempts: !Ref RetryNumber
    ContainerProperties:
      Image: !Ref IsaacDockerImage
      Vcpus: !Ref IsaacVcpus
      Memory: !Ref IsaacMemory
      JobRoleArn: !Ref JobRoleArn
      MountPoints:
        - ContainerPath: "/scratch"
          ReadOnly: false
          SourceVolume: docker_scratch
      Volumes:
        - Name: docker_scratch
          Host:
            SourcePath: "/docker_scratch"
Effectively, you define the source path on your instance under Volumes, and then you define the container mount point under MountPoints.
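If it helps to think of it in plain Docker terms, that pairing behaves much like a bind mount passed to docker run; the image name below is only a placeholder:

# Roughly what Batch sets up for each container from this job definition:
# the host path /docker_scratch is bind-mounted into the container at /scratch.
docker run -v /docker_scratch:/scratch isaac-image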
mount_volume.sh is the code we used to help create the Golden AMI. You can find this in part 3 of the blog series.
We do need to make this more clear though in the GH repo...will see what I can do.
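In the meantime, here is a minimal sketch of the kind of work a script like mount_volume.sh does when baking the Golden AMI; it assumes the extra EBS volume shows up as /dev/xvdb, and the real script is the one in part 3 of the blog:

#!/bin/bash
# Sketch only: format the extra EBS volume and mount it where the job
# definition expects it. The device name /dev/xvdb is an assumption.
sudo mkfs -t ext4 /dev/xvdb
sudo mkdir -p /docker_scratch
sudo mount /dev/xvdb /docker_scratch
# Persist the mount so instances launched from the AMI come up with it.
echo "/dev/xvdb /docker_scratch ext4 defaults,nofail 0 2" | sudo tee -a /etc/fstab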
Hi @ajfriedman18,
Thank you for the help. I made an AMI using the instructions on the blog and I think it mounted the 1TB EBS volume correctly. I can ssh into a running EC2 instance of the AMI and see the 1TB docker_scratch volume with df -h:
[ec2-user@ip-######### docker_scratch]$ df -h
Filesystem      Size  Used  Avail  Use%  Mounted on
/dev/xvda1      7.8G  703M  7.0G     9%  /
devtmpfs        489M   84K  489M     1%  /dev
tmpfs           497M     0  497M     0%  /dev/shm
/dev/xvdb       985G   72M  935G     1%  /docker_scratch
From what I understand I now need to use this custom AMI to run the batch jobs.
Do the MountPoints and Volumes container properties have a similar function to running a Docker container and passing in an external volume? Also, will all submitted jobs share this common docker_scratch directory (e.g. multi-tenancy)?
Also, do we need to specify the memory available for the AMI? I see that we can specify the memory available for a job definition, but does the same need to be done when making the AMI (e.g. when selecting the t2.micro instance type to launch)? Or does the 'managed compute environment' take care of this?
@cornhundred, yes all jobs will share an external volume in the scenario we built. However, the individual Docker containers have a python wrapper that creates a unique subdirectory in the volume so you won't have any file clashes.
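The wrapper itself is Python, but the pattern is simple enough to show as a shell sketch: key the working directory off the job ID that Batch injects into every container (AWS_BATCH_JOB_ID), so concurrent jobs never write to the same path.

# Sketch of the per-job scratch isolation idea (the real wrapper is Python).
# AWS_BATCH_JOB_ID is set by AWS Batch inside each running container.
WORKDIR="/scratch/${AWS_BATCH_JOB_ID}"
mkdir -p "$WORKDIR"
cd "$WORKDIR"
# ...run the tool here, writing outputs only under $WORKDIR...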
You shouldn't need to specify the memory available for the AMI or instance. The closest you get to this is in defining your compute environment instance types. Beyond that, Batch handles the rest. It'll look at your Job Definition and 1) see if any instances already in the CE have resources available to run the job, and if not 2) spin up a new instance that can meet the resource requirements you've specified.
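For example, when you create the managed compute environment you hand it the custom AMI and a set of allowed instance types, and Batch sizes instances against the job definitions on its own. A sketch with the AWS CLI, where every ID and ARN is a placeholder:

# Sketch: managed compute environment using the custom AMI (all IDs/ARNs are placeholders).
aws batch create-compute-environment \
  --compute-environment-name genomics-ce \
  --type MANAGED \
  --service-role arn:aws:iam::123456789012:role/AWSBatchServiceRole \
  --compute-resources type=EC2,minvCpus=0,maxvCpus=256,desiredvCpus=0,instanceTypes=optimal,imageId=ami-0123456789abcdef0,subnets=subnet-aaaaaaaa,securityGroupIds=sg-bbbbbbbb,instanceRole=ecsInstanceRole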
Hi @ajfriedman18
Thanks for the clarification. We were able to get jobs running on our end using your shared scratch directory set-up, and the Batch-managed compute environment took care of selecting instances with sufficient memory.
Best, Nick
Hello, @cornhundred
The blog only shows how to add the EBS volume via the web console. As I understand from the posts above, you have to launch the EC2 instance first and then configure it over SSH. Is there no way to do this automatically via a CloudFormation template?
Hi @nikita-sheremet-java-developer. As of now, AWS Batch does not allow for attaching EBS volumes at instance launch, which is why creating the Custom AMI is a required step. Though if you'd prefer to script it, you could likely write a simple Python or shell script with the CLI to create the custom AMI.
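For example, a rough outline with the AWS CLI; the instance ID, volume ID, and device name are placeholders, and the format/mount steps still have to run on the instance itself:

# Sketch: script the Custom AMI creation instead of clicking through the console.
# All IDs below are placeholders.
aws ec2 attach-volume --volume-id vol-0123456789abcdef0 \
  --instance-id i-0123456789abcdef0 --device /dev/sdb
# ...ssh in (or use user data) to run the mkfs/mount steps from mount_volume.sh...
aws ec2 create-image --instance-id i-0123456789abcdef0 \
  --name "batch-genomics-golden-ami"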
A feasible solution is to use EFS as the shared storage for a genomics pipeline, especially for reference data such as genome sequences and database indexes (e.g. BLAST and Bowtie indexes), and to build your own AMI on top of an EC2 instance with the EFS mount. Of course, you need to create the EFS file system first and mount it on the EC2 instance used for your custom AMI, then pass the AMI to AWS Batch.
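If you go that route, the mount on the instance used to bake the AMI looks roughly like the following; the file system ID and region are placeholders, and amazon-efs-utils is an alternative to the plain NFS mount shown here:

# Sketch: mount EFS on the EC2 instance used to create the custom AMI.
# File system ID and region are placeholders.
sudo yum install -y nfs-utils
sudo mkdir -p /mnt/efs
sudo mount -t nfs4 -o nfsvers=4.1 fs-12345678.efs.us-east-1.amazonaws.com:/ /mnt/efs
# Add a matching /etc/fstab entry so instances launched from the AMI mount it at boot.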
Here is a CloudFormation template that adds the data volume at EC2 launch: https://github.com/vfrank66/awsbatchlaunchtemplate
It looks like you are using mounted volumes for storing large files (e.g. reference genomes) on the batch job containers. I see that you are using docker_scratch volumes in the CloudFormation YAML, but it is unclear how that volume is being set up from the YAML. Also, where is mount_volumes.sh being run?