bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

Feature request: run bcbio on awsbatch #3207

Open cariaso opened 4 years ago

cariaso commented 4 years ago

bcbio currently supports several schedulers (lsf, sge, torque, slurm, pbspro) https://bcbio-nextgen.readthedocs.io/en/latest/contents/parallel.html

aws parallelcluster also supports several schedulers (sge, torque, slurm, awsbatch) https://docs.aws.amazon.com/parallelcluster/latest/ug/cluster-definition.html#scheduler

In both cases, switching between the different schedulers is a trivial change to the respective settings.

However, AWS ParallelCluster also supports a scheduler named awsbatch, which bcbio does not support. This scheduler has compelling features (notably autoscaling) and a command-line interface that will be immediately familiar to anyone who has used other schedulers: https://docs.aws.amazon.com/parallelcluster/latest/ug/awsbatchcli.html (awsbsub ≈ SGE's qsub / LSF's bsub; plus awsbstat, awsbkill, awsbqueues, awsbhosts)

Example usage is visible at https://docs.aws.amazon.com/parallelcluster/latest/ug/tutorials_03_batch_mpi.html
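For orientation, the rough command correspondence reads like this. This is a sketch based on the linked awsbatchcli docs; the SGE/LSF pairings are approximate, not exact equivalents:

```python
# Rough mapping between ParallelCluster's Batch CLI wrappers and their
# SGE/LSF counterparts; pairings are approximate, for orientation only.
EQUIVALENTS = {
    "awsbsub":    ("qsub",  "bsub"),    # submit a job
    "awsbstat":   ("qstat", "bjobs"),   # show job / queue status
    "awsbkill":   ("qdel",  "bkill"),   # cancel a running or queued job
    "awsbqueues": (None,    "bqueues"), # list available job queues
    "awsbhosts":  ("qhost", "bhosts"),  # list compute hosts
}

for aws_cmd, (sge, lsf) in EQUIVALENTS.items():
    print(f"{aws_cmd:<10} ~ sge: {sge or '-':<6} lsf: {lsf}")
```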

The existing bcbio scheduler support isn't actually in bcbio; it lives in ipython/ipyparallel. The code to support each scheduler is ~100 lines (https://github.com/ipython/ipyparallel/blob/f2c10970b782b60b2f2ec04e95ab88d204d6c6b9/ipyparallel/apps/launcher.py#L1352), and it seems likely that awsbatch support would be very similar.
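For a concrete sense of the size of the task, here is a hypothetical sketch loosely modeled on those ~100-line batch-system launchers. The class name, the build_submit_args helper, and the -jn flag are illustrative assumptions, not the real ipyparallel API; actual awsbsub flags should be checked against the awsbatchcli docs:

```python
# Hypothetical sketch of an awsbatch launcher, loosely modeled on the
# batch-system launchers in ipyparallel/apps/launcher.py. Names and flags
# here are illustrative assumptions, not the real ipyparallel API.
import shlex

class AWSBatchEngineSetLauncher:
    submit_command = ["awsbsub"]   # ParallelCluster's Batch submit wrapper
    delete_command = ["awsbkill"]  # ...and its cancel wrapper

    def build_submit_args(self, job_name, command):
        # Everything after the options is the command the Batch job runs;
        # -jn (job name) is taken from the awsbatchcli docs.
        return self.submit_command + ["-jn", job_name] + shlex.split(command)

launcher = AWSBatchEngineSetLauncher()
args = launcher.build_submit_args(
    "ipengine", "ipengine --profile-dir /shared/.ipython")
print(" ".join(args))
```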

Related to https://github.com/bcbio/bcbio-nextgen/issues/3022 but a perhaps simpler way to get a good enough result without some of the CWL complexity.

roryk commented 4 years ago

Thanks so much @cariaso, I didn't realize AWS Batch worked like that. Do you have experience using AWS Batch like this? What has to happen to set up the virtual cluster in the first place?

bcbio uses https://github.com/roryk/ipython-cluster-helper to do bcbio-specific configuration of the ipyparallel launchers, so we could easily add AWS Batch support there.

cariaso commented 4 years ago

> I didn't realize AWS batch worked like that.

I think there is some naming confusion that makes this non-obvious. https://aws.amazon.com/batch/ is an AWS service, like EC2, S3, etc.

https://pypi.org/project/aws-parallelcluster/ is software, with some lineage to CfnCluster and possibly http://star.mit.edu/cluster/ before that. Part of this software is a scheduler semi-confusingly named awsbatch, which allows use of the AWS service named Batch through a more familiar interface: https://docs.aws.amazon.com/parallelcluster/latest/ug/awsbatchcli.html

> Do you have experience using AWS batch like this?

No. I'm sort of new here too, but not exactly. ~7 years ago I used http://star.mit.edu/cluster/ to great effect. I hadn't touched it in years, but have continued to use bcbio off and on in other contexts. During that time I might have used StarCluster + bcbio, but honestly it's been so long I'm just not sure. I was dimly aware of ParallelCluster ever since https://bioteam.net/2018/11/aws-launches-parallelcluster-retires-cfncluster/ but didn't touch it until ~3 days ago, when my frustration with #3022 and https://docs.opendata.aws/genomics-workflows/orchestration/step-functions/step-functions-overview/ finally forced me to take a step back. I haven't taken this very far yet, and don't actually have a full run of bcbio on top of ParallelCluster, but it looks like a very nice pairing, and worth some energy before revisiting other approaches, or Sarek.

> What has to happen to setup the virtual cluster in the first place?

From a freshly booted Amazon Linux 2 instance:

# set the aws_access_key_id & aws_secret_access_key (possibly unneeded)
aws configure
# python2 should also work, but python3 is used here
yum install -y python3-pip
pip3 install aws-parallelcluster
pcluster configure

It will prompt for these values:

AWS Region ID
EC2 Key Pair Name
Scheduler = awsbatch
Minimum cluster size          # default blank = 0 is fine
Maximum cluster size (vcpus)  # default blank = 10 is fine
Master instance type

I'm still experimenting, but t2.micro seems to be OK for early testing. On real clusters, though, the master can often be quite busy serving NFS to the worker machines, so consider m5a.xlarge or similar.

Automate VPC creation? (y/n)

It's possible to re-use an existing VPC, but for now it's nice to let it handle this for you; it will also create 2 subnets. Most of this is managed via a CloudFormation stack and will be cleaned up automatically, but the VPC will remain, so a manual deletion is eventually warranted.

It will now spend ~1 minute creating your VPC and subnets. It may be helpful to keep an eye on the VPC and CloudFormation pages of the AWS console during this step.

# This will populate /home/ec2-user/.parallelcluster/config
# a few other settings to consider are documented at
# https://docs.aws.amazon.com/parallelcluster/latest/ug/cluster-definition.html
# -------------- begin customizations of /home/ec2-user/.parallelcluster/config
[vpc default]
# use_public_ips =
# whether or not to use an elastic IP; optional, since the master has an
# sshable public IP either way

[cluster default]
scheduler = awsbatch
cluster_type = spot
base_os = alinux2               # amazon linux 2
# master_root_volume_size = 500
# s3_read_resource = arn:aws:s3:::bucketname/cluster/ro/*
# s3_read_write_resource = arn:aws:s3:::bucketname/cluster/rw/*
ebs_settings = bcbio, shared
# the first time you do this, you can omit ebs_settings and the [ebs xxx]
# sections below, and you'll find a 20G /shared on the master being NFS
# mounted to all workers. I've previously run bcbio_nextgen_install.py,
# installed everything (tools & data) onto a single volume, and then taken
# a snapshot, so I use the settings above plus the sections below for later
# clusters

[ebs bcbio]
shared_dir = /mnt/bcbio
ebs_snapshot_id = snap-0f1234567890

[ebs shared]
shared_dir = /shared
volume_size = 500

# all of the above is still new and relatively untested, but it is at least
# close to right, and is similar to what I did with http://star.mit.edu/cluster/
# many years ago
# -------------- end customizations of /home/ec2-user/.parallelcluster/config
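Since the file pcluster writes is plain INI, one way to sanity-check hand edits like these before running pcluster create is a quick configparser pass. The fragment below is a trimmed copy of the config above (inline comments removed, since configparser treats them as part of the value by default):

```python
# Sanity-check a ParallelCluster config fragment: the file is plain INI,
# so configparser can catch typos such as an ebs_settings name with no
# matching [ebs <name>] section. For the real file, point cfg.read() at
# ~/.parallelcluster/config instead of using an inline string.
import configparser

FRAGMENT = """
[cluster default]
scheduler = awsbatch
cluster_type = spot
base_os = alinux2
ebs_settings = bcbio, shared

[ebs bcbio]
shared_dir = /mnt/bcbio
ebs_snapshot_id = snap-0f1234567890

[ebs shared]
shared_dir = /shared
volume_size = 500
"""

cfg = configparser.ConfigParser()
cfg.read_string(FRAGMENT)

assert cfg["cluster default"]["scheduler"] == "awsbatch"
# every name listed in ebs_settings needs a matching [ebs <name>] section
for name in cfg["cluster default"]["ebs_settings"].split(","):
    assert cfg.has_section(f"ebs {name.strip()}"), f"no [ebs {name.strip()}] section"
print("config fragment parses cleanly")
```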

The non-MPI top portion of https://docs.aws.amazon.com/parallelcluster/latest/ug/tutorials_03_batch_mpi.html will probably answer your remaining questions well enough, but I'm happy to support your next steps as a dev or tester, as suits you.

cariaso commented 4 years ago

AWS has announced the deprecation of SGE and Torque in ParallelCluster: https://github.com/aws/aws-parallelcluster/wiki/Deprecation-of-SGE-and-Torque-in-ParallelCluster Slurm and awsbatch remain, so this isn't a total loss, but it is still noteworthy.

FWIW, I did try to run bcbio under ParallelCluster, but it always failed for me with a message akin to "0 engines running". I am, however, able to run 'hello world'-style SGE jobs under the same ParallelCluster setup with no issue. I've not yet managed to push past that point.