# firecloud_developer_toolkit

Utilities to help algorithm developers dockerize new tasks, debug them, and put them into FireCloud.

The Firecloud Developer Toolkit (FDT) is broken up into several sections which may be used together or independently.

TBD- should crom be renamed to cromlocal or cromdebug?

A recommended way of doing development is to take a hybrid approach between your laptop and a GCE VM. First, write and edit your code locally on your laptop, and optionally sync it from there to GitHub. To test your current version, copy the source code to a GCE VM, then build the Docker image there and run it. Once you are happy with it, push the Docker image from the VM to Dockerhub, and push the WDL from the VM to FireCloud.

Other reasonable approaches include laptop-only (replacing the GCE VM with VirtualBox or another VM you have sudo privileges on) or GCE-VM-only (doing all of your development work on the VM using emacs/vi or an IDE, using your laptop as a dumb terminal).
Using GCE to run the algorithm has the advantage of being more similar to FireCloud VMs than other VMs are, and gives you access to more resources (CPUs, RAM, storage) than are available on your laptop. On the other hand, while it is simpler in some ways to do all of your editing on the GCE VM, it has the downsides of network latency and ongoing billing.

The main workflow described below follows the hybrid laptop-GCE VM approach to create, debug, and deploy a FireCloud module.

## Initial setup

These utilities require bash, Python 2.7, Git, and Java 1.8. For everything to work, these should all be installed and on your path, both on your laptop and on the GCE VM. Python 3, currently the default on Macs, causes gsutil to give strange errors.
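As a quick sanity check, you can confirm the prerequisites are on your path before going further (version output formats vary by platform):

```
# Quick check that the required tools are installed and on the PATH.
bash --version
python --version    # expect 2.7.x
git --version
java -version       # expect 1.8
```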

The first step is to clone this repo to someplace on your laptop or Linux VM, and then source the fdtsetroot script.
fdtsetroot sets environment variables and adds its directories to your path. Use the -s option to also modify your ~/.bashrc.

```
cd /opt
git clone https://github.com/broadinstitute/firecloud_developer_toolkit.git /opt/fdt
cd /opt/fdt
. fdtsetroot -s
```

Next, you should install a few utilities to your local environment.

To work with GCE VMs, you need gsutil.

```
install_gsutil_latest_nr.sh   # root not needed
update_gsutil_latest_nr.sh    # run this instead if gsutil is already installed
```

To work with FireCloud, you should install FireCloud-specific utilities. Cromwell/wdltool is needed on your laptop to validate and parse your WDL file.

```
install_cromwell_24_nr.sh          # root not needed.  24 is the current FireCloud version as of March 2017.
```

fissfc and the FireCloud CLI are needed to interact with FireCloud directly from the command line, e.g. for loading WDL files, loading data loadfiles, or launching and monitoring workflows.

```
install_firecloudcli_latest_nr.sh  # root not needed. Python 2.7 and Git need to already be in the path.
install_fissfc_latest.sh           # root needed unless you set up a python venv
```

## Create your repo and algorithm template

FireCloud algorithms are encapsulated in Docker containers and run by the Cromwell execution engine. The 'crom' set of tools wraps common Docker and Cromwell commands needed to test your algorithm on a standalone VM. These tools assume that certain files and directories exist in your code repo, and that you follow a particular file naming convention for your algorithm. To make these conventions easier to follow, crom can generate a basic, fully functioning hello-world task that you can modify.

```
. createnewrepo /absolute/path/to/your/new/repo   # creates new directory structure, including tasks subdirectory. Run before making your first task.
# This repo is intended to be pushed to GitHub.  First add README.md, .gitignore, and LICENSE files, and then run 'git init'.

. createnewtask mytask               # creates a complete hello-world task under /path/to/your/new/repo/tasks/mytask
. setalgdir [<your alg directory>]   # sets repo and current alg environment variables.  defaults to cwd
```

## Edit the algorithm template

### src/hello.py

Replace this with your own sophisticated algorithm. Also copy any task-specific jar files or libraries into this src directory.

### taskdef.method.wdl

See https://github.com/broadinstitute/wdl#getting-started-with-wdl for more info on wdl.

### inputtest.method.json

This file is used during non-Firecloud testing to supply arguments for the WDL input parameters.

### sourcefiles.method.list

A list of files to copy into the build directory just before building the Docker image. Used to selectively pull in common files shared by multiple tasks.
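For orientation, here is a hypothetical layout of a freshly generated task named mytask; the exact file set produced by createnewtask is authoritative, and the dockerfile list is described further in the Firehose porting section below:

```
# Illustrative layout only; inspect the output of createnewtask for the real template.
ls -R /absolute/path/to/your/new/repo/tasks/mytask
# src/hello.py              # your algorithm code and task-specific libraries
# taskdef.mytask.wdl        # WDL task/workflow definition
# inputtest.mytask.json     # test inputs for non-FireCloud runs
# sourcefiles.mytask.list   # extra files to copy into the build directory
# dockerfile.mytask.list    # ordered list of .docker fragments (see the porting section)
```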

## Set up GCE VM

We will use a GCE VM to build and test your algorithm, so let's create one.

You need a Google billing account set up first, along with a Google project. The project needs to have privileges to spin up VMs. It cannot be the same as your FireCloud project, as FireCloud projects do not let you spin up VMs yourself.

First, log into Google from the command-line on your laptop, and choose the project the VM will be billed to. Perhaps choose us-east1-c for the default zone.

```
gcloud init
```

### Base GCE VM off previously built image

The quickest way to spin up a new VM is off a previously built image that is accessible from your project. It will be ready to go in perhaps half a minute.

```
create_and_start_vm.sh <vmname> <instance_type> <docker_disk_size>
create_and_start_vm.sh pcawgvm1 n1-standard-2 20
```

Then log in to the VM by opening a shell from an existing shell window:

```
ssh_to_vm.sh <vmname>
ssh_to_vm.sh pcawgvm1
```

After you spin up a VM, some one-time, user-specific setup is needed on it.

```
gcloud init                     # log in with your Google credentials, used to access FireCloud and GCS buckets
docker login                    # log into Dockerhub, for pushing and pulling private images
sudo gpasswd -a ${USER} docker  # add yourself to the docker unix group, needed for crom to work.  Requires relogin to take effect.
```

### Import a previously built image

If you want to import an image into your project that was created by someone else, these basic notes may be useful... TBD make these more complete

### Base GCE VM off stock image

If you want to customize your image, you can build your VM from a stock image from scratch.

Once the VM exists, log in to it:

```
ssh_to_vm.sh <vmname>
ssh_to_vm.sh test1
```

Then install the software you need: first Git, next the FireCloud Developer Toolkit, and then the other packages you need via the FDT installers. The following packages take several minutes in total to install.

```
git clone https://github.com/broadinstitute/firecloud_developer_toolkit.git /opt/fdt
cd /opt/fdt
. fdtsetroot -s
```

#### general

```
install_python_2.7.sh
install_oraclejavajdk_8_debian7bp.sh
install_docker_latest_debian7bp.sh   # sets docker unix group membership, needs relogin before it takes effect
set_timezone_ET.sh
```

#### Google

```
install_gsutil_latest_nr.sh   # needs Python 2.7.  Note: corrupts ~/.bashrc if run more than once; needs manual editing to fix.
```

#### firecloud-specific

```
install_cromwell_21_nr.sh            # needs java.  For better correlation, pick the version currently used on FireCloud.
install_firecloudcli_latest_nr.sh    # relies on python venv, java
install_fissfc_latest.sh             # relies on python pip
```


## Manage GCE VM

Now that your VM has been created, you can work with it in various ways.  Command-line approaches are described below, though
many of the same things can be accomplished via the GUI at console.cloud.google.com, or via Google's "Cloud Console"
mobile app.

```
stop_vm.sh       # powers down the VM so you don't get compute charges for it, only storage charges
start_vm.sh      # wakes up a VM that was previously stopped, assigns new IP address
reset_vm.sh      # reboots VM
expand_disk.sh   # enlarges docker_disk; can be done while the VM is running or stopped. Cannot shrink disk.
delete_vm.sh     # purges VM and docker_disk, halting charges for both compute and storage.
```

```
stop_vm.sh pcawgvm1
start_vm.sh pcawgvm1
reset_vm.sh pcawgvm1
expand_disk.sh pcawgvm1 50
delete_vm.sh pcawgvm1
```


#### Info
To see what VMs you have, both running and halted, use:

```
list_vm.sh   # gives name, state, and ip addresses of your VMs
```

* You can use the external IP address listed here to attach to the VM, via the command line or an IDE. Authenticate with the
~/.ssh/google_compute_engine key you created, instead of a password; a minimal example follows below.
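A minimal sketch of attaching directly over ssh; the IP address and username below are placeholders, so substitute the external IP reported by list_vm.sh and the account gcloud set up for you on the VM:

```
# Placeholder IP and username; use the external IP from list_vm.sh and your own
# account name on the VM.
ssh -i ~/.ssh/google_compute_engine your_username@203.0.113.10
```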

#### File transfer

##### IDE-based file transfer
IDE's can be configured to upload files it changes whenever you hit save, using the same ip address and ssh key.  In 
Pycharm, this is set under Tools -> Deployment -> Configuration
##### rsync-based file transfer
When initially attaching to the VM, or after making changes outside of the IDE, you can rsync to make sure that all the local files live on the VM.

```
rsync_to_vm.py <vmname> <local_path> <remote_path>
rsync_to_vm.py pcawgvm1 ~/pcawg /opt
```

* As usual, rsync copies only the files it decides it needs to.  Note that it does not delete any files on the VM, or
(I think) overwrite files at the destination that are newer.  If you want things to be fresh, manually delete them on
the VM side before performing the rsync.
* Note that transfers into GCE are free, while transfers out are expensive at TB scale.
* Note that rsync to a VM is much slower than rsync into a bucket, as it runs serially; if you are trying to upload
many GB it can take a while.  If you are uploading TBs of data, you probably want to use a bucket.
* For code, it may be cleaner to transfer via GitHub, e.g. do `git push origin master` from your local computer
and `git pull` from the GCE VM, though this involves an extra step compared to rsync.

##### bucket-based file transfer
_TBD add details; a minimal gsutil sketch follows this list_
* boto file editing
* gsutil cp
* gsutil mb
* pseudo-directories
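Pending those details, here is a minimal sketch of bucket-based transfer with gsutil; the bucket name and paths below are placeholders:

```
# Placeholder bucket name and paths.  "Directories" inside a bucket are really
# just object-name prefixes (pseudo-directories).
gsutil mb gs://my-example-bucket                                 # make a new bucket
gsutil cp ~/pcawg/somefile.tar gs://my-example-bucket/uploads/   # laptop -> bucket
gsutil cp gs://my-example-bucket/uploads/somefile.tar /opt/      # bucket -> VM (run this on the VM)
```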

#### Pricing summary
You can find the cost of various things here: https://cloud.google.com/compute/pricing
* A 2-core, standard-memory (7.5GB) VM is $0.10/hr, $2.40/day (see the rough estimate after this list).
* The biggest VM, 32-core highmem (208GB), is $2/hr, $48/day.
* Preemptible or long-running VMs are cheaper.
* VM disk space is $40/TB/mo, based on capacity rather than space used.
* Bucket storage is $20 or $26/TB/mo for data that is directly accessible, or $10/TB/mo for data you will access less than once a month.
* Data egress out of Google can be as much as $120/TB.
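As a rough, illustrative estimate using the list prices above (one week on a 2-core standard VM plus a 1 TB docker disk kept for a month):

```
# Back-of-the-envelope only; real bills depend on region, sustained-use discounts, etc.
awk 'BEGIN {
  vm   = 7 * 24 * 0.10   # one week at $0.10/hr  -> $16.80
  disk = 1 * 40          # 1 TB at $40/TB/mo     -> $40.00
  printf "VM: $%.2f   Disk: $%.2f   Total: $%.2f\n", vm, disk, vm + disk
}'
```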

## Running your algorithm on the VM
Once you have your VM set up and your algorithm repo copied to it, you need to let the FDT know where it lives:

```
. setalgdir
```


## Build image
To build the docker image for your current algorithm, run:

```
build
```

* This creates a build subdirectory, assembles your Dockerfile and source files there, and runs docker build on it.
* If the docker build appears to fail intermittently, it may be because of the way Docker caches its builds.  The best-practice
recommendation is to put update and install on the same line, e.g. 'apt-get update && apt-get install -y foo'. To build
without using the cache, run ```buildclean```, and expect it to take longer.

### Run
Run your image locally:

```
runcromwell [inputtest]
killrun
```

* `runcromwell` runs your docker image via Cromwell in the background (using &), so that things don't die if your ssh connection times out.
If you want to abort it, run `killrun` to kill Cromwell and the container.
* `runcromwellfg` runs in the foreground, without using &, so ctrl-C can also abort it.
* If you want to use a different input than the default `inputtest.<method>.json`, you can specify the inputtest filename (full directory not needed); see the sketch after this list.
* gs:// and http:// files are localized to `<method>/localized_inputs`, and cached there between runs.
* Output files are written to deep subdirectories under ./cromwell-executions.
* Symlinks are dropped at /opt/execution and /opt/inputs pointing to the latest Cromwell run.  In addition, /opt/src is linked
to the build/src directory created when you built the docker.
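A typical cycle might look like the following; the alternate input file name here is hypothetical:

```
# 'inputtest.small.json' is a hypothetical alternate to the default
# inputtest.<method>.json, living in the same task directory.
runcromwell                        # run with the default inputs
runcromwell inputtest.small.json   # rerun against a different input set
killrun                            # abort: kills Cromwell and the running container
```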

### Troubleshoot on VM
#### Work with output directory during or after run
Cromwell uses a mount for the output directory of the docker image, so you can view it both while the job is running and after it finishes.
* ```cd `execution` ``` to cd into the latest output directory (part of the path is a UUID that distinguishes repeated runs).
* /opt/execution, /opt/inputs, and /opt/src are handy symlinks.
* ```cd `algdir` ``` to cd into the base task directory
* Analyze the `dstat.log.txt` file emitted from process_monitor.py to figure out how to size the VM (cores, RAM, disk)
* `clearoutputs` wipes the outputs from previous runs - they can get bulky.
* See `expand_disk.sh <vmname> <new_docker_disk_size>` for giving yourself extra disk space if you are running out of room.
#### Work inside container
* `attach` gives you a bash prompt inside the currently running container.  (Only works if exactly one container is running.)
* `runbash` starts your container outside of Cromwell and gives you a bash prompt.  This also mounts
/opt/execution and /opt/inputs inside the container to point to these same directories outside the container, which
are in turn linked to the most recent Cromwell run.  To pick up where the previous run left off, look at the 'script' file
in the output directory to learn what the initial command line was, and drop symlinks at the original inputs and execution directories
pointing to these directories. For example:

```
mkdir -p /root/mymethod_workflow/eedf0dde-c686-4bfd-bc45-b50f182624ea/call-mymethod
ln -s /root/mymethod_workflow/eedf0dde-c686-4bfd-bc45-b50f182624ea/call-mymethod/execution /opt/execution
ln -s /root/mymethod_workflow/eedf0dde-c686-4bfd-bc45-b50f182624ea/call-mymethod/inputs /opt/inputs
```

### Release

```
pushimage   # uploads your image to a repo such as Dockerhub, based on the info given in taskdef.method.wdl
pushwdl     # uploads your taskdef.method.wdl to FireCloud, based on the info given in the task's Makefile
```

## Wire into Firecloud
* note in method repo
* export or import to move to workspace
* import dummy sample, attach it
* run on sample
* modify to use sample annotation pointing to a small text file
* rerun on sample
* push new version of wdl
* note how version stamp showed up
* push change to workspace, update, rerun

## Porting Tasks from Firehose

### Defining the docker
Typically the hydrant file will indicate what language package is needed on the command line, e.g. via something like `use R-2.15`, or by calling out `<matlab2009a>`.
Based on this info, check the common/dockerfiles directory for a suitable match, and put the matching entries into your `dockerfile.<method>.list` file.  Note that .docker files with 'base' in the name must appear in the list file only once, at the start.
Other .docker files can be added to the .list file to load third-party packages or languages.  The .docker file loading your own module from `<method>/src` should be listed last, to best take advantage of docker build caching.
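A minimal sketch of what such a list might look like; the fragment names below are hypothetical, so check common/dockerfiles for the ones that actually exist in your checkout:

```
# Hypothetical fragment names; order matters: the single 'base' fragment first,
# third-party language/package fragments next, your own module's fragment last.
cat > dockerfile.mymethod.list <<'EOF'
base_something.docker
r_2.15.docker
mymethod.docker
EOF
```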

### Generating the command line for non-scatter-gather
The common/utils/firehose_module_adaptor directory contains utilities that make it easier to port existing Firehose modules into a FireCloud docker container.  The utilities are designed to be coupled only to Firehose, not to FireCloud or Docker.

run_module.py is used for normal (i.e. non-scatter-gather) jobs, enabling you to call the module via a simple command line, using information given in the hydrant and manifest files.

local_scatter_gather.py handles scatter-gather jobs; it is described in the scatter-gather section below.

To use run_module.py, you copy code from Firehose verbatim, e.g. unpack the zip file you get from exporting the Firehose task.  This directory is then passed in as the module_libdir argument.
All the other arguments are the ones from the manifest, using the names visible to the user in Firehose.  All parameters are treated as if they were labeled optional, regardless of what the manifest says.

Below is an example showing how the commandline is generated for running the ContEst module. 

```
module_libdir = os.path.join(module_dir, 'contest')
cmdStr = os.path.join(adaptor_dir, 'run_module.py')
cmdStr += ' --module_libdir ' + module_libdir + ' '

cmdStr += ' '.join([
    '--reference.file', os.path.join(refdata_dir, 'public', 'human_g1k_v37_decoy.fasta'),
    '--intervals', os.path.join(refdata_dir, 'public', 'gaf_20111020+broad_wex_1.1_hg19.bed'),
    '--sample.id', os.path.basename(bam_tumor),
    '--clean.bam.file', bam_tumor,
    '--normal.bam', bam_normal,
    '--genotypes.vcf none',
    '--pop.vcf', os.path.join(refdata_dir, 'protected', 'hg19_population_stratified_af_hapmap_3.3.fixed.vcf'),
    '--force.array.free true',
    '--snp.six.bed', os.path.join(refdata_dir, 'public', 'SNP6.hg19.interval_list'),
    '--job.spec.memory', '2',
    '--tmp.dir', tmp_dir  # TBD job spec memory was 2; manifest also changed.
])
```



### Generating the commandline for scatter-gather jobs

local_scatter_gather.py can be used for scatter-gather jobs, though it is probably most useful as an initial, intermediate step.  While it similarly enables you to run the module with a simple command line,
all jobs are executed on the same node, and the scatter jobs are run sequentially.

For better throughput you would want to specify scatter-gather in the WDL workflow so that multiple nodes can be used.  

At some point in the future this utility will probably be enhanced to generate such WDL for you.

run_sg_prepare.py, run_sg_scatter.py, and run_sg_gather.py are utilities written assuming the outputs live under a common root directory.  They will likely be tweaked as part of this future enhancement.

### Using the Pipette scheduler
algutil/pipette_server contains a scheduler that can keep the cores busy on a single VM.

## Future plans

#### gce
* suspend
* idle vm detection/notification
* migrate to different sized vm
* budget setting/notifications

#### install
* x11, vnc
* vpn
* matlab interactive
* gcs-fuse
* openssl
* rsub

#### crom
* multi-task workflow support
* move to inline python for wdl command script

#### Firecloud management