This repository contains the docker-compose and bootstrap setup scripts used to create a deployment of the UCSC Genomics Institute's Computational Genomics Platform on AWS. The system is designed to receive genomic data, run analysis at scale on the cloud, and return analyzed results to authorized users. It uses, supports, and drives development of several key GA4GH APIs and open source projects. In many ways it is a generalization of the cloud infrastructure developed for the PCAWG project and a potential reference implementation for the NIH Commons concept.
The system has components fulfilling a range of functions, all of which are open source and can be used independently or together.
These components are set up by the install process available in this repository:
These are related projects that are either already set up and available for use on the web or are used by the components above.
The directions below assume you are using AWS. We will add instructions for other clouds as dcc-ops matures.
Make sure you have:
us-west-2
Use the AWS console or command line tool to create a host. For example:
We will refer to this as the host VM throughout the documentation below and it is the machine running all the Docker containers for each of the components below.
You should make a note of your security group name and ID and ensure you can connect via ssh.
NOTE: We have had problems uploading big files (~25GB) to the Virginia region. If possible, set up your AWS resources in any region other than Virginia.
Make sure you do the following:
Here is a summary of what you need to do. See the Redwood README for details.
Redwood exposes storage, metadata, and auth services. Each of these should be made a subdomain of your "base domain".
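As an illustration of the naming scheme, the sketch below prints one hostname per Redwood service for a given base domain. It assumes each subdomain label is simply the service name (storage, metadata, auth); check the Redwood README for the exact names it expects. The function name redwood_hosts is hypothetical and not part of dcc-ops.

```shell
# Hypothetical helper: print the three service hostnames for a base domain.
# Assumption: each subdomain label equals the service name (storage, metadata, auth).
redwood_hosts() {
  base_domain="$1"
  for service in storage metadata auth; do
    printf '%s.%s\n' "$service" "$base_domain"
  done
}
```

For example, `redwood_hosts example.com` lists the three names you would need DNS records for under that assumption.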
Now we're ready to install Redwood.
See the Consonance README for details. Consonance assumes you have an SSH key created and uploaded to a location on your host VM. Other than that, there are no additional pre-setup tasks.
Add your private SSH key under ~/.ssh/<your_key>.pem. This is typically the same key that you use to SSH to your host VM; regardless, it needs to be a key created on the AWS console so Amazon is aware of it. Then run chmod 400 ~/.ssh/<your_key>.pem so your key is not publicly viewable.
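The permission step can be wrapped in a small check so that a missing or misnamed key file is caught early. This is a minimal sketch; the function name secure_key is hypothetical, and the key path is whatever you chose above.

```shell
# Hypothetical helper: lock down a private key file and confirm the resulting mode.
secure_key() {
  key_path="$1"
  if [ ! -f "$key_path" ]; then
    echo "no such key file: $key_path" >&2
    return 1
  fi
  # Owner read-only, no access for group or others
  chmod 400 "$key_path" || return 1
  # The first 10 characters of `ls -l` output are the file type and permission bits
  perms="$(ls -ld "$key_path" | cut -c1-10)"
  if [ "$perms" = "-r--------" ]; then
    echo "ok: $key_path is owner read-only"
  else
    echo "unexpected permissions on $key_path: $perms" >&2
    return 1
  fi
}
```

Usage: `secure_key ~/.ssh/<your_key>.pem`.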
Follow the instructions here to create an AMI for the worker node. Use an Ubuntu 14.04 base box; the official Ubuntu release works. You may need to make your own AMI with more storage. Make sure you create it in the same region where your VM and S3 buckets are located.
You probably want to install the Consonance command line on the host VM so you can submit work from outside the Docker containers running the various Consonance services. Likewise, you can install the CLI on other hosts and submit work to the queue.
Download the consonance command line from the Consonance releases page: https://github.com/Consonance/consonance/releases
For example:
wget https://github.com/Consonance/consonance/releases/download/2.0.0-alpha.15/consonance
sudo mv consonance /usr/local/bin/
sudo chmod a+x /usr/local/bin/consonance
# Running the command installs the tool and prompts you for your token. Get the token after running install_bootstrap.
consonance
Follow the interactive directions for setting up this CLI. You will need the elastic IP you set up previously (or, better yet, the "base domain" from above).
Here is a summary of what you need to do. See the Boardwalk README for details.
ElasticSearch requires that you set vm.max_map_count to at least 262144. The bootstrap installer will take care of this. However, the change is not permanent: if you restart your VM, vm.max_map_count will revert to its default. To make the change survive restarts, edit the file /etc/sysctl.conf on your VM and add or edit this line: vm.max_map_count=262144
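The "add or edit this line" step can be made repeatable with a small helper that leaves exactly one entry for the key in a sysctl-style file. This is a sketch only; the function name ensure_sysctl_setting is hypothetical and not part of the bootstrap installer.

```shell
# Hypothetical helper: make sure KEY=VALUE appears exactly once in a sysctl-style file.
# Note: dots in KEY are treated as regex wildcards, which is harmless for this use.
ensure_sysctl_setting() {
  conf_file="$1"
  key="$2"
  value="$3"
  tmp="$(mktemp)" || return 1
  # Copy every line except existing assignments of KEY (grep exits 1 if nothing is left)
  grep -Ev "^[[:space:]]*${key}[[:space:]]*=" "$conf_file" > "$tmp" || true
  # Append the desired assignment
  printf '%s=%s\n' "$key" "$value" >> "$tmp"
  # Write back through the original file so its ownership and mode are preserved
  cat "$tmp" > "$conf_file"
  rm -f "$tmp"
}
```

Run as root, this would be called as `ensure_sysctl_setting /etc/sysctl.conf vm.max_map_count 262144`, followed by `sysctl -p` to load the file without a reboot.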
You need to create a Google OAuth2 app to enable login and token download from the dashboard. If you don't want to enable this on the dashboard during installation, simply enter a random string when asked for the Google Client ID and the Google Client Secret. You can consult here under "Creating A Google Project" if you want to read more details. Here is a summary of what you need to do:
http://<YOUR_SITE>. Press Enter. Add a second entry, same as the first one, but use https instead of http.
http://<YOUR_SITE>/gCallback. Press Enter. Add a second entry, same as the first one, but use https instead of http.
Please note: at this point, the dashboard only accepts logins from emails with a 'ucsc.edu' domain. In the future, it will support other email domains.
Once the above setup is done, clone this repository onto your server and run the bootstrap script:
# note, you may need to checkout the particular branch or release tag you are interested in...
git clone https://github.com/BD2KGenomics/dcc-ops.git && cd dcc-ops && sudo bash install_bootstrap
The install_bootstrap script will ask you to configure each service interactively.
/sbin/ifconfig. The device to use is the one associated with the private IP address of your AWS VM.
c4.8xlarge. Support for more instances will come in the future.
dev mode will use letsencrypt's staging service, which won't exhaust your certificate limit, but will install fake SSL certificates. prod mode will install official SSL certificates.
What is the Consonance access token? Enter your Consonance access token.
What is the AWS Access key ID? Your AWS key used for the storage system.
What is the AWS secret access key? Your AWS secret key used for the storage system.
What is the AWS profile? Your AWS username.
What is the AWS region? Your AWS region.
What is your Redwood endpoint? Enter the endpoint for the storage system, e.g. myurl.com. Referred to as the "base domain" above.
What is your Redwood Access Token? Enter your storage system access token.
What is your Elastic Search endpoint? Enter your Elastic Search endpoint, e.g. elasticsearch1
What is your Elastic Search endpoint port? Enter the port number, e.g. 9200
What is your AWS S3 touch file bucket? Enter the name of the AWS bucket where touch files will be written.
Once the installer completes, the system should be up and running. Congratulations! See docker ps to get an idea of what's running.
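For the network-device prompt, matching the private IP address to a device name can also be scripted. The sketch below is an alternative to reading /sbin/ifconfig by eye: it assumes the iproute2 `ip` command is available on the host VM and parses its one-line-per-address output. The function name device_for_ip is hypothetical.

```shell
# Hypothetical helper: read `ip -o -4 addr show` output on stdin and print the
# device whose IPv4 address equals the first argument.
# In that output, field 2 is the device name and field 4 is ADDRESS/PREFIX.
device_for_ip() {
  awk -v ip="$1" '{ split($4, addr, "/"); if (addr[1] == ip) print $2 }'
}
```

Usage: `ip -o -4 addr show | device_for_ip <your_private_ip>`. If nothing is printed, no device carries that address.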
Here are things we need to explain how to do post install:
sudo redwood/cli/bin/redwood token create -u email@ucsc.edu -s 'aws.upload aws.download'
This gives access to all programs.
sudo redwood/cli/bin/redwood project create PROJECT
sudo redwood/cli/bin/redwood token create -u email@ucsc.edu -s 'aws.PROJECT.upload aws.PROJECT.download'
test/rnaseq-cgl-refdata
sudo docker run --rm -it -e ACCESS_TOKEN=$(cat token.txt) -e REDWOOD_ENDPOINT=ops-dev.ucsc-cgl.org -v $(pwd)/outputs:/outputs -v $(pwd):/dcc/data quay.io/ucsc_cgl/core-client:1.1.0-alpha spinnaker-upload --force-upload --skip-submit /dcc/data/manifest.tsv
sudo docker exec -it boardwalk_dcc-metadata-indexer_1 bash -c "/app/dcc-metadata-indexer/cron.sh"
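When creating tokens for several projects, the project-specific scope string can be generated instead of typed by hand. This is a small sketch that follows the aws.PROJECT.upload / aws.PROJECT.download pattern used in the token command above; the function name project_scopes is hypothetical.

```shell
# Hypothetical helper: build the scope string for a single project,
# following the aws.PROJECT.upload / aws.PROJECT.download pattern.
project_scopes() {
  project="$1"
  printf 'aws.%s.upload aws.%s.download\n' "$project" "$project"
}
```

Usage: `sudo redwood/cli/bin/redwood token create -u email@ucsc.edu -s "$(project_scopes PROJECT)"`.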
To test that everything installed successfully, you can run cd test && ./integration.sh. This will do an upload and download with core-client and check the results.
Make sure you have the consonance CLI installed.
Create a run.json file:
{
  "input_file": {
    "class": "File",
    "path": "https://raw.githubusercontent.com/briandoconnor/dockstore-tool-md5sum/master/md5sum.input"
  }
}
consonance run --tool-dockstore-id quay.io/briandoconnor/dockstore-tool-md5sum:1.0.3 --flavour r3.8xlarge --run-descriptor run.json
# and it produces this
"job_uuid" : "66a67327-ccd3-4af0-a5c8-688fb52da778"
# you can check the status
consonance status --job_uuid 66a67327-ccd3-4af0-a5c8-688fb52da778
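To avoid copying the UUID by hand, it can be pulled out of the submission output and passed on to the status command. The sketch below assumes the output contains a line shaped like the "job_uuid" line shown above; the rest of the output format is not assumed. The function name extract_job_uuid is hypothetical.

```shell
# Hypothetical helper: read text on stdin and print the first job_uuid value found.
# Assumption: the value appears on a line shaped like  "job_uuid" : "<uuid>"
extract_job_uuid() {
  sed -n 's/.*"job_uuid"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}
```

Usage, assuming the submission output is saved to a file named run_output.txt (a hypothetical name): `consonance status --job_uuid "$(extract_job_uuid < run_output.txt)"`.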
End users should be directed to use the quay.io/ucsc_cgl/core-client docker image as documented in its README.
The test/integration.sh file also demonstrates normal core-client usage.
Here is a sample command you can run from the test folder to do an upload:
NOTE: Make sure you create an access token for yourself first. You can do so by running, from within dcc-ops, the command redwood/cli/bin/redwood token create -u myemail@ucsc.edu -s 'aws.upload aws.download'. This will create a global token that you can use to test upload and download on any project. End users should only be given project-specific scopes like aws.PROJECT.upload.
sudo docker run --rm -it -e ACCESS_TOKEN=<your_token> -e REDWOOD_ENDPOINT=<your_url.com> \
-v $(pwd)/manifest.tsv:/dcc/manifest.tsv -v $(pwd)/samples:/samples \
-v $(pwd)/outputs:/outputs quay.io/ucsc_cgl/core-client:1.1.0-alpha spinnaker-upload \
--force-upload /dcc/manifest.tsv
Here is a sample command you can run to download files using a manifest file. On the dashboard, go to the "BROWSER" tab, and click on "Download Manifest" at the bottom of the list. Save this file, and run the following command. This will download the files specified in the manifest:
sudo docker run --rm -it -e ACCESS_TOKEN=<your_token> -e REDWOOD_ENDPOINT=<your_url.com> \
-v $(pwd)/<your_manifest_file_name.tsv>:/dcc/dcc-spinnaker-client/data/manifest.tsv \
-v $(pwd)/samples:/samples -v $(pwd)/outputs:/outputs \
-v $(pwd):/dcc/data quay.io/ucsc_cgl/core-client:1.1.0-alpha \
redwood-download /dcc/dcc-spinnaker-client/data/manifest.tsv /dcc/data/
To do RNA-Seq analysis, you must first upload reference files to Redwood. You can obtain the reference files by running the following from within dcc-ops:
reference/download_reference.sh
This will download the files under reference/samples. You can then use the core client to do a spinnaker upload as described previously, using the manifest.tsv within the reference folder.
Once you have successfully uploaded the reference files, you can start submitting fastq files to Redwood to run analysis on them. See the help section on the file browser for more information on the template. Use RNA-Seq or scRNA-Seq when filling out the Submitter Experimental Design column on your manifest.
If something goes wrong, you can open an issue or contact a human.