
alsf-scpca

Management and analysis tools for ALSF Single-cell Pediatric Cancer Atlas data.

Environment Setup

Nextflow

Most of the workflows in this repository are run via Nextflow.

Installing Nextflow

You can install Nextflow by following the instructions on the Nextflow website. Be sure to follow the second step of those instructions: move the nextflow file to a directory in your $PATH for easy access.

Alternatively, you can install Nextflow via conda or brew (it may also be available in other package managers):
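
# Via conda (Bioconda channel):
conda install -c bioconda nextflow

# Via Homebrew:
brew install nextflow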

Nextflow Tower

Optionally, you can track your Nextflow workflows in progress with Nextflow Tower. To use this, you will need to create an account on https://tower.nf/ (using your GitHub account is likely simplest), then get your token value from https://tower.nf/tokens. You will then want to create an environment variable with this token value:

export TOWER_ACCESS_TOKEN=<MYTOWERTOKEN>

This line can be entered at the command line, but you will likely want to add it to your shell initialization file (~/.bash_profile, ~/.zshrc, etc.).
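
For example, to persist the token across sessions (assuming zsh; use the initialization file for your shell):

echo 'export TOWER_ACCESS_TOKEN=<MYTOWERTOKEN>' >> ~/.zshrc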

Docker

All of the workflows are designed to run via Docker containers for each tool. It is therefore critical that you have Docker installed for any local execution. For downloads and installation, go to https://www.docker.com/products/docker-desktop
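
Once Docker is installed, a quick sanity check is to run Docker's standard test image:

docker run --rm hello-world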

AWS

The tools in this repository are designed primarily to run on AWS via the AWS Batch system, with data stored in S3. To use these tools as they are currently implemented, you will therefore need an AWS account, with access to the Alex's Lemonade CCDL organization account.

Installing AWS Command line tools

Once you have your account information, you will want to install the AWS command line interface on your local machine. You should be able to use either version 1 (>v1.17) or version 2 of the AWS command line tools, but these instructions have been primarily tested with v1.18. To install, you can follow Amazon's instructions for v2 or install via your favorite package manager (at this time, pip and conda will install v1, while Homebrew will install v2):
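
# Via pip (v1):
pip install awscli

# Via conda (conda-forge channel, v1):
conda install -c conda-forge awscli

# Via Homebrew (v2):
brew install awscli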

Configuring your AWS credentials

When you set up your account, you may have received a paired AWS Access Key ID and AWS Secret Access Key. If you did not, you can log in to the AWS console and select "My Security Credentials" from the menu that appears when you click your username in the upper right. Scroll down to "Access keys for CLI, SDK, & API access" and click the "Create Access Key" button. You will see a popup with your Access key ID and a link to show the Secret access key.

In the terminal, run the command aws configure and, when prompted, paste in the AWS Access Key ID and AWS Secret Access Key. You may also want to set the Default region name: most of the computing resources used here run in us-east-1, but setting this to a different region should not affect most commands. The Default output format can be left as None.
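
A typical configuration session looks like the following (values shown are placeholders):

aws configure
# AWS Access Key ID [None]: <YOUR_ACCESS_KEY_ID>
# AWS Secret Access Key [None]: <YOUR_SECRET_ACCESS_KEY>
# Default region name [None]: us-east-1
# Default output format [None]: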

The aws configure command will create a directory at ~/.aws with a config file and a credentials file, which are required for Nextflow commands to work smoothly with Batch and S3.
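
For reference, these files follow the standard AWS CLI layout (values are placeholders):

# ~/.aws/credentials
[default]
aws_access_key_id = <YOUR_ACCESS_KEY_ID>
aws_secret_access_key = <YOUR_SECRET_ACCESS_KEY>

# ~/.aws/config
[default]
region = us-east-1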

AWS infrastructure

The infrastructure on AWS (Batch queues, AMIs, etc.) is defined via Terraform using the files in the aws directory. More details are described in aws/setup-log.md.
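
If you need to inspect or change that infrastructure, the standard Terraform cycle applies (a sketch only; see aws/setup-log.md for the actual procedure used here):

cd aws
terraform init   # download providers and initialize state
terraform plan   # preview changes without applying them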

Running Workflows

The workflows for this repository are stored in the workflows directory, organized by task into subdirectories. In general, workflows should be run from the main workflows directory rather than from the subdirectories. This allows easy access to the shared nextflow.config file and keeps all intermediate and log files in a single location. In particular, Nextflow creates a separate work directory for every task: these will appear by default (for local tasks) in the workflows/work directory if the command is invoked from workflows. As the work directories can get large, it is helpful to have a single location to keep track of and purge as needed.

Final output file locations are determined on a per-workflow basis, usually by the params.outdir setting within the script (possibly overridden by the --outdir option). In most cases this will be an S3 bucket within s3://nextflow-ccdl-results.
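
For example, to redirect output for a test run (the destination path here is hypothetical):

nextflow run checks/check-md5.nf --outdir s3://nextflow-ccdl-results/my-test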

Running locally

⚠️⚠️⚠️ Running workflows in this repository locally is not something to take lightly! It is fine for a quick test, and useful for development, but note that most of the workflows here use very large data files, which will have to be downloaded locally to run. Workflow processing may require large amounts of RAM and time, so the following example commands will rarely be used, and are mostly for illustration. ⚠️⚠️⚠️

The basic command to run a workflow will look something like the following:

nextflow run checks/check-md5.nf

This will run the workflow locally, using the default parameters as defined in the workflow file.

In most cases, you will want to skip any cached steps that have already run: this can be done by adding the -resume flag.

nextflow run checks/check-md5.nf -resume

⚠️ Again, you probably don't want to run locally unless you have a good reason and know the limitations! ⚠️ If you do run workflows locally (only recommended for testing!), keep in mind that Nextflow will create a work directory in your current directory to store input, output, and intermediate files for the workflow. This work directory can get large fast, so you will want to periodically purge the subdirectories and/or delete the entire directory. Doing so will temporarily eliminate the benefits of -resume, but this is the price we pay.
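
Nextflow can also handle some of this cleanup itself (a sketch; nextflow clean operates on the work files of previous runs):

# Preview what would be deleted from the last run:
nextflow clean -n

# Actually delete it:
nextflow clean -f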

Running on AWS Batch

To run the same workflow on AWS Batch, make sure your credentials are configured as described above, and then run with the batch profile, which has been configured in nextflow.config for the CCDL infrastructure.

nextflow run checks/check-md5.nf -profile batch -resume

(Note that the effectiveness of the -resume flag depends on location: locally cached steps will still have to run on AWS, but if they are cached on AWS, they will be skipped.)

If you have set up Nextflow Tower, you can add a flag to send progress information there:

nextflow run checks/check-md5.nf -profile batch -resume -with-tower

Finally, if you want to change any of the parameters that are defined in a workflow, you can do so at the command line using flags that start with a double dash (--). For example, to use different run IDs, you might use:

nextflow run checks/check-md5.nf -profile batch -resume --run_ids SCPCR000003,SCPCR000004