Michael G. Campana and Ellie E. Armstrong, 2019-2024
Smithsonian Institution
Stanford University
Pipeline to calculate de novo mutation rates from parent-offspring trios
This README provides basic details for installing, configuring and running the pipeline. Please note that as of version 1.0.0, RatesTools has upgraded to Nextflow DSL2. For the original DSL1 pipeline, please see versions <=0.5.16. Detailed documentation is available for the Ruby and R scripts included in this package and for the pipeline's operation. Test data are provided in the Smithsonian Institution Figshare repository, and a tutorial is also available.
To the extent possible under law, the Smithsonian Institution and Stanford University have waived all copyright and related or neighboring rights to RatesTools; this work is published from the United States.
We politely request that this work be cited as:
Armstrong, E.E. & M.G. Campana. 2023. RatesTools: a Nextflow pipeline for detecting de novo germline mutations in pedigree sequence data. Bioinformatics. 39: btac784. DOI: 10.1093/bioinformatics/btac784.
Preprint available on bioRxiv. DOI: 10.1101/2022.07.18.500472.
We provide a configuration profile "conda" in the default configuration file (nextflow.config) that installs all dependencies using Conda. As of RatesTools 1.0.0, we recommend (and default to) the use of Mamba for environment construction. Using this profile, the user only needs to install Nextflow [1], Conda/Mamba and the RatesTools pipeline:
Install Nextflow: curl -s https://get.nextflow.io | bash
Install Conda (and/or Mamba): See the Conda and Mamba installation instructions.
Pull the current version of the RatesTools pipeline: nextflow pull campanam/RatesTools -r main
We explicitly list software dependencies here as no installation system (e.g. via Conda or containerization) is universally supported across all computing architectures.
RatesTools requires Nextflow [1] v. >= 23.10.0, Ruby v. >= 3.2.2, R [2] v. 4.0.2 and Bash v. >= 4.2.46(2)-release. Basic instructions for installing these languages are copied below. We recommend installing Ruby using the Ruby Version Manager. See the official language documentation should you need help installing these languages.
Install Nextflow: curl -s https://get.nextflow.io | bash
Install the latest Ruby using Ruby Version Manager: curl -sSL https://get.rvm.io | bash -s stable --ruby
Install R: Use the appropriate precompiled binary/installer available at the Comprehensive R Archive Network (CRAN).
Pull the current version of the pipeline: nextflow pull campanam/RatesTools -r main
To specify another RatesTools release, replace main with the RatesTools release version (e.g. v0.5.7).
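Before pulling the pipeline, it can help to confirm that the installed language versions meet the minimums listed above. The following is a minimal sketch, not part of RatesTools: the helper names version_ge and check_tool are hypothetical, the comparison assumes a sort that supports -V (e.g. GNU coreutils), and the Nextflow line assumes that nextflow -v prints the version as its third word.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not shipped with RatesTools).

# version_ge FOUND MINIMUM: succeeds if FOUND is at least MINIMUM.
# Relies on 'sort -V' (version sort); if MINIMUM sorts first, FOUND is new enough.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# check_tool NAME FOUND MINIMUM: report whether a detected version is sufficient.
check_tool() {
    local name=$1 found=$2 minimum=$3
    if [ -z "$found" ]; then
        echo "$name: not found"
    elif version_ge "$found" "$minimum"; then
        echo "$name $found: OK (>= $minimum)"
    else
        echo "$name $found: too old (need >= $minimum)"
    fi
}

# Assumption: 'nextflow -v' prints 'nextflow version X.Y.Z.build'.
check_tool Nextflow "$(nextflow -v 2>/dev/null | awk '{print $3}')" 23.10.0
check_tool Ruby "$(ruby -e 'print RUBY_VERSION' 2>/dev/null)" 3.2.2
# BASH_VERSION looks like '5.1.16(1)-release'; keep only the numeric part.
check_tool Bash "${BASH_VERSION%%(*}" 4.2.46
```

A tool that is missing from the PATH is reported as "not found" rather than stopping the script, so the remaining checks still run.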
RatesTools requires the following external dependencies. See the documentation for these programs for their installation requirements. RatesTools requires the Genome Analysis Toolkit (GATK) [3] v. 3.8-1 or v. >= 4.4.0.0 and Java v. 1.8 (GATK3) or v. 17 (GATK4). Currently, RatesTools is not compatible with other versions of Java. Otherwise, listed versions are those that have been tested and confirmed, but other versions may work. RatesTools can utilize Environment Modules modulefiles to simplify deployment on computing clusters and limit dependency conflicts (see the tutorial).
RatesTools requires the following R packages installed in your R environment:
To assist installation and execution of the Java dependencies, we provide built-in options to install GATK and Picard through Conda. See the tutorial for details.
Assisted configuration of the RatesTools pipeline can be accomplished using the configure.sh bash script included with this repository. The script copies the nextflow.config included with this repository and modifies the copy for the target system. The configure.sh script detects software installed on the local system and prompts the user to provide modulefiles, paths to undetected files, and program options. The configuration file can also be manually edited using a text editor. However, please note that the configure.sh script requires an unmodified nextflow.config file to work.
NB: The most straightforward method to obtain the configure.sh and nextflow.config files is to clone this repository and move the files to a desired location:
Clone the repository: git clone https://github.com/campanam/RatesTools
Move the files: mv RatesTools/*config* /some/path/
Change to the specified directory: cd /some/path
Execute the script: bash configure.sh
To specify sample and library information to RatesTools, provide a CSV with the following header and information:
Sample,Library,Read1,Read2
\<samp1>,\<lib1>,\<lib1.R1.fq.gz>,\<lib1.R2.fq.gz>
\<samp2>,\<lib2>,\<lib2.R1.fq.gz>,\<lib2.R2.fq.gz>
\<samp2>,\<lib3>,\<lib3.R1.fq.gz>,\<lib3.R2.fq.gz>
...
Sample designates the unique sample name. Library is the unique library name (multiple libraries can correspond to the same sample). Read1 and Read2 are the forward and reverse read files (FASTQ format) respectively.
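As an illustration of the format described above, the sketch below writes a small sample sheet and checks it for the expected header, four fields per row, and unique library names. The function validate_samples and all sample, library and file names are hypothetical and are not part of RatesTools; the check also assumes Unix line endings and no quoted commas.

```shell
#!/usr/bin/env bash
# Hypothetical checker (not shipped with RatesTools).
# validate_samples FILE: exit 0 if the sheet is consistent, 1 otherwise.
validate_samples() {
    awk -F',' '
        NR == 1 {
            # The first line must be the exact header.
            if ($0 != "Sample,Library,Read1,Read2") { print "bad header: " $0; bad = 1 }
            next
        }
        NF != 4 { print "line " NR ": expected 4 fields, found " NF; bad = 1; next }
        # Library names must be unique; sample names may repeat.
        seen[$2]++ { print "line " NR ": duplicate library " $2; bad = 1 }
        END { exit bad + 0 }
    ' "$1"
}

# Example sheet with invented names: the sample "dam" has two libraries.
sheet=$(mktemp)
cat > "$sheet" <<'EOF'
Sample,Library,Read1,Read2
sire,sire_lib1,sire_lib1.R1.fq.gz,sire_lib1.R2.fq.gz
dam,dam_lib1,dam_lib1.R1.fq.gz,dam_lib1.R2.fq.gz
dam,dam_lib2,dam_lib2.R1.fq.gz,dam_lib2.R2.fq.gz
EOF

if validate_samples "$sheet"; then
    echo "sample sheet looks consistent"
fi
```

The check does not confirm that the read files exist, since their paths depend on where the pipeline is run.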
RatesTools assumes bidirectional sequencing for each library, but allows for multiple sequenced libraries per individual. RatesTools will merge the libraries by sample name assuming the libraries are independent. If an individual library has been sequenced multiple times, concatenate the reads from that library (forward reads with forward reads, reverse reads with reverse reads) and treat the result as a single bidirectionally sequenced library.
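One way to combine repeated sequencing runs of the same library is sketched below. It relies on the fact that gzip-compressed files can be concatenated directly without decompressing them. The helper merge_runs and all file names are hypothetical, and the tiny FASTQ files are generated only to make the example self-contained.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with RatesTools).
# merge_runs OUTPUT INPUT...: concatenate gzip-compressed FASTQ files into OUTPUT.
merge_runs() {
    local out=$1
    shift
    cat "$@" > "$out"
}

# Build two one-read FASTQ files standing in for two runs of library lib1.
workdir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n' | gzip > "$workdir/lib1_run1.R1.fq.gz"
printf '@read2\nTTGA\n+\nIIII\n' | gzip > "$workdir/lib1_run2.R1.fq.gz"

# Merge the forward reads; the reverse reads would be merged the same way,
# listing the runs in the same order so that read pairs stay aligned.
merge_runs "$workdir/lib1.R1.fq.gz" \
    "$workdir/lib1_run1.R1.fq.gz" "$workdir/lib1_run2.R1.fq.gz"

# The merged file decompresses to both reads (4 lines per read).
gzip -dc "$workdir/lib1.R1.fq.gz" | wc -l
```

The merged Read1 and Read2 files are then listed as a single library row in the sample CSV.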
Given the wide variety of computing architectures and operating systems, we cannot provide specific optimized configurations for your computing system. The nextflow.config file includes an example of a 'standard' configuration profile for a local installation using modulefiles and a 'conda' configuration that installs all dependencies using Conda. Example configuration profiles for the analyses described in Armstrong & Campana 2023 are provided in the Figshare repository. Please consult your computing staff to optimize the profile settings for your hardware. We recommend storing configuration profiles in a system-wide central location for access by all users.
Enter nextflow run campanam/RatesTools -r <version> -c <config_file> to run the pipeline, where <version> is the installed RatesTools release and <config_file> is your configuration file. Append -resume to resume a previous run or -bg to run RatesTools in the background. If you developed platform-specific configuration profiles, you can select one using the -profile <PROFILE> option. See the Nextflow documentation for details. Final data are written to the specified output directory and its subdirectories.
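The options above can be combined in a single command. The sketch below only assembles and prints the command so that it can be inspected before launching. The function build_run_cmd is hypothetical, and the release tag, configuration file name and profile in the example call are placeholders to be replaced with your own values.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not shipped with RatesTools).
# build_run_cmd VERSION CONFIG [PROFILE]: print the corresponding run command.
build_run_cmd() {
    local version=$1 config=$2 profile=${3:-}
    local cmd=(nextflow run campanam/RatesTools -r "$version" -c "$config")
    # Add -profile only when a profile name was supplied.
    if [ -n "$profile" ]; then
        cmd+=(-profile "$profile")
    fi
    # -resume lets an interrupted run pick up from its cached results.
    cmd+=(-resume)
    echo "${cmd[*]}"
}

# Placeholder values: substitute your installed release and configuration file.
build_run_cmd v1.0.0 ratestools.config conda
```

Once the printed command looks right, it can be pasted into the shell or run from a job script on a cluster.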
Image Credit: Conor Mallon. 2014. Smithsonian's National Zoo & Conservation Biology Institute. Smithsonian Institution. https://nationalzoo.si.edu/object/nzp_NZP-20141024-032CPM.