CFSAN-Biostatistics / CSP2

A Nextflow pipeline for fast and accurate SNP distance estimation from WGS read data or genome assemblies
Other
4 stars 2 forks source link

drawing

CFSAN SNP Pipeline 2 (CSP2)

Dr. Robert Literman

Office of Analytics and Outreach
Center for Food Safety and Applied Nutrition
US Food and Drug Administration

Current Release: v.0.9.5 (Oct-17-2024)
Last Push: Oct-17-2024

Important Note: CSP2 is currently under development, and has not been validated for non-research purposes. Current workflows and data processing parameters may change prior to full release version.

CSP2 is a Nextflow pipeline for rapid, accurate SNP distance estimation from assembly data

CSP2 runs on Unix, with the handful of dependencies listed in the Software Dependencies section. CSP2 was developed to (1) improve on the speed of the CFSAN SNP Pipeline (CSP), (2) to reduce computational burden when analyzing larger isolate clusters, and (3) to remove the dependency for raw Illumina data. CSP2 relies on the accurate and rapid alignment of genome assemblies provided by MUmmer, which typically complete within seconds. This provides significant reductions in runtime compared to methods that rely on read mapping. The use of assemblies in place of sequencing data also means that:

CSP2 runs are managed via Nextflow, providing the user with an array of customizations while also facilitating module development and additions in future releases.

Important Note: The software continues to be focused on the analysis of groups of bacterial genomes with limited evolutionary differences (<1000 SNPs). Testing is underway to determine how the underlying cluster diversity impacts distances estimates.

CSP2 has two main run modes (See Examples):

1) "Screening Mode" (--runmode screen): Used to determine whether query isolates are close to a set of reference isolates (e.g., lab control strains, strains related to an outbreak, etc.)

Given one or more user-provided reference isolates (--ref_reads; --ref_fasta), get alignment statistics and SNP distances between all reference and query isolates (--reads; --fasta)

2) "SNP Pipeline Mode" (--runmode snp): Used to generate pairwise distances and alignments for a set of query isolates

Generate pairwise SNP distances and alignments for 2+ isolates (--reads; --fasta) based on comparisons to:

All CSP2 sequence comparisons happen at the assembly level, but if reads are provided CSP2 will perform a genome assembly using SKESA. In either case, CSP2 then calls MUMmer for alignment. If a sufficient portion of the reference genome is aligned (--min_cov), that data is passed through a set of filters that largely mimic those from the CFSAN SNP Pipeline, including the automated removal of:

This final dataset is summarized into a .snpdiffs file, which contains:

  1. A one-line header with alignment statistics
  2. A BED file of contig mappings that pass QC
  3. Information about SNPs (if present)

To avoid unnecessary realignment, once a .snpdiffs file is generated under a particular set of QC parameters (which is hardcoded into the .snpdiffs file as the "QC_String") these files can be used in other CSP2 runs via the --snpdiffs argument (if using the same QC parameters).


Software Dependencies

The following software are required to run CSP2. Software version used during CSP2 development noted in parentheses.


Installing CSP2

CSP2 can be installed by cloning the GitHub repo and configuring the nextflow.config and profiles.config to suit your needs

git clone https://github.com/CFSAN-Biostatistics/CSP2.git

Tips for configuring CSP2

CSP2 options can be specified on the command line, or through the Nextflow configuration files detailed in the next section. Feel free to skip this section if you're familiar with editing Nextflow configuration files.

There are two main configuration files associated with CSP2:

profiles {
    standard {
        process.executor = 'local'
        params.cores = 1
        params.python_module = ""
        params.mummer_module = ""
        params.skesa_module = ""
        params.bedtools_module = ""
        params.mash_module = ""
        params.bbtools_module = ""
    }
    slurmHPC {
        process.executor = 'slurm'
        params.cores = 20
        params.python_module = "/nfs/software/modules/python/3.8.1"
        params.mummer_module = "/nfs/software/modules/mummer/4.0.0"
        params.skesa_module = "/nfs/software/modules/skesa/2.5.0"
        params.bedtools_module = "/nfs/sw/Modules/bedtools"
        params.bbtools_module = "/nfs/software/modules/bbtools/38.94"
        params.mash_module = "/nfs/software/modules/mash/2.3"
        params.trim_name = "_contigs_skesa"
    }
}

Options with defaults include:

Parameter Description Default Value
--outroot Base directory to create output folder $CWD
--out Name of the output folder to create (must not exist) CSP2_(java.util.Date().getTime())
--forward Full file extension for forward/left reads of query _1.fastq.gz
--reverse Full file extension for reverse/right reads of reference _2.fastq.gz
--ref_forward Full file extension for forward/left reads of reference _1.fastq.gz
--ref_reverse Full file extension for reverse/right reads of reference _2.fastq.gz
--readext Extension for single-end reads for query fastq.gz
--ref_readext Extension for single-end reads for reference fastq.gz
--min_cov Do not analyze queries that cover less than % of the reference assembly 85
--min_iden Only consider alignments where the percent identity is at least % 99
--min_len Only consider alignments that span at least bp 500
--dwin A comma-separated list of windows to check SNP densities 1000,125,15
--wsnps The maximum number of SNPs allowed in the corresponding window from --dwin 3,2,1
--query_edge Only consider SNPs that occur within bp of the end of a query contig 150
--ref_edge Only consider SNPs that occur within bp of the end of a reference contig 150
--n_ref The number of reference isolates to consider (only applied if CSP2 is choosing references) 1
--rescue If running in SNP Pipeline mode, rescue edge-filtered SNPs that are not edge filtered in 1+ query Unset (Do not rescue)
Options without defaults include: Parameter Description
--reads Location of query read data (Path to directory, or path to file with multiple directories)
--fasta Location of query assembly data (Path to directory containing FASTAs, path to FASTA, file with list of multiple FASTA paths)
--ref_reads Location of reference read data (Path to directory, or path to file with multiple directories)
--ref_fasta Location of reference assembly data (Path to directory containing FASTAs, path to FASTA, path to multiple FASTAs)
--python_module Name of Python module if 'module load python' statement is required.
--mummer_module Name of MUmmer module if 'module load mummer' statement is required.
--skesa_module Name of SKESA module if 'module load skesa' statement is required.
--bedtools_module Name of BEDTools module if 'module load bedtools' statement is required.
--mash_module Name of MASH module if 'module load mash' statement is required.
--trim_name A string in assembly file names that you want to remove from sample IDs (e.g., _contigs_skesa)

Examples

The CSP2 Test Data repo contains small test datasets to ensure things are running as expected. Here are a few examples of how you can use CSP2 in screening mode or in SNP pipeline mode.

Screening Mode (Example)

Situation: As part of a long-term microbiology experiment, you perform weekly WGS sequencing on isolates as they evolve under different selective conditions. As results, your DNA sequencing facility returns raw WGS reads and assembled genomes.

During Week 42, analyses start detecting high numbers of mutations, and assembly-based results are not concordant with read-based results. You suspect that either the reads or assembly you were given may be from their lab control strain, but you want to check first.

The data:

To run this example locally, where Nextflow, SKESA, MUMmer, Python, and BEDTools are installed on your path, run:

nextflow run CSP2.nf --out Test_Output/Contamination_Screen --runmode screen --ref_fasta assets/Screen/Assembly/Lab_Control.fasta --fasta assets/Screen/Assembly/Week_42_Assembly.fasta --reads assets/Screen/Reads --forward _1.fq.gz --reverse _2.fq.gz --readext fq.gz
nextflow run CSP2.nf                                    // Run CSP2  
--out Test_Output/Contamination_Screen                  // Save results to ./Test_Output/Contamination_Screen  
--runmode screen                                        // Compare each query to the reference
--ref_fasta assets/Screen/Assembly/Lab_Control.fasta    // Compare all queries to this reference  
--fasta assets/Screen/Assembly/Week_42_Assembly.fasta   // Include this assembly as a query
--reads assets/Screen/Reads                             // Include any read datasets from this directory as queries
--forward _1.fq.gz                                      // Forward reads don't match the default '_1.fastq.gz'
--reverse _2.fq.gz                                      // Reverse reads don't match the default '_2.fastq.gz'
--readext fq.gz                                         // Reads don't match the default 'fastq.gz'

If you're running on an HPC and you need to load modules, you could include your custom profile:

# Load Nextflow module if necessary
module load nextflow

nextflow run CSP2.nf -profile slurmHPC --out Test_Output/Contamination_Screen --runmode screen --ref_fasta assets/Screen/Assembly/Lab_Control.fasta --fasta assets/Screen/Assembly/Week_42_Assembly.fasta --reads assets/Screen/Reads --forward _1.fq.gz --reverse _2.fq.gz --readext fq.gz
nextflow run CSP2.nf                                    // Run CSP2  
-profile slurmHPC                                       // Choose run profile (**note single hyphen**)
--out Test_Output/Contamination_Screen                  // Save results to ./Test_Output/Contamination_Screen  
--runmode screen                                        // Compare each query to the reference
--ref_fasta assets/Screen/Assembly/Lab_Control.fasta    // Compare all queries to this reference  
--fasta assets/Screen/Assembly/Week_42_Assembly.fasta   // Include this assembly as a query
--reads assets/Screen/Reads                             // Include any read datasets from this directory as queries
--forward _1.fq.gz                                      // Forward reads don't match the default '_1.fastq.gz'
--reverse _2.fq.gz                                      // Reverse reads don't match the default '_2.fastq.gz'
--readext fq.gz                                         // Reads don't match the default 'fastq.gz'

Output

If all went well, you should see something like this:

Nextflow print output for screening run

From top to bottom, we can see that CSP2:

Let's take a look to see what was generated:

ls Test_Output/Contamination_Screen

Assemblies/
Isolate_Data.tsv
logs/
MUMmer_Output/
Raw_MUMmer_Summary.tsv
Screening_Results.tsv
snpdiffs/

-Note: This output is available to inspect in assets/Screen/Output

Directories

Files

SNP Pipeline Mode (Example)

Situation: You dug 10 soil samples and isolated a single microbe from each. You want to check the relatedness of the cultured isolates.

The data:

In this case, we want to use --runmode snp, because we want to calculate the pairwise distances between all isolates and generate alignments. We will let RefChooser choose the best reference genome.

nextflow run CSP2.nf --out Test_Output/Soil_Analysis --runmode snp --fasta assets/SNP/
nextflow run CSP2.nf              // Run CSP2  
--out Test_Output/Soil_Analysis   // Save results to ./Test_Output/Soil_Analysis  
--runmode snp                     // Compare all queries to each other
--fasta assets/SNP                // Gather query assemblies from this directory (you can also point to a text file containing multiple FASTA paths)

Output

If all went well, you should see something like this:

Nextflow print output for SNP Pipeline run

From top to bottom, we can see that CSP2:

Let's take a look to see what was generated:

ls Test_Output/Soil_Analysis

Isolate_Data.tsv
logs/
MUMmer_Output/
Raw_MUMmer_Summary.tsv
SNP_Analysis/
snpdiffs/

-Note: This output is available to inspect in assets/SNP/Output

Directories

Files

To see if the reference genome has an impact on SNP distance estimation, you can test one or more via --ref_fasta / --ref_reads or have CSP2 choose --n_ref isolates

nextflow run CSP2.nf --out Test_Output/Soil_Analysis --runmode snp --fasta assets/SNP/ --n_ref 3
nextflow run CSP2.nf              // Run CSP2  
--out Test_Output/Soil_Analysis   // Save results to ./Test_Output/Soil_Analysis  
--runmode snp                     // Compare all queries to each other
--fasta assets/SNP                // Gather query assemblies from this directory
--n_ref 3                         // Choose the top 3 references and run 3 sepearate analyses