PacificBiosciences / HiFi-16S-workflow

Nextflow pipeline to analyze PacBio HiFi full-length 16S data
BSD 3-Clause Clear License

HiFi Full-length 16S analysis with pb-16S-nf

The pipeline is currently under active development; we welcome your feedback to help improve it.

Workflow overview and output

This Nextflow pipeline is designed to process PacBio HiFi full-length 16S data into high-quality amplicon sequence variants (ASVs) using QIIME 2 and DADA2. The pipeline provides a set of visualizations using the QIIME 2 framework for interactive plotting, and it generates an HTML report summarizing the important statistics and top taxonomies. The outputs and stages of this pipeline are documented here.

We provide a sample report generated using this pipeline based on 8 replicates from the ATCC MSA-1003 mock community sequenced on a Sequel II system (Link). Right-click this link, save it on your computer, then double-click to open the sample report. All other important outputs from the pipeline are available in the examples folder when you clone this repository.

Installation and usage

This pipeline requires Nextflow version 22 or later. If Singularity or Docker is available on your cluster, we recommend running the pipeline in containers by specifying -profile singularity or -profile docker. Singularity will pull the Docker images into the folder $HOME/nf_conda/singularity.

By default, all software dependencies are managed using Conda. Nextflow will use Conda to build the required environments, so there is no need to build them manually. You can install Nextflow by following the steps here (documentation) or by using Conda itself:

conda install -c bioconda nextflow

# If this is your first time using conda
conda init

After installing Nextflow, clone the repository and download the databases using the following commands. To update the pipeline in the future, run git pull inside the cloned folder.

git clone https://github.com/PacificBiosciences/pb-16S-nf.git
cd pb-16S-nf
nextflow run main.nf --download_db
# With Docker (if you use Docker, add -profile docker to all Nextflow commands)
nextflow run main.nf --download_db -profile docker

After downloading the databases, run the following command in the cloned folder to see the options for the pipeline:

nextflow run main.nf --help

  Usage:
  This pipeline takes in the standard sample manifest and metadata file used in
  QIIME 2 and produces QC summary, taxonomy classification results and visualization.

  For the samples TSV, two columns named "sample-id" and "absolute-filepath" are
  required. For the metadata TSV file, at least two columns named "sample_name" and
  "condition" are required to separate samples into different groups.

  nextflow run main.nf --input samples.tsv --metadata metadata.tsv \\
    --dada2_cpu 8 --vsearch_cpu 8

  By default, sequences are first trimmed with cutadapt. If adapters are already trimmed, you can skip 
  cutadapt by specifying "--skip_primer_trim".

  Other important options:
  --front_p    Forward primer sequence. Default to F27. (default: AGRGTTYGATYMTGGCTCAG)
  --adapter_p    Reverse primer sequence. Default to R1492. (default: AAGTCGTAACAAGGTARCY)
  --filterQ    Filter input reads above this Q value (default: 20).
  --downsample    Limit reads to a maximum of N reads if there are more than N reads (default: off)
  --max_ee    DADA2 max_EE parameter. Reads with number of expected errors higher than
              this value will be discarded (default: 2)
  --minQ    DADA2 minQ parameter. Reads with any base lower than this score 
            will be removed (default: 0)
  --min_len    Minimum length of sequences to keep (default: 1000)
  --max_len    Maximum length of sequences to keep (default: 1600)
  --pooling_method    QIIME 2 pooling method for DADA2 denoise; see the QIIME 2 
                      documentation for more details (default: "pseudo", alternative: "independent") 
  --maxreject    max-reject parameter for VSEARCH taxonomy classification method in QIIME 2
                 (default: 100)
  --maxaccept    max-accept parameter for VSEARCH taxonomy classification method in QIIME 2
                 (default: 100)
  --min_asv_totalfreq    Total frequency of any ASV must be above this threshold
                         across all samples to be retained. Set this to 0 to disable filtering
                         (default 5)
  --min_asv_sample    ASV must exist in at least min_asv_sample to be retained. 
                      Set this to 0 to disable. (default 1)
  --vsearch_identity    Minimum identity to be considered as hit (default 0.97)
  --rarefaction_depth    Rarefaction curve "max-depth" parameter. By default the pipeline
                         automatically selects a cut-off above the minimum of the denoised 
                         reads for >80% of the samples. This cut-off is stored in a file called
                         "rarefaction_depth_suggested.txt" in the results folder
                         (default: null)
  --dada2_cpu    Number of threads for DADA2 denoising (default: 8)
  --vsearch_cpu    Number of threads for VSEARCH taxonomy classification (default: 8)
  --cutadapt_cpu    Number of threads for primer removal using cutadapt (default: 16)
  --outdir    Output directory name (default: "results")
  --vsearch_db  Location of VSEARCH database (e.g. silva-138-99-seqs.qza can be
                downloaded from QIIME database)
  --vsearch_tax    Location of VSEARCH database taxonomy (e.g. silva-138-99-tax.qza can be
                   downloaded from QIIME database)
  --silva_db   Location of Silva 138 database for taxonomy classification 
  --gtdb_db    Location of GTDB r202 for taxonomy classification
  --refseq_db    Location of RefSeq+RDP database for taxonomy classification
  --skip_primer_trim    Skip all primers trimming (switch off cutadapt and DADA2 primers
                        removal) (default: trim with cutadapt)
  --skip_nb    Skip Naive-Bayes classification (only uses VSEARCH) (default: false)
  --colorby    Columns in metadata TSV file to use for coloring the MDS plot
               in HTML report (default: condition)
  --run_picrust2    Run PICRUSt2 pipeline. Note that pathway inference with 16S using PICRUSt2
                    has not been tested systematically (default: false)
  --download_db    Download databases needed for taxonomy classification only. Will not
                   run the pipeline. Databases will be downloaded to a folder "databases"
                   in the Nextflow pipeline directory.
  --publish_dir_mode    Output mode based on Nextflow "publishDir" directive. Specify "copy"
                        if hard copies are required. (default: symlink)
  --version    Output version
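
For reference, minimal samples and metadata TSV files (illustrative contents only; the sample names and paths below are placeholders) might look like this:

# samples.tsv (tab-separated)
sample-id       absolute-filepath
sample_1        /path/to/sample_1.hifi_reads.fastq.gz
sample_2        /path/to/sample_2.hifi_reads.fastq.gz

# metadata.tsv (tab-separated)
sample_name     condition
sample_1        control
sample_2        treatment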

To test the pipeline, run the example below. Note that the database paths should be changed to their respective locations on your server if they differ (see the parameters above). If you followed the download command above, the databases will be in a databases folder inside the pb-16S-nf folder and you do not need to specify the paths. The Conda environment is created by default in the $HOME/nf_conda folder unless changed in the nextflow.config file. Once the Conda environment is created, it will be reused by future runs.

# Create sample TSV for testing
echo -e "sample-id\tabsolute-filepath\ntest_data\t$(readlink -f test_data/test_1000_reads.fastq.gz)" > test_data/test_sample.tsv

nextflow run main.nf --input test_data/test_sample.tsv \
    --metadata test_data/test_metadata.tsv -profile conda \
    --outdir results

# To test using Singularity or docker (change singularity to docker)
nextflow run main.nf --input test_data/test_sample.tsv \
    --metadata test_data/test_metadata.tsv -profile singularity \
    --outdir results

To run this pipeline on your data, create the sample TSV and metadata TSV following the test data format (for the metadata, if you do not have any grouping, put any word in the "condition" column) and run the workflow as shown above. Remember to specify the --outdir directory to avoid overwriting existing results.
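
As a sketch only (the file names, CPU counts, profile and output directory below are placeholders to be adapted to your data and system), a typical run might look like:

nextflow run main.nf --input my_samples.tsv \
    --metadata my_metadata.tsv -profile singularity \
    --dada2_cpu 32 --vsearch_cpu 32 --cutadapt_cpu 16 \
    --outdir my_results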

HPC and job scheduler usage

The pipeline uses "Local" by default to run jobs on HPC. This can be changed in the nextflow.config file under executor to use HPC schedulers such as Slurm, SGE and so on using Nextflow's native support. For example, to use Slurm, open the nextflow.config file and change executor = 'Local' to executor = 'slurm' and specify the partition to be used using queue = PARTITION. See the Nextflow documentation for the available executors and parameters. CPUs for VSEARCH, DADA2 and cutadapt can be specified as command-line parameters. For all the other processes, they use any of the default labels in nextflow.config and can be changed according to your needs.

Note that the nextflow.config file, by default, will generate the workflow DAG and resource usage reports to help in benchmarking the resources required. See the report_results folder created after the pipeline finishes for the DAG and resource reports.

Speeding up DADA2 denoise

By default, the pipeline pools all samples into one single qza file for DADA2 denoise (using the default pseudo-pooling approach by DADA2). This is designed to maximize the sensitivity to low frequency ASVs. For example, an ASV with just 2 reads in sample 1 may be discarded, but if the same exact ASV is seen in another sample, this gives the algorithm higher confidence that it is real. However, when the samples are highly diverse (such as with environmental samples), this can become very slow.

If a (possibly) minor loss in sensitivity is acceptable, the pipeline allows you to "split" the input samples into different groups that will be denoised separately. This is done using a pool column in the metadata.tsv input. Example:

sample_name     condition       pool
bc1005_bc1056   RepA    RepA
bc1005_bc1057   RepA    RepA
bc1005_bc1062   RepA    RepA
bc1005_bc1075   RepA    RepA
bc1005_bc1100   RepB    RepB
bc1007_bc1075   RepB    RepB
bc1020_bc1059   RepB    RepB
bc1024_bc1111   RepB    RepB

The TSV above will split the 8 samples into two groups (RepA and RepB) and denoise them separately. After denoising, all denoised ASVs and statistics are merged again for downstream filtering and processing. This allows you to maximize sensitivity within a group of samples and speed up the pipeline considerably. On the other hand, if each sample has been sequenced deeply, you can denoise each sample individually by setting a unique group for each sample (e.g. replicating the sample_name column as the pool column) to process the samples quickly.
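
If you want per-sample denoising, one possible way (a hypothetical helper, not part of the pipeline) to generate such a metadata file is to copy the sample_name column into a pool column:

# Append a "pool" column that simply repeats sample_name (per-sample denoising);
# "metadata_per_sample.tsv" is an arbitrary output name
awk 'BEGIN{FS=OFS="\t"} NR==1{print $0, "pool"; next} {print $0, $1}' metadata.tsv > metadata_per_sample.tsv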

Run time and compute requirements

We recommend at least 32 CPUs for most sample types. The run time highly depends on the complexity of the samples in addition to the total number of reads. Shown here are examples of run times for data tested with this pipeline using 32 CPUs:

Sample type       No. of samples   Total FL reads   Total ASVs   Run time   Max memory
Oral              891              8.3M             5417         2.5 h      34 GB
Gut               192              2.2M             1593         2 h        30 GB
Gut               192              2.2M             10917        5.5 h      30 GB
Gut               192              16.7M            17293        13 h       87 GB
Wastewater        33               2.14M            11462        12 h       47 GB
Mock community    264              12.8M            84           4 h        44 GB

Frequently asked questions (FAQ)

References

QIIME 2

DISCLAIMER

THIS WEBSITE AND CONTENT AND ALL SITE-RELATED SERVICES, INCLUDING ANY DATA, ARE PROVIDED "AS IS," WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF THIS SITE, ALL SITE-RELATED SERVICES, AND ANY THIRD PARTY WEBSITES OR APPLICATIONS. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A WARRANTY OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACIFIC BIOSCIENCES.