asadprodhan / sangerFlow

A bioinformatics pipeline that automates the Sanger amplicon sequencing data analysis of thousands of samples in parallel.
GNU General Public License v3.0

sangerFlow, a Sanger sequencing-based bioinformatics pipeline for pests and pathogens identification


AUTHOR: Dr Asad Prodhan https://asadprodhan.github.io/



About sangerFlow

DNA barcoding is a powerful tool for identifying species. It involves i) DNA or RNA extraction from the specimen, ii) a Polymerase Chain Reaction (PCR) targeting a DNA barcode, and iii) high-quality sequencing, such as Sanger sequencing, of the PCR product. The sequencing data come as forward and reverse reads that require manual quality control, alignment, and sequence similarity analysis using web-based Blastn to identify the species. This manual analysis can become a limiting factor in biosecurity surveillance or diagnostic settings that require high-throughput analysis. sangerFlow addresses this challenge by automating the entire analysis (Fig. 1).



Figure 1: sangerFlow automates pest and pathogen identification using PCR Sanger sequencing data.


sangerFlow automatically analyses the forward and reverse reads from PCR Sanger sequencing data. The pipeline takes fasta files as input and returns Blastn hits, i.e., species identifications, for each specimen (Fig. 2). Because every step is automated, the analysis scales to large numbers of samples. The pipeline is written with the workflow manager Nextflow and runs its tools in Singularity containers. It therefore requires no software installation other than Nextflow and Singularity, no software subscription, and no programming expertise from the end user. These features make the pipeline user-friendly and well suited to large-scale Sanger amplicon sequencing data analysis.



Figure 2: sangerFlow pipeline.



How to use sangerFlow

Follow the steps below to use sangerFlow.

Step 1: Install the required software

conda create -n sangerFlow
conda activate sangerFlow
conda install -c bioconda nextflow
nextflow -h

If you see the Nextflow options as in Fig. 3, Nextflow has been installed.


Figure 3: Nextflow options.

conda install -c conda-forge singularity
singularity -h

If you see the Singularity options as in Fig. 4, Singularity has been installed.


Figure 4: Singularity options.

Step 2: Prepare a sample description file

Fig. 5 shows an example of a sample description file. It is a tab-separated values (tsv) file.


Figure 5: Sample description file.
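
The exact columns of the sample description file are shown only in Fig. 5, so the following is a generic sketch rather than a sangerFlow-specific validator. It checks that a file is consistently tab-separated, which is a common source of errors when the sheet is edited in a spreadsheet program. The function name `check_tsv` and the file name `samples.tsv` are hypothetical.

```shell
# check_tsv FILE: succeed only if every non-empty line has the same number
# of tab-separated fields as the first non-empty line (and at least two fields).
check_tsv() {
    awk -F'\t' '
        NF == 0 { next }                  # skip blank lines
        !seen   { n = NF; seen = 1 }      # remember the field count of the first line
        NF != n { printf "line %d: expected %d fields, found %d\n", NR, n, NF; bad = 1 }
        END     { if (!seen || n < 2 || bad) exit 1 }
    ' "$1"
}

# Example (samples.tsv is a placeholder for your own file):
# check_tsv samples.tsv && echo "sample description file looks tab-separated"
```

A line that was saved with spaces instead of tabs is reported with its line number, so it can be fixed before the pipeline is started.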

Step 3: Download a blastn database from NCBI
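
One possible way to do this, sketched under assumptions: NCBI BLAST+ ships a helper script, `update_blastdb.pl`, that downloads preformatted databases. The sketch assumes BLAST+ is installed on the host (for example via `conda install -c bioconda blast`). The database name `nt`, the target directory, and the function name `download_blast_db` are illustrative choices, not requirements of sangerFlow.

```shell
# download_blast_db NAME DIR: fetch the preformatted NCBI database NAME into DIR.
# Assumes update_blastdb.pl (part of NCBI BLAST+) is on PATH.
download_blast_db() {
    db_name=$1
    db_dir=$2
    mkdir -p "$db_dir" || return 1
    # update_blastdb.pl writes into the current directory, so run it in a subshell
    ( cd "$db_dir" && update_blastdb.pl --decompress "$db_name" )
}

# Example (a very large download; "nt" and the directory are placeholders):
# download_blast_db nt "$HOME/blastdb"
```

The directory passed here is the location you would later point `--db` at; check the sangerFlow repository for the exact form of the path it expects.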

Step 4: Run sangerFlow

dos2unix *
chmod +x *
nextflow run asadprodhan/sangerFlow -r VERSION-NUMBER --db="/path/to/your/blastn_database"

Get the VERSION-NUMBER from the sangerFlow GitHub home page. Its location is marked by the red box in Fig. 6.


Figure 6: sangerFlow version number location.


You can adjust the blastn analysis with the following flags:

--evalue=XX. Default is 0.1

--cpus=XX. Default is 18

--topHits=XX. Default is 5



For example

nextflow run asadprodhan/sangerFlow -r VERSION-NUMBER --evalue=0.05 --topHits=1 --cpus=16 --db="/path/to/your/blastn_database"
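
Since `--cpus` defaults to 18, it may be worth checking how many cores your machine actually has before choosing a value. The helper below is a minimal sketch: `pick_cpus` is a hypothetical name, and whether sangerFlow fails or merely slows down when `--cpus` exceeds the available cores is not stated here, so treat this as a precaution.

```shell
# pick_cpus REQUESTED: print the smaller of REQUESTED and the number of online cores.
pick_cpus() {
    requested=$1
    available=$(getconf _NPROCESSORS_ONLN)
    if [ "$requested" -gt "$available" ]; then
        echo "$available"
    else
        echo "$requested"
    fi
}

# Example, reusing the flags shown above:
# nextflow run asadprodhan/sangerFlow -r VERSION-NUMBER --cpus="$(pick_cpus 18)" --db="/path/to/your/blastn_database"
```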



Outputs

When the run completes successfully, there will be three new directories (results, temp, and work) in your working directory.


Results

This directory contains the blastn results, with one tsv file per sample. In addition, there is a master blastn result sheet named concatenatedHits_withHeaders.tsv. This file contains the user-defined number of top Blastn hits (set with --topHits) for all samples (Fig. 7).



Figure 7: sangerFlow master result sheet containing the user-defined number of top Blastn hits for all samples.
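
For a quick check of the master result sheet from the command line, a sketch like the following can help. It assumes that the file sits directly in the results directory and that it has exactly one header line (as its name suggests). The function name `count_hits` is hypothetical.

```shell
# count_hits FILE: print the number of non-empty data rows, excluding the first (header) line.
count_hits() {
    awk 'NR > 1 && NF { n++ } END { print n + 0 }' "$1"
}

# Example:
# head -n 5 results/concatenatedHits_withHeaders.tsv      # header plus the first few hits
# count_hits results/concatenatedHits_withHeaders.tsv     # total number of reported hits
```

With `--topHits=5`, the count should be at most five times the number of samples, which gives a simple sanity check on the run.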


Temp

This directory contains all the intermediate files, in case you need to inspect them.


Work

This directory contains one sub-directory per sample. The work directory is created by Nextflow by default. You can delete it to free up space on your computer.
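
A minimal sketch of that clean-up is shown below. The function name `clean_work` is hypothetical. Note that Nextflow uses the work directory for its `-resume` feature, so delete it only once you no longer need to resume the run.

```shell
# clean_work [RUN_DIR]: report the size of RUN_DIR/work, then delete it.
# RUN_DIR defaults to the current directory; results and temp are left untouched.
clean_work() {
    run_dir=${1:-.}
    if [ -d "$run_dir/work" ]; then
        du -sh "$run_dir/work"      # show how much space will be freed
        rm -rf "$run_dir/work"
    fi
}

# Example, run from the directory where sangerFlow was launched:
# clean_work
```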



The End