Hydro3639 / NanoPhase

Reference-quality genome reconstruction from complex metagenomes (or bacterial isolates) using only Nanopore long reads or both long and short reads (hybrid strategy)
MIT License
21 stars 1 forks source link

NanoPhase

Anaconda-Server Badge Anaconda-Server Badge Anaconda-Server Badge

nanophase is an easy-to-use pipeline to generate reference-quality MAGs using only Nanopore long reads (long-read-only strategy) or both Nanopore long and Illumina short reads (hybrid strategy) from complex metagenomes. Since nanophase v0.2.0, it also supports to generate reference-quality genomes from bacterial/archaeal isolates (long-read-only or hybrid strategy). If nanophase is interrupted, it will resume from the last completed stage.

Installation instructions

It is advised to first install conda (miniconda3 is suggested), then add required channels and install mamba following the instruction below:

conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
conda install mamba -n base -c conda-forge -y
mamba init && source ~/.bashrc ## only once, if mamba was still not in your local env, try opening a new terminal

1). Recommend: install nanophase including all nanophase dependancies via conda/mamba (mamba install is much faster than conda install). It should be finished in ~5 mins (depends on your local internet).

mamba create -n nanophase -c nanophase nanophase -y

or

mamba create -n nanophase python=3.8 -y && mamba activate nanophase && mamba install -c nanophase nanophase -y

GTDB and PLSDB download

Please note that GTDB/PLSDB database will not download automatically via the above installation, so the user can specify a friendly storage location because they take a lot of storage space: GTDB (~84G and PLSDB (~3.4G). Or if you have downloaded the above databases before, you may skip the first download step, just following the location setting step.

## download database: May skip if you have done before or GTDB and PLSDB have been downloaded in the server
wget https://data.gtdb.ecogenomic.org/releases/latest/auxillary_files/gtdbtk_data.tar.gz && tar xvzf gtdbtk_data.tar.gz
wget https://ccb-microbe.cs.uni-saarland.de/plsdb/plasmids/download/plsdb.fna.bz2 && bunzip2 plsdb.fna.bz2
conda activate nanophase

## setting location
echo "export GTDBTK_DATA_PATH=/path/to/release/package/" > $(dirname $(dirname `which nanophase`))/etc/conda/activate.d/np_db.sh
## Change /path/to/release/package/ to the real location where you stored the GTDB
echo "export PLSDB_PATH=/path/to/plsdb.fna" >> $(dirname $(dirname `which nanophase`))/etc/conda/activate.d/np_db.sh
## Change /path/to/plsdb.fna to the real location where you stored the PLSDB
conda deactivate && conda activate nanophase ## require re-activate nanophase

Usage

Please look at nanophase usage tutorial to verify the nanophase installation via an example dataset.

Briefly, you may check if all necessary packages have been installed sucessfully in the nanophase env using the following command.

conda activate nanophase ## if not in the nanophase env

nanophase check

Check software availability and locations
The following packages have been found
#package             location
flye                 /path/to/miniconda3/envs/nanophase/bin/flye
metabat2             /path/to/miniconda3/envs/nanophase/bin/metabat2
maxbin2              /path/to/miniconda3/envs/nanophase/bin/run_MaxBin.pl
SemiBin              /path/to/miniconda3/envs/nanophase/bin/SemiBin
metawrap             /path/to/miniconda3/envs/nanophase/bin/metawrap
checkm               /path/to/miniconda3/envs/nanophase/bin/checkm
racon                /path/to/miniconda3/envs/nanophase/bin/racon
medaka               /path/to/miniconda3/envs/nanophase/bin/medaka
polypolish           /path/to/miniconda3/envs/nanophase/bin/polypolish
POLCA                /path/to/miniconda3/envs/nanophase/bin/polca.sh
bwa                  /path/to/miniconda3/envs/nanophase/bin/bwa
seqtk                /path/to/miniconda3/envs/nanophase/bin/seqtk
minimap2             /path/to/miniconda3/envs/nanophase/bin/minimap2
BBMap                /path/to/miniconda3/envs/nanophase/bin/BBMap
parallel             /path/to/miniconda3/envs/nanophase/bin/parallel
perl                 /path/to/miniconda3/envs/nanophase/bin/perl
samtools             /path/to/miniconda3/envs/nanophase/bin/samtools
gtdbtk               /path/to/miniconda3/envs/nanophase/bin/gtdbtk
fastANI              /path/to/miniconda3/envs/nanophase/bin/fastANI
blastp               /path/to/miniconda3/envs/nanophase/bin/blastp
All required packages have been found in the environment. If the above certain packages integrated into nanophase were used in your investigation, please give them credit as well :)

If all pakcages have been installed sucessfully in the nanophase env, type nanophase -h for more usage information.

nanophase -h

nanophase v=0.2.3

Main modules
        check                   check if all packages have been installed
        meta                    genome assembly, binning, quality assessment and classification for metagenomic datasets
        isolate                 genome assembly, binning, quality assessment and classification for bacterial isolates

Test modules
        args                    Antibiotic Resistance Genes (ARGs) identification from reconstructed MAGs
        plasmid                 Plasmid identification from reconstructed MAGs

Other options
        -h | --help             show the help message
        -v | --version          show nanophase version

example usage:
        nanophase check                                                                                         ## package availability checking
        nanophase meta -l ont.fastq.gz -t 16 -o nanophase-out                                                   ## meta::long reads only
        nanophase meta -l ont.fastq.gz --hybrid -1 sr_1.fastq.gz -2 sr_2.fastq.gz -t 16 -o nanophase-out        ## meta::hybrid strategy
        nanophase isolate -l ont.fastq.gz -t 16 -o nanophase-out                                                ## isolate::long reads only
        nanophase isolate -l ont.fastq.gz --hybrid -1 sr_1.fastq.gz -2 sr_2.fastq.gz -t 16 -o nanophase-out     ## isolate::hybrid strategy
        nanophase args -i Final-bins -x fasta -o nanophase.ARGs.summary.txt                                     ## ARGs identification
        nanophase plasmid -i Final-bins -x fasta -o nanophase.pls.summary.txt                                   ## Plasmids identification

Each module is run separately. For example, to check the nanophase meta module, type nanophase meta -h for more usage information.

nanophase meta -h

nanophase v=0.2.3

arguments:
        --long_read_only        only Nanopore long reads were involved [default: on]
        --hybrid                both short and long reads were required [Optional]
        -l, --long              Nanopore reads: fasta/q file that basecalled by Guppy 5+ or using 20+ chemistry was recommended if only Nanopore reads were included [Mandatory]
        -1                      Illumina short reads: fasta/q paired-end #1 file [Optional]
        -2                      Illumina short reads: fasta/q paired-end #2 file [Optional]
        -m, --medaka_model      medaka model used for medaka polishing [default: r1041_e82_400bps_sup_g615]
        -e, --environment   Build-in model of SemiBin [default: wastewater]; detail see: SemiBin single_easy_bin -h
        -t, --threads           number of threads that used for assembly and polishing [default: 16]
        -o, --out               output directory [default: ./nanophase-out]
        -h, --help              print help information and exit
        -v, --version           show version number and exit

output sub-folders:
        01-LongAssemblies       sub-folder containing information of Nanopore long-read assemblies (assembler: metaFlye)
        02-LongBins             sub-folder containing the initial bins with relatively low-accuracy quality
        03-Polishing            sub-folder containing polished bins

example usage:
        nanophase meta -l ont.fastq.gz -t 16 -o nanophase-out                                                 ## long reads only
        nanophase meta -l ont.fastq.gz --hybrid -1 sr_1.fastq.gz -2 sr_2.fastq.gz -t 16 -o nanophase-out      ## hybrid strategy