RasmussenLab / phamb

Downstream processing of VAMB binning for Viral Elucidation
MIT License

phamb

Phage from metagenomic bins (phamb) is a discovery approach for isolating metagenome-derived viromes and high-quality viral genomes. phamb is now published in Nature Communications; have a look and let us know if you have any questions.

The repository contains the scripts and workflows used in our viral follow-up study on the binning tool VAMB, in which we benchmarked not only the quality and quantity of viral MAGs but also the viral overlap with metaviromes.

We have applied this approach to three datasets and recovered up to 6,077 high-quality genomes from 1,024 viral populations, 200% more than when evaluating single contigs only. As we have observed for bacterial bins, VAMB also achieves high intra-VAMB-cluster ANI (>97.5%) for viral bins; our best example is the accurate clustering of crAss-like bins found in the IBD Human Microbiome Project 2 dataset.

In our analysis, CheckV was important for assessing the actual gain from using viral MAGs relative to single-contig evaluation. Big kudos to Nayfach et al. for this great tool.

[Prerequisites & Installation]

To run the parallel contig annotation and the Random Forest model, you need snakemake and scikit-learn v. 1.0.2. The snakemake workflows come with conda environments, so dependencies and programs are installed automatically. phamb can now be installed via bioconda, thanks to @jayramr!

### New dependencies *Recommended*
conda install -c conda-forge mamba
mamba create -n phamb python=3.9
conda activate phamb 
mamba install -c conda-forge -c bioconda snakemake
mamba install -c conda-forge -c bioconda cython
mamba install -c conda-forge -c bioconda pygraphviz
mamba install -c conda-forge -c bioconda phamb
### Clone repository
git clone https://github.com/RasmussenLab/phamb.git
cd phamb

### Alternative to bioconda - Quick install with pip
pip install -e .

### Test installation
mkdir -p testout 
run_RF.py test/contigs.fna.gz test/clusters.tsv test testout

1. MAG annotation for isolating metagenome-derived viromes

Database and file requirements

VAMB clusters and concatenated assemblies.

contigs.fna.gz #Concatenated assembly 
vamb/clusters.tsv   #Clustered contigs based on the above contigs.fna.gz file 
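
To make the relationship between the two input files concrete, here is a minimal Python sketch that reads a VAMB-style cluster table into a mapping from bin to contigs. It assumes that clusters.tsv is tab-separated with the cluster name in the first column and the contig name in the second, and that any header row starts with `clustername`; check this against your own VAMB output. The function name `read_clusters` and the example rows are hypothetical.

```python
import csv
import io
from collections import defaultdict


def read_clusters(handle):
    """Map each cluster (bin) name to the list of contigs assigned to it."""
    clusters = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines, comment lines and an optional header row
        # (assumption: a header, if present, starts with "clustername").
        if len(row) < 2 or row[0].startswith("#") or row[0] == "clustername":
            continue
        cluster, contig = row[0], row[1]
        clusters[cluster].append(contig)
    return dict(clusters)


# Hypothetical three-row table, for illustration only
example = "bin1\tS1C1\nbin1\tS1C2\nbin2\tS2C7\n"
clusters = read_clusters(io.StringIO(example))
print(clusters)
```

For a real run, the same function could be given an open file handle, for example `open("vamb/clusters.tsv", newline="")`.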

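The annotation steps below also require the miComplete (Bact105) and VOG HMM databases, as well as DeepVirFinder.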

How to Run - Parallel annotation

Clone the phamb repository, copy out the mag_annotation workflow, and split the contigs (using the provided script) so that annotation can run in parallel. If you have relatively few contigs, or the patience to annotate all contigs in one batch, you can skip the Snakemake part.

mkdir -p projectdir 
cd projectdir 
git clone https://github.com/RasmussenLab/phamb.git
cp -r phamb/workflows/mag_annotation .
python split_contigs.py -c contigs.fna.gz 
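
To illustrate the idea behind splitting contigs for parallel annotation, the sketch below groups FASTA records by sample. It is not the code of split_contigs.py. It assumes that contig names follow the VAMB convention of `<sample>C<contig>` (for example `S1C12`); the function name `split_by_sample` and the `separator` parameter are hypothetical.

```python
from collections import defaultdict


def split_by_sample(fasta_lines, separator="C"):
    """Group FASTA lines by the sample prefix of each contig name."""
    per_sample = defaultdict(list)
    sample = None
    for line in fasta_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith(">"):
            # ">S1C12 description" -> contig "S1C12" -> sample "S1"
            fields = line[1:].split()
            contig = fields[0] if fields else ""
            sample = contig.split(separator, 1)[0]
        if sample is not None:
            per_sample[sample].append(line)
    return dict(per_sample)


# Hypothetical records from two samples, for illustration only
records = [">S1C1", "ACGT", ">S2C1", "GGCC", ">S1C2", "TTAA"]
chunks = split_by_sample(records)
for name, lines in chunks.items():
    print(name, len(lines))
```

For a real assembly, the lines could come from `gzip.open("contigs.fna.gz", "rt")`, and each group could then be written to its own file so that samples are annotated independently.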

Once everything is set up, you can run the snakemake pipeline.

# Local 
snakemake -s mag_annotation/Snakefile --use-conda -j <threads>

#Aggregate results
mkdir annotations
cat sample_annotation/*/*hmmMiComplete105.tbl > annotations/all.hmmMiComplete105.tbl
cat sample_annotation/*/*hmmVOG.tbl > annotations/all.hmmVOG.tbl
cat sample_annotation/*/*_dvf/*dvfpred.txt > annotations/DVF.predictions.txt

# Clean the DVF files for multiple headers.
head -n1 annotations/DVF.predictions.txt > DVF.header # get first header
grep -v 'pvalue' annotations/DVF.predictions.txt > DVF.predictions # get predictions 
cat DVF.header DVF.predictions > annotations/all.DVF.predictions.txt # combine
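
The header clean-up above can also be sketched in Python. This version keeps the first header line and drops any later line identical to it, rather than filtering on the string `pvalue` as the `grep` call does. The column layout in the example (`name`, `len`, `score`, `pvalue`) is an assumption about the DeepVirFinder output, and `merge_with_single_header` is a hypothetical helper name.

```python
def merge_with_single_header(tables):
    """Concatenate text tables, keeping only the first header line."""
    merged = []
    header = None
    for lines in tables:
        for line in lines:
            line = line.rstrip("\n")
            if not line:
                continue
            if header is None:
                # The very first non-empty line is taken as the header
                header = line
                merged.append(line)
            elif line != header:
                # Repeated headers from later files are dropped
                merged.append(line)
    return merged


# Two hypothetical per-sample prediction tables, for illustration only
header = "name\tlen\tscore\tpvalue"
sample_1 = [header, "S1C1\t5000\t0.91\t0.01"]
sample_2 = [header, "S2C3\t7200\t0.12\t0.60"]
merged = merge_with_single_header([sample_1, sample_2])
print("\n".join(merged))
```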

Depending on the number of samples, it may be worthwhile to run the snakemake workflow on a high-performance computing (HPC) server.

# HPC - this won't work unless you specify a legit group on your HPC in `config.yaml`
snakemake -s mag_annotation/Snakefile --cluster qsub -j <threads> --use-conda

How to Run - not in parallel - quick and dirty

Make sure the Prodigal, HMMER and DeepVirFinder dependencies are installed. Check under mag_annotation/envs for the relevant conda environments.

mkdir annotations
gunzip contigs.fna.gz
python3 /user/DeepVirFinder/dvf.py -i contigs.fna -o DVF -l 2000 -c 1
mv DVF/contigs.fna_gt2000bp_dvfpred.txt annotations/all.DVF.predictions.txt
prodigal -i contigs.fna -d genes.fna -a proteins.faa -p meta -g 11
hmmsearch --cpu <threads> -E 1.0e-05 -o output.txt --tblout annotations/all.hmmMiComplete105.tbl <micompleteDB> proteins.faa
hmmsearch --cpu <threads> -E 1.0e-05 -o output.txt --tblout annotations/all.hmmVOG.tbl <VOGDB> proteins.faa
gzip contigs.fna
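
To give a feel for what the hmmsearch tables contain, here is a minimal sketch that counts distinct HMM hits per contig from `--tblout` lines. It is an illustration only, not how run_RF.py aggregates annotations. It relies on the HMMER tabular layout (target name in column 1, query name in column 3, full-sequence E-value in column 5) and on Prodigal naming proteins `<contig>_<n>`. The function name, the model identifiers and the shortened example rows are hypothetical.

```python
from collections import defaultdict


def distinct_hits_per_contig(tbl_lines, max_evalue=1e-5):
    """Count distinct HMM models hitting the proteins of each contig."""
    hits = defaultdict(set)
    for line in tbl_lines:
        # Comment lines in --tblout output start with "#"
        if line.startswith("#") or not line.strip():
            continue
        fields = line.split()
        target, query, evalue = fields[0], fields[2], float(fields[4])
        if evalue > max_evalue:
            continue
        # Prodigal protein IDs are "<contig>_<n>"; strip the gene index
        contig = target.rsplit("_", 1)[0]
        hits[contig].add(query)
    return {contig: len(models) for contig, models in hits.items()}


# Shortened, hypothetical rows (real rows have more columns)
example = [
    "# target name  accession  query name  accession  E-value  score  bias",
    "S1C1_1 - VOG00001 - 1.2e-30 101.5 0.1",
    "S1C1_2 - VOG00002 - 3.0e-12 45.0 0.0",
    "S1C1_3 - VOG00001 - 8.0e-20 70.2 0.3",
    "S2C4_1 - VOG00002 - 2.0e-03 12.1 0.0",
]
counts = distinct_hits_per_contig(example)
print(counts)
```

In the example, the hit on `S2C4_1` is discarded because its E-value exceeds the threshold, which mirrors the `-E 1.0e-05` cutoff used in the commands above.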

Run the RF model

The provided script writes the virome bins to a FASTA file and summarises the bin annotations in vambbins_aggregated_annotation.txt.

run_RF.py contigs.fna.gz vamb/clusters.tsv annotations resultdir

ls resultdir
resultdir/vambbins_aggregated_annotation.txt
resultdir/vambbins_RF_predictions.txt
resultdir/vamb_bins #Concatenated predicted viral bins - writes bins in chunks to files so there might be several! 

We recommend evaluating the VAMB bins with a dedicated viral evaluation tool such as CheckV or VIBRANT to identify HQ viruses.

checkv end_to_end resultdir/vamb_bins/vamb_bins.1.fna checkv_vamb_bins
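
Because the predicted bins may be written in several chunks, the CheckV call may need to be repeated per file. The sketch below builds one command per chunk. It assumes the chunk files are named `vamb_bins.<n>.fna`, as in the example above; the function name `checkv_commands` and the per-chunk output directory naming are hypothetical. The demo uses empty placeholder files in a temporary directory and only prints the commands.

```python
import tempfile
from pathlib import Path


def checkv_commands(bin_dir, out_root="checkv_vamb_bins"):
    """Return one `checkv end_to_end` argument list per chunked FASTA file."""
    commands = []
    for fasta in sorted(Path(bin_dir).glob("vamb_bins.*.fna")):
        # "vamb_bins.3.fna" -> chunk "3" -> output dir "checkv_vamb_bins.3"
        chunk = fasta.name.split(".")[1]
        out_dir = f"{out_root}.{chunk}"
        commands.append(["checkv", "end_to_end", str(fasta), out_dir])
    return commands


# Demo with empty placeholder files; nothing is executed
with tempfile.TemporaryDirectory() as tmp:
    for n in (1, 2):
        (Path(tmp) / f"vamb_bins.{n}.fna").touch()
    commands = checkv_commands(tmp)
    for cmd in commands:
        print(" ".join(cmd))
```

Each argument list could be passed to `subprocess.run(cmd, check=True)` once CheckV and its database are set up.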

Further information

The RF model takes only a few variables to make an accurate distinction:

| binsize (bp) | nhallm | distinct_VOGs_factor | cluster_DVF_score |
|---|---|---|---|
| 2,000,000 | 100 | 0.2 | 0.3 |
| 60,000 | 3 | 1.3 | 0.7 |