davidesangui / metapresence

Metapresence is a Python tool to calculate metrics to assess the evenness of the coverage distribution on DNA sequences, in order to infer their presence in a given sample.
7 stars 0 forks source link

Metapresence

Calculation of metrics for evaluating the distribution of aligned reads onto nucleotidic sequences.
While Metapresence was developed for the purpose of identifying present species in shotgun metagenomic sequencing data, it can potentially be used to assess the evenness of read distribution onto any kind of DNA/RNA sequences starting from the result of a sequence alignment, provided that the sequences are in FASTA format and the alignment is reported as a sorted and indexed BAM file.

The associated article can be found at: https://doi.org/10.1128/msystems.00213-24
If you use Metapresence, please cite via: Sanguineti D, Zampieri G, Treu L, Campanaro S.0.Metapresence: a tool for accurate species detection in metagenomics based on the genome-wide distribution of mapping reads. mSystems0:e00213-24.https://doi.org/10.1128/msystems.00213-24

Installation

The Python packages numpy and pysam are needed to use Metapresence. You can either create a conda environment and install them in the following way:

# create conda environment
conda create metapresence_env

# activate conda environment
conda activate metapresence_env

# install packages
conda install anaconda::numpy
conda install bioconda::pysam

Or you can install them with pip:

pip install numpy
pip install pysam

Once you have installed the two dependencies, simply download the file < metapresence.py > from this repository, and run it with Python:

python3 metapresence.py -h

Usage

The toy_example will be used to illustrate the basic usage of Metapresence. In this example dataset, a reference database comprising seven genomes is present. 250,000 sequencing reads were generated with CAMISIM (https://github.com/CAMI-challenge/CAMISIM/) from three of these genomes (those without the suffix "_added").
We start with a set of reference genomes and shotgun metagenomic sequencing reads. The first thing to do is to align the sequencing reads on the reference genomes to generate a sorted bam file, which can be then indexed. Bowtie2 (https://github.com/BenLangmead/bowtie2) and Samtools (https://github.com/samtools/samtools) will be used:

# generate a multifasta from reference genomes
cat ref_genomes/* > allgenomes.fasta

# generate bowtie2 index
bowtie2-build --threads 4 allgenomes.fasta allgenomes_index

# align reads with bowtie2, convert to bam and sort with samtools
bowtie2 -1 toy_R1.fastq.gz -2 toy_R2.fastq.gz -x allgenomes_index -p 4 | samtools view -b -@ 4 | samtools sort -@ 4 > alignment.sorted.bam

# generate aligment index (alignment.sorted.bam.bai) with samtools
samtools index -@ 4 alignment.sorted.bam 

Metapresence can now be launched giving it as inputs the folder containing the reference genome files in fasta format, and the sorted bam file:

python3 metapresence.py ref_genomes alignment.sorted.bam -p 4 -o example

< -p > option defines the number of processes to be used for parallelization. < -o > is the prefix of output files.
Two output files are generated:

The input sequences can be either a folder of fasta files, or a fasta file with one or more contigs. If the case is the latter, a text file should be provided with the flag -input_bins. This file is necessary to cluster different contigs as belonging to the same sequence (for instance, a genome). In this text file, each line should list the names of a contig and of the sequence to which it belongs, tab-separated.
If each contig should be treated as an independent sequence, then each line of this text file should list the name of a contig and, tab-separated, again the name of the contig. More simply, the same result can be obtained by setting the flag --all_contigs.

Additional scripts

Parsing metric output

The time consuming step of metapresence.py is that of parsing the bam file to calculate the metric values. On the contrary, inferring relative abundances from the metric values is straightforward and fast.
It may be useful to test different metrics thresholds depending on the specific instance under investigation. In this case, it is useless and time-consuming to re-calculate the metric values by running metapresence.py. In this repository, the script parse_metrics.py is available in the utils subdirectory. This script takes as input the metric output of metapresence.py (*_metrics.tsv) and outputs an abundance file based on the metric threshold parameters given by the user.

python3 parse_metrics.py -metrics example_metrics.tsv -output_abundances new_abundances.tsv -min_reads 30 -ber_threshold 0.7 -fug_threshold 0.4 -max_for_fug 0.2 -fug_criterion mean

Merging abundance outputs

In the utils subdirectory of this repository, the script merge_abundances.py is available. This script takes as input a folder containing (only) abundance outputs of metapresence.py, and it outputs a tab-separated table where the different samples (abundance outputs) are merged together.

python3 merge_abundances.py -abundance_folder samples/ -output_table merged_table.tsv