plASgraph2 - Classifying Plasmid Contigs From Bacterial Assembly Graphs Using Graph Neural Networks

Overview

Identifying plasmids and plasmid genes from sequencing data is important for understanding the spread of antimicrobial resistance and other One Health issues. PlASgraph2 is a deep-learning tool that classifies each contig of a short-read assembly as originating from a plasmid, from the chromosome, or as ambiguous (i.e. it could originate from both, e.g. in the case of a shared repeated contig).

PlASgraph2 is built on a graph neural network (GNN) and analyzes the assembly graph (provided in GFA format) generated by an assembler such as Unicycler or SKESA.

This distribution of PlASgraph2 is provided with a model trained on data from the ESKAPEE group of bacterial pathogens (Enterococcus faecium, Staphylococcus aureus, Klebsiella pneumoniae, Acinetobacter baumannii, Pseudomonas aeruginosa, Enterobacter spp., and Escherichia coli). PlASgraph2 is species-agnostic, so the provided trained model can be applied to analyze data from other pathogen species. Alternatively, plASgraph2 can be trained on a new training dataset (see section Training below).

Installation

PlASgraph2 can be installed from this repository:

git clone https://github.com/cchauve/plASgraph2.git

PlASgraph2 is written in Python 3. It has been developed and tested with Python 3.8.10 and the modules listed in the requirements.txt file. All modules can be installed using pip (https://docs.python.org/3.8/installing/index.html), and we strongly recommend running plASgraph2 in a dedicated Python virtual environment (see https://docs.python.org/3.8/library/venv.html).

The environment can be created e.g. using the commands below:

python3 -m venv venv
source venv/bin/activate
pip3 install -r plASgraph2/requirements.txt

PlASgraph2 uses TensorFlow and can use a GPU to speed up computation, which is mainly useful when training plASgraph2. You can test whether your installation is correctly set up by running the following command in your virtual environment:

python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

If the command returns an error message or an empty list of GPU devices, something is set up incorrectly and plASgraph2 will only use the CPU for the computation. In that case, please see https://www.tensorflow.org/install/pip for further installation instructions and troubleshooting.

If you want to limit plASgraph2 to using the CPU, you can use the environment variable CUDA_VISIBLE_DEVICES as follows:

export CUDA_VISIBLE_DEVICES=""

When this variable is set to an empty string, tensorflow will not attempt to use the GPU.
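The same restriction can be applied from inside a Python session. The block below is a minimal sketch, not part of plASgraph2: the helper names force_cpu_only and list_gpus are hypothetical, and the only TensorFlow call used is the tf.config.list_physical_devices check shown above. The variable should be set before TensorFlow is imported, because devices are discovered when TensorFlow initializes.

```python
import os


def force_cpu_only() -> None:
    """Hide all CUDA devices from libraries imported after this call.

    Hypothetical helper: it has the same effect as running
    `export CUDA_VISIBLE_DEVICES=""` in the shell before starting Python.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = ""


def list_gpus():
    """Return the GPUs TensorFlow reports, or None if TensorFlow is not installed."""
    try:
        import tensorflow as tf  # imported lazily, after the variable is set
    except ImportError:
        return None
    return tf.config.list_physical_devices("GPU")


# Set the variable first, then query TensorFlow if desired.
force_cpu_only()
```

Calling list_gpus() after force_cpu_only() should then report no GPU devices, mirroring the command-line check.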

Classification

The input for plASgraph2 consists of a trained model and either a single assembly graph from a single bacterial sample in gzipped GFA (.gfa.gz) format or a CSV file listing the gzipped GFA files to analyze.

Given a single gzipped GFA file assembly_graph.gfa.gz and a model located in directory model_dir, the contigs of the sample can be classified using the command

python ./src/plASgraph2_classify.py gfa assembly_graph.gfa.gz model_dir output.csv

The result is written to the file output.csv, which contains one line per contig, recording its length, plasmid score, chromosome score and final label. Only contigs with length greater than 100 are listed; this length threshold can be set in the config file before training.
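To preview which contigs pass the length threshold before running the classifier, the gzipped GFA file can be inspected directly. The following is a minimal sketch and not plASgraph2's own parser: the function names are hypothetical, and it assumes GFA1 segment lines of the form S, name, sequence, where a sequence given as * carries its length in an LN:i: tag.

```python
import gzip


def contig_lengths(gfa_gz_path):
    """Map contig name -> length for every segment (S) line of a gzipped GFA file."""
    lengths = {}
    with gzip.open(gfa_gz_path, "rt") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if fields[0] != "S" or len(fields) < 3:
                continue  # skip header, link and other record types
            name, sequence = fields[1], fields[2]
            length = len(sequence)
            if sequence == "*":
                # Sequence omitted: fall back to the optional LN:i:<length> tag.
                length = 0
                for tag in fields[3:]:
                    if tag.startswith("LN:i:"):
                        length = int(tag[len("LN:i:"):])
            lengths[name] = length
    return lengths


def reportable_contigs(lengths, minimum_length=100):
    """Keep contigs strictly longer than the threshold (100 by default)."""
    return {name: n for name, n in lengths.items() if n > minimum_length}
```

For example, reportable_contigs(contig_lengths("assembly_graph.gfa.gz")) returns the contigs that would be expected in output.csv under the default threshold.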

For example, you can run plASgraph2 on one of the GFA files provided in the directory example:

python ../src/plASgraph2_classify.py gfa SAMN15148288_SKESA.gfa.gz ../model/ESKAPEE_model/ SAMN15148288_SKESA_output.csv

To classify the contigs of several samples at once, the input is a CSV file input.csv with one line per sample: the first field is the name of the gzipped assembly graph file, the second field is not used for classification, and the last field is the name of the sample. All assembly graph files listed in the file are assumed to be located in the same directory data_dir. The samples' contigs can then be classified using the command

python ./src/plASgraph2_classify.py set input.csv data_dir/ model_dir output.csv

As in the previous case, output.csv is a CSV file containing the results for all contigs of all samples.
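A sample list for the set mode can be generated from the contents of data_dir. The block below is a sketch under stated assumptions rather than a tool shipped with plASgraph2: the function name is hypothetical, the sample name is derived from the file name, the unused second field is filled with a placeholder, and no header row is written. Check these choices against the provided example/SAMN15148288_input.csv before relying on them.

```python
import csv
from pathlib import Path


def write_classification_list(data_dir, list_path, placeholder=""):
    """Write one CSV line per *.gfa.gz file found in data_dir.

    Fields: GFA file name, placeholder (unused for classification), sample name.
    Returns the rows that were written.
    """
    suffix = ".gfa.gz"
    rows = []
    for gfa in sorted(Path(data_dir).glob("*" + suffix)):
        sample_name = gfa.name[: -len(suffix)]  # assumption: name the sample after its file
        rows.append([gfa.name, placeholder, sample_name])
    with open(list_path, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)
    return rows
```

The resulting file can then be passed as input.csv to the set command above, with data_dir/ as the directory argument.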

The directory example contains an example that has been generated by the command

python ../src/plASgraph2_classify.py set SAMN15148288_input.csv ./ ../model/ESKAPEE_model/ SAMN15148288_output.csv

The file generated by plASgraph2 (example/SAMN15148288_output.csv) is a CSV file with six fields: sample name, contig name, contig length, plasmid score and chromosome score (both numbers in [0,1]), and contig label (plasmid, chromosome, ambiguous or unlabeled).
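The output file can be post-processed with a few lines of Python, for instance to list the plasmid contigs of each sample. This is a minimal sketch with hypothetical function names; it reads the fields by position in the order listed above (sample, contig, length, plasmid score, chromosome score, label) and skips any line whose numeric fields do not parse, which also covers a header row if one is present.

```python
import csv
from collections import defaultdict


def read_predictions(handle):
    """Parse plASgraph2-style output rows into dictionaries, reading fields by position."""
    rows = []
    for fields in csv.reader(handle):
        if len(fields) != 6:
            continue  # not a sample/contig/length/scores/label line
        sample, contig, length, plasmid, chromosome, label = fields
        try:
            row = {
                "sample": sample,
                "contig": contig,
                "length": int(length),
                "plasmid_score": float(plasmid),
                "chromosome_score": float(chromosome),
                "label": label,
            }
        except ValueError:
            continue  # header or malformed line
        rows.append(row)
    return rows


def contigs_with_label(rows, label):
    """Group the names of contigs carrying the given label by sample."""
    grouped = defaultdict(list)
    for row in rows:
        if row["label"] == label:
            grouped[row["sample"]].append(row["contig"])
    return dict(grouped)
```

A typical use is to open the output file with open("SAMN15148288_output.csv", newline="") and pass the handle to read_predictions, then call contigs_with_label(rows, "plasmid").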

Training

Training a plASgraph2 model requires (1) assembly graphs in gzipped GFA format for the training samples and (2) a labeling of the training samples' contigs as either plasmid, chromosome, ambiguous (contigs that appear in both a plasmid and the chromosome) or unlabeled (typically very short contigs or others where the correct label cannot be determined).

The training input consists of two files: a configuration file in YAML format (config_default.yaml in the example below) and a CSV file listing the training samples (eskapee-train.csv in the example below).

File paths in the CSV training file are assumed to be relative, with the prefix of the path for each file being provided as a command-line parameter (see example of command-line below). This assumption implies that all GFA and CSV training files are located in the same directory (although they can be located in different subdirectories).
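Because every path in the training CSV is resolved against the prefix given on the command line, a quick existence check before a long training run can save time. The sketch below is not part of plASgraph2 and its function name is hypothetical. It assumes that each line has at least three fields and that the first two are relative paths to the gzipped GFA file and to the CSV file with contig labels; the authoritative description of the file formats is in the plasgraph2-datasets repository.

```python
import csv
from pathlib import Path


def missing_training_files(train_csv, path_prefix):
    """Return the prefixed paths listed in the training CSV that do not exist.

    Assumed layout per line: relative GFA path, relative labels CSV path, sample name.
    """
    prefix = Path(path_prefix)
    missing = []
    with open(train_csv, newline="") as handle:
        for fields in csv.reader(handle):
            if len(fields) < 3:
                continue  # blank or malformed line
            for relative_path in fields[:2]:
                candidate = prefix / relative_path.strip()
                if not candidate.is_file():
                    missing.append(candidate)
    return missing
```

An empty list means that every referenced file was found under the prefix.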

For example, to re-train the ESKAPEE plASgraph2 model, one would run the commands

cd model
git clone https://github.com/fmfi-compbio/plasgraph2-datasets.git
python ../src/plASgraph2_train.py ./config_default.yaml plasgraph2-datasets/eskapee-train.csv plasgraph2-datasets/ ./ESKAPEE_model > ./ESKAPEE_model.log 2> ./ESKAPEE_model.err

Remark. PlASgraph2 has been trained and tested on assembly graphs generated by the assemblers Unicycler and SKESA. The file formats are described in https://github.com/fmfi-compbio/plasgraph2-datasets.

The trained model is written to the directory ./ESKAPEE_model, while the files ./ESKAPEE_model.log and ./ESKAPEE_model.err record the log and any errors that occurred during training. The model is provided in the file ./ESKAPEE_model/saved_model.pb.

Two additional options are available: -g saves the assembly graph and all node features as a file in GML format, and -l generates additional log files.

Visualization

We also provide a script for visualizing plASgraph2 results in the context of assembly graphs. To visualize predicted labels and plasmid scores of a single GFA file, run the following commands in the example directory:

# classification
python ../src/plASgraph2_classify.py gfa SAMN15148288_SKESA.gfa.gz ../model/ESKAPEE_model/ SAMN15148288_SKESA_output.csv
# visualize labels
python ../src/plASgraph2_visualize_graphs.py gfa SAMN15148288_SKESA.gfa.gz SAMN15148288_SKESA_output.csv label SAMN15148288_SKESA_labels.png
# visualize plasmid scores
python ../src/plASgraph2_visualize_graphs.py gfa SAMN15148288_SKESA.gfa.gz SAMN15148288_SKESA_output.csv plasmid_score SAMN15148288_SKESA_scores.png
# visualize labels and plasmid scores together
python ../src/plASgraph2_visualize_graphs.py gfa SAMN15148288_SKESA.gfa.gz SAMN15148288_SKESA_output.csv label:plasmid_score SAMN15148288_SKESA_labels_scores.png

It is also possible to create images for a whole dataset containing many samples. Here is an example using the two GFA files in the example directory:

# visualize three images per GFA file: one with labels, one with plasmid scores and one with both combined
python ../src/plASgraph2_visualize_graphs.py set SAMN15148288_input.csv ./ SAMN15148288_output.csv label,plasmid_score,label:plasmid_score figures/

More information can be obtained by running python ../src/plASgraph2_classify.py -h and python ../src/plASgraph2_classify.py set -h from the example directory.

Citation

Janik Sielemann, Katharina Sielemann, Broňa Brejová, Tomas Vinar, Cedric Chauve; "plASgraph2: Using Graph Neural Networks to Detect Plasmid Contigs from an Assembly Graph", Frontiers in Microbiology, in press, 2023.