CCB-SB / plsdb

PLSDB pipeline to collect bacterial plasmids from NCBI
https://ccb-microbe.cs.uni-saarland.de/plsdb/

Pipeline for data collection

News

Our manuscript discussing the new features of PLSDB was accepted for the 2022 Nucleic Acids Research Database Issue! The manuscript can be found here.

Summary

[Figure: pipeline graph]

Preparations

PubMLST data

This data processing pipeline makes use of the PubMLST website developed by Keith Jolley (Jolley & Maiden 2010, BMC Bioinformatics, 11:595) and sited at the University of Oxford. The development of that website was funded by the Wellcome Trust.

rMLST data

Note: requires a PubMLST account.

Note: requires a graphical interface.

Note: the PubMLST account needs to request access to the Ribosomal MLST locus/sequence definitions database from the rMLST admin. Access is normally granted within a day.

Error "Message: 'chromedriver' executable needs to be in PATH": make sure that chromedriver is installed. As chromium already comes with a chromedriver installation, you can try (on Arch Linux) sudo pacman -Syyu followed by sudo pacman -S chromium

To remove putative chromosomal sequences, an rMLST analysis is performed, which requires rMLST sequences from PubMLST. There is an API for the PubMLST services; however, using it seems to require much more effort than downloading the data through a web browser. Thus, there is a rule (retrieve_rmlst_data) that downloads the sequences automatically (given the login data). This rule needs a graphical interface, so please run it locally on your computer.

A login and password are required here. Please create an account and specify your credentials in config.yml.

Note: the cookie agreement banner might cause problems. The rule requires minor changes if the "Got it!" link text on the website is changed to something else.

pMLST

There is a mapping from PlasmidFinder IDs to pMLST profile names in pipeline.json (pmlst/map). It may require an update depending on which pMLST schemes are available from PubMLST and which IDs are currently in the PlasmidFinder database.
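To see whether the mapping needs an update, one option is to compare the PlasmidFinder IDs you currently have against the keys of the mapping. The sketch below is only an illustration: it assumes that pmlst/map in pipeline.json is a JSON object keyed by PlasmidFinder ID with pMLST profile names as values, and the function name and the IDs in the example are hypothetical placeholders, not taken from the pipeline.

```python
import json


def unmapped_plasmidfinder_ids(pipeline_cfg, plasmidfinder_ids):
    """Return the PlasmidFinder IDs that have no entry under pmlst/map.

    Assumption: pipeline_cfg["pmlst"]["map"] is a dict of the form
    {plasmidfinder_id: pmlst_profile_name}.
    """
    mapping = pipeline_cfg["pmlst"]["map"]
    return sorted(set(plasmidfinder_ids) - set(mapping))


# Toy configuration with placeholder IDs and profile names (hypothetical).
example_cfg = json.loads(
    '{"pmlst": {"map": {"ID_A": "profile_a", "ID_B": "profile_b"}}}'
)

# IDs that would need a new entry in the mapping.
missing = unmapped_plasmidfinder_ids(example_cfg, ["ID_A", "ID_C"])
print(missing)
```

With the real file, the configuration would be loaded with json.load from pipeline.json instead of the inline string.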

ABRicate

Please check whether the most recent version of ABRicate contains the most recent database links (abricate-get_db).

IMPORTANT: Currently, ABRicate (version 1.0.1) does not update some databases correctly:

API keys

NCBI data

To retrieve data from NCBI, please obtain an API key and specify it in config.yml.
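As an illustration of what the key is used for, NCBI's E-utilities accept it as the api_key query parameter, which raises the allowed request rate. The following is a minimal sketch of how such a request URL can be assembled; the function name, the search term and the placeholder key are assumptions for this example, and the pipeline itself may build its requests differently.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(term, api_key, db="nuccore", retmax=20):
    """Build an ESearch URL that carries the API key.

    Hypothetical helper: the key would normally be read from config.yml.
    """
    params = {"db": db, "term": term, "retmax": retmax, "api_key": api_key}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Placeholder key; replace it with the key from your NCBI account.
url = build_esearch_url("plasmid[Title]", api_key="YOUR_NCBI_API_KEY")
print(url)
```

The sketch only builds the URL and sends nothing, so it can be run without network access or a valid key.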

Location queries

To map location names to coordinates, the Nominatim API and the Google API are used. The Google API is only used for comparative purposes, as Google's policy does not allow the storage of its content (more here). Google requires an API key, which in turn requires you to register (more info).
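For orientation, a Nominatim lookup is a request to its search endpoint that returns a JSON list of hits, each with lat and lon given as strings. The sketch below shows one way to build the request URL and extract the first coordinate pair; the helper names and the sample response are made up for illustration and are not the pipeline's own code.

```python
import json
from urllib.parse import urlencode

# Search endpoint of the public Nominatim server.
NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"


def build_nominatim_url(location_name):
    """Build a search URL asking for at most one JSON-formatted hit."""
    query = urlencode({"q": location_name, "format": "json", "limit": 1})
    return f"{NOMINATIM_SEARCH}?{query}"


def first_coordinates(response_text):
    """Return (lat, lon) of the first hit as floats, or None if there are no hits."""
    hits = json.loads(response_text)
    if not hits:
        return None
    return float(hits[0]["lat"]), float(hits[0]["lon"])


# Made-up response body, shaped like a Nominatim JSON answer.
sample_response = '[{"display_name": "Example place", "lat": "49.2", "lon": "7.0"}]'
print(build_nominatim_url("Germany: Saarland"))
print(first_coordinates(sample_response))
```

When querying the public server for real, keep its usage policy in mind: it expects an identifying User-Agent header and a low request rate.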

BIOSAMPLE_Host

Already known hosts are in hosts_version.csv. Run the rule process_create_host_mapping and manually check the new versions of host mapping. Find more details at the end of the log file of the rule (logs/process_create_host_mapping.log) or in the rule process_manually_inspect_hosts.

BIOSAMPLE_Location

Already known locations are in locations_version.csv, and corrections needed to find some specific locations are listed in location_correction_version.csv.

Run the rule process_parse_locations and manually check the new versions of location and location_corrections. Find more details in the rule process_manually_inspect_locations.
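The manual check boils down to finding location names that are neither already known nor resolved by a correction. The sketch below illustrates that idea on toy data only; the column names (location, raw, corrected) and the function names are hypothetical and do not describe the actual layout of the CSV files.

```python
import csv
import io


def load_column(csv_text, column):
    """Read one column from CSV text (hypothetical column name)."""
    return [row[column] for row in csv.DictReader(io.StringIO(csv_text))]


def load_corrections(csv_text, raw_col="raw", fixed_col="corrected"):
    """Read a raw-name to corrected-name mapping (hypothetical column names)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[raw_col]: row[fixed_col] for row in reader}


def names_needing_inspection(observed, known, corrections):
    """Apply corrections, then return unknown names in first-seen order."""
    known_set = set(known)
    pending = []
    for name in observed:
        name = corrections.get(name, name)
        if name not in known_set and name not in pending:
            pending.append(name)
    return pending


# Toy data standing in for the known-locations and corrections tables.
known = load_column("location\nGermany\nFrance\n", "location")
corrections = load_corrections("raw,corrected\nGermnay,Germany\n")
print(names_needing_inspection(["Germnay", "France", "Atlantis"], known, corrections))
```

Here only the name that is neither known nor corrected is left for manual inspection.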

Conda & Snakemake

If needed, install (mini-)conda

cd ~
# get miniconda (for linux)
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
# install
bash Miniconda3-latest-Linux-x86_64.sh
# set path to binaries in your ~/.bashrc
export PATH=$HOME/miniconda3/bin:$PATH

If needed, install snakemake

# If required, install the mamba package manager in your base env
conda install -c conda-forge mamba
# Install snakemake
mamba create -c conda-forge -c bioconda -n snakemake snakemake

Current versions:

Comparing new and old versions

The last rule in the pipeline requires a "master" table from an older version. The path has to be set in config.yml (attribute previous_table) and the file must exist.
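Because the last rule fails late if this file is missing, it can be worth checking the setting before starting a run. The sketch below shows one possible pre-flight check, assuming the configuration has already been loaded into a dictionary; only the attribute name previous_table comes from the pipeline, while the function name and error messages are hypothetical.

```python
from pathlib import Path


def check_previous_table(config):
    """Return the path of the older "master" table, or raise if it is unusable.

    Assumption: config is the content of config.yml loaded into a dict.
    """
    path = config.get("previous_table")
    if not path:
        raise KeyError("config.yml: attribute 'previous_table' is not set")
    table = Path(path)
    if not table.is_file():
        raise FileNotFoundError(f"previous_table does not exist: {table}")
    return table
```

Calling check_previous_table(config) before launching Snakemake would surface a missing or misspelled path immediately instead of at the end of the run.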

Running the pipeline

Groups of execution (sequentially)

References

This data processing pipeline makes use of the PubMLST website developed by Keith Jolley (Jolley & Maiden 2010, BMC Bioinformatics, 11:595) and sited at the University of Oxford. The development of that website was funded by the Wellcome Trust.