Russel88 / CRISPRCasTyper

CCTyper: Automatic detection and subtyping of CRISPR-Cas operons
https://typer.crispr.dk
MIT License

Project Status: Active. The project has reached a stable, usable state and is being actively developed.

CRISPRCasTyper

Detect CRISPR-Cas genes and arrays, and predict the subtype based on both Cas genes and CRISPR repeat sequence.

CRISPRCasTyper and RepeatTyper are also available through a webserver.

This software finds Cas genes with a large suite of HMMs, groups the hits into operons, and predicts the subtype of each operon with a scoring scheme. It finds CRISPR arrays with minced and by BLASTing a large suite of known repeats, then predicts the subtype of each array from its consensus repeat using a kmer-based machine learning approach (extreme gradient boosting trees). It then connects the Cas operons and CRISPR arrays, producing as output:

It includes the following 50 subtypes/variants (find typing scheme here):

It can automatically draw gene maps of CRISPR-Cas systems and orphan Cas operons and CRISPR arrays, in vector graphics format for direct use in scientific manuscripts.

Citation

Jakob Russel, Rafael Pinilla-Redondo, David Mayo-Muñoz, Shiraz A. Shah, Søren J. Sørensen - CRISPRCasTyper: Automated Identification, Annotation and Classification of CRISPR-Cas loci. The CRISPR Journal Dec 2020

Find a free to read version on BioRxiv

Table of contents

  1. Quick start
  2. Installation
  3. CRISPRCasTyper - How to
  4. RepeatTyper - How to
  5. RepeatTyper - Train
  6. Troubleshoot

Quick start

conda create -n cctyper -c conda-forge -c bioconda -c russel88 cctyper
conda activate cctyper
cctyper my.fasta my_output

Installation

CRISPRCasTyper can be installed either through conda or pip.

It is advised to use conda, since this installs CRISPRCasTyper and all dependencies, and downloads the database in one go.

Conda

Use miniconda or anaconda to install.

Create the environment with CRISPRCasTyper and all dependencies and database

conda create -n cctyper -c conda-forge -c bioconda -c russel88 cctyper

pip

If you have the dependencies (Python >= 3.8, HMMER >= 3.2, Prodigal >= 2.6, minced, grep, sed) in your PATH you can install with pip

Install cctyper python module

python -m pip install cctyper

Upgrade cctyper python module to the latest version

python -m pip install cctyper --upgrade

When installing with pip, you need to download the database manually:

# Download and unpack
svn checkout https://github.com/Russel88/CRISPRCasTyper/trunk/data
tar -xvzf data/Profiles.tar.gz
mv Profiles/ data/
rm data/Profiles.tar.gz

# Tell CRISPRCasTyper where the data is:
# either by setting an environment variable (has to be done for each terminal session, or added to .bashrc):
export CCTYPER_DB="/path/to/data/"
# or by using the --db argument each time you run CRISPRCasTyper:
cctyper input.fa output --db /path/to/data/

CRISPRCasTyper - How to

CRISPRCasTyper takes as input a nucleotide fasta, and produces outputs with CRISPR-Cas predictions

Activate environment

conda activate cctyper

Run with a nucleotide fasta as input

cctyper genome.fa my_output

If you have a complete circular genome (each entry in the fasta will be treated as having circular topology)

cctyper genome.fa my_output --circular

For metagenome assemblies and short contigs/plasmids/phages, change the prodigal mode

The default prodigal mode expects the input to be a single draft or complete genome

cctyper assembly.fa my_output --prodigal meta

Check the different options

cctyper -h
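
When many genomes are processed in a script, the invocations above can be assembled programmatically. The following is a minimal sketch, not part of CRISPRCasTyper: the helper name build_cctyper_command and its keyword arguments are hypothetical, and it only uses flags that appear in this README (--circular, --prodigal meta, --db).

```python
import shlex
from typing import List, Optional


def build_cctyper_command(
    fasta: str,
    outdir: str,
    circular: bool = False,
    prodigal_meta: bool = False,
    db: Optional[str] = None,
) -> List[str]:
    """Assemble a cctyper argument list (hypothetical helper, not part of cctyper)."""
    cmd = ["cctyper", fasta, outdir]
    if circular:
        # Treat every FASTA entry as having circular topology
        cmd.append("--circular")
    if prodigal_meta:
        # Prodigal mode for metagenome assemblies and short contigs
        cmd += ["--prodigal", "meta"]
    if db is not None:
        # Explicit database location instead of the CCTYPER_DB variable
        cmd += ["--db", db]
    return cmd


cmd = build_cctyper_command("assembly.fa", "my_output", prodigal_meta=True)
print(shlex.join(cmd))
```

To actually run the command, the list can be passed to subprocess.run(cmd, check=True) inside the activated environment.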

Output

If run with --keep_tmp the following is also produced

Notes on output

Files are only created if there is any data. For example, the CRISPR_Cas.tab file is only created if there are any CRISPR-Cas loci.
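
Because output files can be absent, downstream scripts should check for them before reading. The sketch below is one way to do that. The helper name read_table_if_present is hypothetical, and it assumes the .tab files are tab-delimited with a header row, which this README does not state explicitly.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict, List


def read_table_if_present(outdir: str, name: str = "CRISPR_Cas.tab") -> List[Dict[str, str]]:
    """Return the rows of a tab-delimited output file, or [] if it was not created."""
    path = Path(outdir) / name
    if not path.is_file():
        # No file means there was no data of this kind, which is not an error
        return []
    with path.open(newline="") as handle:
        # Assumption: the first line is a header naming the columns
        return list(csv.DictReader(handle, delimiter="\t"))


# Demonstration on an empty directory: no file, so no rows
with tempfile.TemporaryDirectory() as demo_dir:
    print(read_table_if_present(demo_dir))
```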

Plotting

CRISPRCasTyper will automatically plot a map of the CRISPR-Cas loci, orphan Cas operons, and orphan CRISPR arrays.

These maps can be expanded (--expand N) by adding unknown genes and genes with alignment scores below the thresholds. This can help identify potentially unannotated genes in operons. You can generate new plots without re-running the entire pipeline by adding --redo_typing to the command. This re-uses the mappings, re-types the operons, and re-makes the plot based on the new thresholds and plot parameters.

The plot below is run with --expand 5000

RepeatTyper - How to

With an input of CRISPR repeats (one per line, in a simple textfile) RepeatTyper will predict the subtype, based on the kmer composition of the repeat
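
To give an intuition for what "kmer composition" means, the sketch below turns a repeat into a fixed-length vector of k-mer counts, the kind of feature vector a gradient-boosted classifier could consume. It is an illustration only: the function name kmer_counts, the choice k=4, and the plain (non-canonical) counting are assumptions, and the features actually used by RepeatTyper may differ.

```python
from collections import Counter
from itertools import product
from typing import List


def kmer_counts(repeat: str, k: int = 4) -> List[int]:
    """Count every possible DNA k-mer in a repeat, in fixed alphabetical order."""
    seq = repeat.upper()
    # Slide a window of width k along the sequence and tally each k-mer
    observed = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # One entry per possible k-mer (4**k in total); unseen k-mers count as 0
    return [observed["".join(kmer)] for kmer in product("ACGT", repeat=k)]


features = kmer_counts("ACGA", k=2)
print(len(features), sum(features))
```

Every repeat maps to a vector of the same length regardless of how long the repeat is, which is what makes this representation convenient for tree-based models.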

Activate environment

conda activate cctyper

Run with a simple textfile, containing only CRISPR repeats (in capital letters), one repeat per line.

repeatType repeats.txt
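
If your repeats come from mixed sources, it can help to normalise them before writing repeats.txt. The following is a minimal sketch: the helper clean_repeats is hypothetical, the example sequences are made-up placeholders, and rejecting anything other than A, C, G, and T is an assumption rather than a documented RepeatTyper requirement.

```python
from typing import Iterable, List


def clean_repeats(raw: Iterable[str]) -> List[str]:
    """Upper-case repeats, drop blank lines, and reject non-ACGT characters."""
    cleaned = []
    for line in raw:
        seq = line.strip().upper()
        if not seq:
            continue  # skip empty lines
        if set(seq) - set("ACGT"):
            # Assumption: only unambiguous nucleotides are accepted
            raise ValueError(f"unexpected characters in repeat: {seq!r}")
        cleaned.append(seq)
    return cleaned


# Placeholder sequences for illustration only
repeats = clean_repeats(["gtttTAGAGCTA", "", " ATTTCAATCC \n"])
print("\n".join(repeats))
```

The printed lines can be saved as repeats.txt, one repeat per line, and passed to repeatType.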

Output

The script prints:

Notes on output

Updated RepeatTyper models

The CCTyper webserver crowdsources subtyped repeats and includes an updated RepeatTyper model. Compared to the curated RepeatTyper model, this model is based on a much larger set of repeats and covers additional subtypes. It is automatically retrained each month, and the models can be downloaded here.

From CCTyper version 1.4.0 onwards, the newest RepeatTyper model is included with each release.

Each model contains a training report (xgb_report), where you can find the training log and, at the bottom, the accuracy, both overall and per subtype.

Use new model in CRISPRCasTyper

Save the original database files:

mv ${CCTYPER_DB}/type_dict.tab ${CCTYPER_DB}/type_dict_orig.tab
mv ${CCTYPER_DB}/xgb_repeats.model ${CCTYPER_DB}/xgb_repeats_orig.model

Move the new model into the database folder

mv repeat_model/* ${CCTYPER_DB}/

CRISPRCasTyper and RepeatTyper will now use the new model for repeat prediction!

RepeatTyper - Train

You can train the repeat classifier with your own set of subtyped repeats. Given a tab-delimited input where the first column contains the subtypes and the second column contains the CRISPR repeat sequences, RepeatTrain will train a CRISPR repeat classifier that is directly usable by both RepeatTyper and CRISPRCasTyper.

Train

repeatTrain typed_repeats.tab my_classifier
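
The training input can be generated from a list of (subtype, repeat) pairs. The sketch below shows one way to produce the two-column layout described above. The helper format_training_table is hypothetical, the sequences are placeholders, and both the absence of a header line and the upper-casing of repeats are assumptions.

```python
from typing import Iterable, Tuple


def format_training_table(pairs: Iterable[Tuple[str, str]]) -> str:
    """Render (subtype, repeat) pairs as tab-delimited lines, subtype first."""
    lines = []
    for subtype, repeat in pairs:
        if "\t" in subtype or "\t" in repeat:
            raise ValueError("fields must not contain tab characters")
        # Column 1: subtype, column 2: repeat sequence (upper-cased by assumption)
        lines.append(f"{subtype}\t{repeat.upper()}\n")
    return "".join(lines)


# Placeholder sequences for illustration only
table = format_training_table([("I-E", "acgt"), ("II-A", "GGTT")])
print(table, end="")
```

Writing the returned string to typed_repeats.tab gives a file in the layout that the repeatTrain command above takes as its first argument.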

Use new model in RepeatTyper

repeatType repeats.txt --db my_classifier

Use new model in CRISPRCasTyper

Save the original database files:

mv ${CCTYPER_DB}/type_dict.tab ${CCTYPER_DB}/type_dict_orig.tab
mv ${CCTYPER_DB}/xgb_repeats.model ${CCTYPER_DB}/xgb_repeats_orig.model

Move the new model into the database folder

mv my_classifier/* ${CCTYPER_DB}/

CRISPRCasTyper and RepeatTyper will now use the new model for repeat prediction!

Troubleshoot

Running out of memory

Large metagenomic assemblies with many small contigs can exhaust the RAM on your laptop. Fortunately, as metagenomic contigs are analysed separately (when run with --prodigal meta), a simple solution is to split the input into smaller chunks (e.g. with pyfasta).
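
If you would rather not install an extra tool, the splitting can also be sketched in plain Python. This is an illustrative alternative, not the pyfasta approach. The function name split_fasta and the chunk size are hypothetical choices to tune against your available memory.

```python
from typing import Iterable, Iterator, List


def split_fasta(lines: Iterable[str], records_per_chunk: int) -> Iterator[str]:
    """Yield FASTA text chunks holding at most `records_per_chunk` records each."""
    if records_per_chunk < 1:
        raise ValueError("records_per_chunk must be at least 1")
    chunk: List[str] = []
    n_records = 0
    for line in lines:
        if line.startswith(">"):
            # A new record begins; emit the current chunk first if it is full
            if n_records == records_per_chunk:
                yield "".join(chunk)
                chunk, n_records = [], 0
            n_records += 1
        if line.strip():
            chunk.append(line if line.endswith("\n") else line + "\n")
    if chunk:
        yield "".join(chunk)


demo = ">a\nAC\nGT\n>b\nGG\n>c\nTT\n"
chunks = list(split_fasta(demo.splitlines(keepends=True), 2))
print(len(chunks))
```

In practice you would pass an open file handle instead of the demo string, write each chunk to its own file, and run cctyper with --prodigal meta on each file in turn. The function reads the input line by line, so only one chunk is held in memory at a time.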