arpcard / rgi

Resistance Gene Identifier (RGI). Software to predict resistomes from protein or nucleotide data, including metagenomics data, based on homology and SNP models.

variants data, blast/diamond build incompatibility #156

Closed acvill closed 3 years ago

acvill commented 3 years ago

This is less an issue than a warning to other users. I am using the conda build of rgi v5.2.0 to annotate resistance genes in metaSPAdes contigs. The instructions for loading the reference data advise pulling the latest dataset, which at the time of writing is version 3.1.2.

wget https://card.mcmaster.ca/latest/data

For metagenomic analyses, the README also recommends using the Prevalence, Resistomes, & Variants data to annotate the CARD database, and provides instructions to download the latest version (v3.0.8).

wget -O wildcard_data.tar.bz2 https://card.mcmaster.ca/latest/variants
mkdir -p wildcard
tar -xjf wildcard_data.tar.bz2 -C wildcard
gunzip wildcard/*.gz

After following the steps to create a wildCARD annotation and running rgi load --wildcard_annotation to load a local database, I found that rgi main would throw errors at the alignment step, regardless of the alignment tool selected.

With -a DIAMOND:

Error: Incompatible database version

With -a BLAST:

BLAST Database error: Error pre-fetching sequence data

A closer look at the information on the CARD data download page suggests that these errors may be due to a build incompatibility between CARD database v3.1.2 and variants database v3.0.8. The release note for the variants data reads:

January 2021 release - Addition of genomic islands from Islandviewer 4. 221 pathogens, 10272 chromosomes, 1872 genomic islands, 22692 plasmids, 95059 WGS assemblies, 213809 alleles based on sequence data acquired from NCBI on August 13, 2020 and Islandviewer 4, analyzed using RGI 5.1.0 (DIAMOND homolog detection) and CARD 3.1.0. Includes pre-compiled k-mer classifier data for pathogen-of-origin prediction.
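
A check along these lines could flag the mismatch before running rgi main. It is only a sketch: the helper names are hypothetical, the CARD version the variants data was built on has to be copied by hand from the release note quoted above, and reading the version from a top-level "_version" field in card.json is an assumption to verify against your own copy.

```shell
#!/usr/bin/env bash
# Sketch: compare the CARD version in use with the CARD version the
# variants (wildCARD) data was built against. Helper names are hypothetical.

# Print the version string stored in a card.json file.
# ASSUMPTION: card.json carries a top-level "_version" string field.
card_json_version() {
    grep -o '"_version"[[:space:]]*:[[:space:]]*"[^"]*"' "$1" \
        | head -n 1 \
        | sed 's/.*"\([^"]*\)"$/\1/'
}

# Succeed only when the two version strings are identical.
check_card_versions() {
    local card_version="$1"       # version of the card.json you downloaded
    local variants_built_on="$2"  # CARD version named in the variants release note
    if [ "$card_version" = "$variants_built_on" ]; then
        echo "OK: CARD v${card_version} matches the variants build"
    else
        echo "WARNING: CARD v${card_version} differs from CARD v${variants_built_on} used to build the variants data" >&2
        return 1
    fi
}

# Example use (3.1.0 taken from the release note for variants v3.0.8):
#   check_card_versions "$(card_json_version card.json)" 3.1.0
```

An exact string match is deliberately strict; whether nearby CARD releases are actually interchangeable with a given variants build is not something this check can tell you.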

Indeed, rgi main does not throw these errors when I use CARD v3.1.0. Here's my workflow, including conda environment setup:

# create rgi conda environment
source /home/miniconda3/bin/activate
conda create --name rgi5.2.0
conda activate rgi5.2.0
conda install --channel bioconda rgi=5.2.0

# download CARD data v3.1.0
carddb=/workdir/CARD
mkdir -p $carddb
cd $carddb
wget https://card.mcmaster.ca/download/0/broadstreet-v3.1.0.tar.bz2
tar -xvf broadstreet-v3.1.0.tar.bz2 ./card.json
rgi card_annotation -i card.json > card_annotation.log 2>&1

# download variants (wildCARD) data v3.0.8 and annotate database
wget https://card.mcmaster.ca/download/6/prevalence-v3.0.8.tar.bz2
mkdir -p wildcard
tar -xjf prevalence-v3.0.8.tar.bz2 -C wildcard
rgi wildcard_annotation -i wildcard --card_json card.json -v 3.0.8 > wildcard_annotation.log 2>&1

# load local database and run rgi
# (assumes $data was set beforehand to the directory containing contigs.fasta)
cd $data
export OMP_NUM_THREADS=12
rgi load --wildcard_annotation $carddb/wildcard_database_v3.0.8.fasta \
    --wildcard_index $carddb/wildcard/index-for-model-sequences.txt \
    --card_annotation $carddb/card_database_v3.1.0.fasta \
    --local
rgi main \
    -i contigs.fasta \
    -t contig \
    -a DIAMOND \
    -n $OMP_NUM_THREADS \
    -o test
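
To confirm which CARD version ended up in the local database after rgi load, something like the following may help. It is a sketch: the wrapper name is hypothetical, and it relies on the rgi database --version --local subcommand described in the RGI README.

```shell
#!/usr/bin/env bash
# Sketch: report the CARD version of the locally loaded database.
# Run it from the directory where `rgi load --local` was executed.
# The function name is hypothetical.
loaded_card_version() {
    if ! command -v rgi >/dev/null 2>&1; then
        echo "rgi not found; activate the conda environment first" >&2
        return 127
    fi
    # Per the RGI README, prints the version of the local CARD database.
    rgi database --version --local
}

# If an earlier load left stale files behind, the README's
# `rgi clean --local` removes the local database before reloading.
```

Calling loaded_card_version right before rgi main makes it easier to tell a stale local database apart from a genuine version mismatch.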

raphenya commented 3 years ago

@acvill Thanks for the comments. There are two major modules in RGI called rgi main and rgi bwt.

rgi main

rgi bwt