SMBGC Annotation using Neural Networks Trained on Interpro Signatures
Tool for identifying biosynthetic gene clusters (BGCs) in genomic & metagenomic data
Requires:
conda create -n sanntis sanntis
conda activate sanntis
conda activate sanntis
sanntis test/files/BGC0001472.fna
conda deactivate
SanntiS can also be run on precomputed InterProScan output together with a GenBank (GBK) file that specifies the coding sequences (CDSs). This avoids rerunning InterProScan when its results are already available.
conda activate sanntis
sanntis --ip-file test/files/BGC0001472.fna.prodigal.faa.gff3 test/files/BGC0001472.fna.prodigal.faa.gb
conda deactivate
bash ./get_ips_slim.sh
sanntis_container.py --help
sanntis_container.py [OPTIONS] ARGUMENTS
docker run -it --entrypoint bash -v <path to SanntiS/docker>/data/:/opt/interproscan quay.io/repository/microbiome-informatics/sanntis
sanntis --help
sanntis [OPTIONS] ARGUMENTS
GFF3 format file
The fields of each record are as follows:
seqname: SeqID of contig, as in prodigal output.
source: sanntis version.
feature: Feature type name, i.e. CLUSTER, CLUSTER_border, CDS.
start: Start position of the feature.
end: End position of the feature.
score: empty
strand: empty
frame: empty
attributes:
ID: ordinal ID for the cluster, beginning with 1.
nearest_MiBIG: MIBiG accession of the BGC nearest to the cluster in MIBiG space, measured by the Dice dissimilarity coefficient.
nearest_MiBIG_class: BGC class of nearest_MiBIG.
nearest_MiBIG_diceDistance: Dice dissimilarity coefficient between ID and nearest_MiBIG.
score: Post-processing probability output.
partial: Indicates whether a CLUSTER is at the edge of the contig. The first and second digits represent the 5' and 3' ends, respectively, as in prodigal's `partial`. A "0" means the cluster is not at that edge, whereas a "1" means it is at that edge (i.e. a partial cluster).
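For intuition about nearest_MiBIG_diceDistance: the Dice dissimilarity between two sets A and B is 1 - 2|A ∩ B| / (|A| + |B|), which is 0 for identical sets and 1 for disjoint ones. The sketch below only illustrates that arithmetic. The function name and the use of InterPro-style accessions as set members are assumptions; this README does not state which features SanntiS compares or how it computes the value internally.

```python
def dice_dissimilarity(a, b):
    """Return 1 - 2|A ∩ B| / (|A| + |B|) for two collections treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here: two empty sets count as identical
    return 1.0 - 2.0 * len(a & b) / (len(a) + len(b))


# Hypothetical feature sets, used only to illustrate the arithmetic.
cluster = {"IPR000001", "IPR000002", "IPR000003"}
reference = {"IPR000002", "IPR000003", "IPR000004", "IPR000005"}

# Two shared members out of 3 + 4 total: 1 - 4/7
print(round(dice_dissimilarity(cluster, reference), 3))  # 0.429
```

Under this definition, a lower value means the predicted cluster is closer to its nearest MIBiG entry.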
Sample:
##gff-version 3
DS999642 SanntiSv0.9.0 CLUSTER 1 136970 . . . ID=DS999642_sanntis_1;nearest_MiBIG=BGC0001397;nearest_MiBIG_class=NRP Polyketide;nearest_MiBIG_diceDistance=0.561;partial=10
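A record like the sample can be read with a few lines of Python. This is a minimal sketch, assuming the nine columns are tab-separated as in standard GFF3 and that attributes are semicolon-separated key=value pairs. The helper names parse_gff3_line and decode_partial are hypothetical and not part of SanntiS, and the sketch does not handle comment lines or URL-escaped characters.

```python
GFF3_COLUMNS = ["seqname", "source", "feature", "start", "end",
                "score", "strand", "frame", "attributes"]


def parse_gff3_line(line):
    """Split one tab-separated GFF3 record into a dict of fields."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(GFF3_COLUMNS):
        raise ValueError(f"expected 9 tab-separated columns, got {len(values)}")
    record = dict(zip(GFF3_COLUMNS, values))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Attributes are semicolon-separated key=value pairs.
    record["attributes"] = dict(
        pair.split("=", 1) for pair in record["attributes"].split(";") if pair
    )
    return record


def decode_partial(flag):
    """Map the two-digit partial flag to (at_5prime_edge, at_3prime_edge)."""
    return flag[0] == "1", flag[1] == "1"


# The sample record from this README, rebuilt with explicit tab separators.
sample = "\t".join([
    "DS999642", "SanntiSv0.9.0", "CLUSTER", "1", "136970", ".", ".", ".",
    "ID=DS999642_sanntis_1;nearest_MiBIG=BGC0001397;"
    "nearest_MiBIG_class=NRP Polyketide;"
    "nearest_MiBIG_diceDistance=0.561;partial=10",
])
record = parse_gff3_line(sample)
attrs = record["attributes"]

print(record["seqname"], record["end"] - record["start"] + 1)  # DS999642 136970
print(attrs["nearest_MiBIG"], float(attrs["nearest_MiBIG_diceDistance"]))
print(decode_partial(attrs["partial"]))  # (True, False): at the 5' edge only
```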
SanntiS writes its results as a GFF3 file for compatibility with downstream analysis tools. It can also generate output compatible with antiSMASH, a widely used tool for BGC analysis.
--antismash_output option
SanntiS has an --antismash_output option that creates a JSON file formatted according to the antiSMASH specification.
sanntis --antismash_output True test/files/BGC0001472.fna
Executing the command above produces a file whose name ends with the suffix antismash.json. This file can be uploaded to the antiSMASH web server under 'Data input' > 'Upload extra annotations', so that the SanntiS annotations are included in the antiSMASH analysis.
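Before uploading, it can be useful to confirm that the generated file exists and is well-formed JSON. The sketch below is one possible check, not part of SanntiS: the helper name find_antismash_json is hypothetical, and because this README does not say where the file is written or describe its internal schema, the code only matches the antismash.json suffix and tests that the contents parse.

```python
import json
from pathlib import Path


def find_antismash_json(output_dir):
    """Return paths under output_dir whose names end in 'antismash.json'
    and whose contents parse as JSON."""
    valid = []
    for path in sorted(Path(output_dir).rglob("*antismash.json")):
        try:
            with path.open(encoding="utf-8") as handle:
                json.load(handle)
        except (OSError, ValueError) as error:
            # json.JSONDecodeError is a subclass of ValueError.
            print(f"skipping {path}: {error}")
            continue
        valid.append(path)
    return valid
```

Calling find_antismash_json on the directory that holds the SanntiS output returns the files that passed the check, which can then be uploaded.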
If you use SanntiS, please cite the following publication:
Expansion of novel biosynthetic gene clusters from diverse environments using SanntiS
Santiago Sanchez, Joel D. Rogers, Alexander B. Rogers, Maaly Nassar, Johanna McEntyre, Martin Welch, Florian Hollfelder, Robert D. Finn
bioRxiv 2023.05.23.540769; doi: https://doi.org/10.1101/2023.05.23.540769