Spine is a program for identifying the conserved core genome of bacteria and other small-genome organisms.
Simply move the Spine directory to the desired location. The "scripts" directory must remain in the same directory as spine.pl.
Basic command: perl spine.pl -f genome_files.txt
For a list of options, call the script without any inputs: perl spine.pl
-f
File with a list of input sequence files. Accepted input file formats include fasta sequence files (fasta), genbank sequence + annotation files (gbk), or separate fasta sequence files with corresponding gff3-formatted annotation files (comb). This file should be formatted like so:
path/to/file1<tab>unique_identifier<tab>fasta or gbk or comb
path/to/file2<tab>unique_identifier<tab>fasta or gbk or comb
Example:
/home/seqs/PAO1.fasta PAO1 fasta
/home/seqs/LESB58.gbk LESB58 gbk
The third column (fasta, gbk, or comb) is optional, but should be given if your sequence files end with suffixes other than ".fasta" or ".gbk", or if you are providing sequences with gff3 annotation files, i.e. comb(ined).
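For larger collections it can be easier to generate this list than to type it. The following is a minimal Python sketch, not part of Spine; the function name `format_genome_list` is hypothetical. It writes one tab-separated line per genome and omits the third column when no file type is given.

```python
def format_genome_list(entries):
    """Return the text of a Spine input list.

    Each entry is (path, identifier, file_type). file_type may be None,
    in which case the optional third column is left out.
    """
    lines = []
    for path, identifier, file_type in entries:
        fields = [path, identifier]
        if file_type is not None:
            fields.append(file_type)
        # Columns must be separated by tabs, not spaces.
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"


text = format_genome_list([
    ("/home/seqs/PAO1.fasta", "PAO1", "fasta"),
    ("/home/seqs/LESB58.gbk", "LESB58", None),
])
print(text, end="")
```

The returned text can be written to a file such as genome_files.txt and passed to -f.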
If you have genomes spread across multiple files (i.e. chromosomes and/or plasmids), these can be combined by either concatenating the files into one:
cat chrom_I.gbk chrom_II.gbk > combined.gbk
or by including all the files in this input file, separated by commas:
Example:
/seqs/chrom_I.fasta,/seqs/chrom_II.fasta mygenome fasta
chrom_A.gbk,chrom_B.gbk,plasmid_X.gbk myothergenome gbk
seqA.fasta,seqB.fasta,seqA.gff3,seqB.gff3 genomeAB comb
IMPORTANT: When including multiple files for a strain or joining multiple files within a strain, please ensure that all chromosome/plasmid/contig IDs are unique across files within a single genome. If sequence IDs are duplicated, the results are likely to be wrong.
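One way to check for duplicated IDs before running Spine is sketched below. This is an illustrative helper, not something shipped with Spine: the function names are hypothetical, it handles fasta input only, and it assumes the sequence ID is the first whitespace-delimited word after ">" on each header line.

```python
from collections import Counter


def fasta_ids(lines):
    """Yield the sequence ID of each fasta record in an iterable of lines."""
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            yield fields[0] if fields else ""


def duplicated_ids(line_sources):
    """Return the sorted IDs that occur more than once across all sources.

    line_sources is an iterable of line iterables, e.g. open file handles
    for every fasta file belonging to one genome.
    """
    counts = Counter()
    for lines in line_sources:
        counts.update(fasta_ids(lines))
    return sorted(seq_id for seq_id, n in counts.items() if n > 1)


# In-memory stand-ins for two files of the same genome.
chrom_I = [">contig1 chromosome I", "ACGT"]
chrom_II = [">contig1 chromosome II", "GGCC", ">contig2", "TTAA"]
print(duplicated_ids([chrom_I, chrom_II]))
```

With real files, pass open handles instead of lists, for example `duplicated_ids(open(p) for p in paths)`. An empty result means no ID is repeated within the genome.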
-a
or --pctcore
Percentage of input genomes in which a region must be found in order to be considered core. (default: 100)
-g
or --maxdist
Maximum distance between core genome segments. Adjacent segments separated by less than this distance will be combined into a single fragment joined with N's, rather than being output as two or more separate fragments.
(default: 10)
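The effect of this option can be illustrated with a small Python sketch. It is not Spine's implementation: the function name is hypothetical, segments are assumed to be non-overlapping (start, end) pairs with 1-based inclusive coordinates, and the sketch assumes the gap is padded with one N per missing base.

```python
def merge_close_segments(segments, maxdist=10):
    """Group (start, end) segments whose gaps are shorter than maxdist.

    Returns a list of (start, end, n_count) tuples, where n_count is the
    number of gap positions assumed to be padded with N's in the fragment.
    """
    merged = []
    for start, end in sorted(segments):
        if merged:
            prev_start, prev_end, n_count = merged[-1]
            gap = start - prev_end - 1
            # "Less than" maxdist: a gap equal to maxdist is not merged.
            if 0 <= gap < maxdist:
                merged[-1] = (prev_start, end, n_count + gap)
                continue
        merged.append((start, end, 0))
    return merged


fragments = merge_close_segments([(1, 100), (106, 200), (300, 400)])
print(fragments)
```

Here the first two segments are 5 bases apart and are joined, while the third is 99 bases from the second and stays separate.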
-l
or --license
Print license information and quit.
-m
or --nucpath
Full path to the folder containing MUMmer scripts and executables, e.g. /home/applications/MUMmer/bin
(default: tries to find MUMmer in your PATH)
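To see whether MUMmer is discoverable on your PATH before relying on the default, a lookup along the following lines can help. This is a sketch in Python, not how Spine itself performs the search; the function name is hypothetical, and it only looks for the `nucmer` executable.

```python
import os
import shutil


def find_nucmer_dir():
    """Return the directory holding nucmer if it is on PATH, else None."""
    exe = shutil.which("nucmer")
    return os.path.dirname(exe) if exe else None


nucmer_dir = find_nucmer_dir()
if nucmer_dir is None:
    print("nucmer not found on PATH; pass its folder with -m")
else:
    print("nucmer found in", nucmer_dir)
```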
-r
or --refs
Reference genome sequence(s) to use as primary output source(s). This should be one or more integers corresponding to the order of the genomes given in the file above, i.e. "1" would use the first-listed sequence, "3" would use the third-listed, etc. To prioritize multiple genome sequences, separate the integers with commas, i.e. "1,3" for giving sequence 1 the highest priority and sequence 3 the next-highest priority. Reference sequences will serve as the source of backbone sequences to be output, as well as the source of backbone locus IDs, if applicable.
The number of reference genomes used will depend on the definition of core genome given by option -a. For instance, if core is determined from 10 input genomes and -a is given as 100, then core sequence will only be taken from one reference genome. If, for example, -a is given as 90 from 10 input genomes, then potentially two reference sequences will be needed: The first for sequences present in all 10 genomes and for sequences present in 9 out of 10 genomes including the first genome. The second reference sequence would then be used as the source of all sequences present in 9 out of 10 genomes, but not present in the first reference genome.
(default: reference priority will be the same as the order of genomes entered, with the first genome having the highest priority and the last genome having the lowest priority)
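The interaction between -a and -r described above can be sketched as follows. This Python example is only an illustration of the stated rule, not Spine's code: the function name is hypothetical, a region is assumed to be core when it is present in at least the -a percentage of genomes, and genomes are numbered from 1 in input order. How Spine ranks genomes that are not listed in -r is not modelled.

```python
def pick_reference(present_in, n_genomes, pctcore=100, priority=None):
    """Return the genome number that would supply a region, or None.

    present_in: set of genome numbers containing the region.
    priority:   reference order; defaults to input order 1..n_genomes.
    """
    if priority is None:
        priority = range(1, n_genomes + 1)
    # Integer comparison avoids rounding questions for the percentage.
    if len(present_in) * 100 < pctcore * n_genomes:
        return None  # not core at this threshold
    for genome in priority:
        if genome in present_in:
            return genome
    return None


all_ten = set(range(1, 11))
print(pick_reference(all_ten, 10, pctcore=90))          # region in all 10 genomes
print(pick_reference(all_ten - {1}, 10, pctcore=90))    # absent from genome 1
print(pick_reference(all_ten - {1, 2}, 10, pctcore=90)) # in only 8 genomes
```

With 10 genomes and -a 90, a region missing from the first genome falls through to the second reference, matching the example in the text.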
--mini
Produce only limited output, i.e. just the backbone sequence derived from the reference genome(s). This saves time on large data sets, especially if you only need the backbone sequence to get accessory sequences from AGEnt.
(default: core and accessory sequence sets will be output for all included genomes)
--pangenome
Produce a pangenome sequence and characteristics from sequences in the order given. This option will be ignored if the '--mini' option is given.
(default: no pangenome information will be output)
-o
or --prefix
Output prefix.
(default: "output")
-p
or --pctid
Minimum percent identity for regions to be considered homologous.
(default: 85)
-s
or --minout
Minimum size of core region sequences to be output, in bases.
(default: 10)
-t
or --threads
Number of parallel processes to run.
(default: 4)
Careful: This script does not verify the number of processors available. If you set this number higher than the number of processors you have, performance is likely to be significantly degraded.
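Because Spine does not make this check itself, a wrapper can cap the thread count before launching it. The Python sketch below is one possible approach; the function names are hypothetical, and only the -f, -t and -o options documented here are used.

```python
import os


def safe_thread_count(requested, available=None):
    """Clamp the requested thread count to the processors available."""
    if available is None:
        available = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, min(requested, available))


def spine_command(list_file, threads=4, prefix="output"):
    """Build the argument list for a Spine run with a capped -t value."""
    return [
        "perl", "spine.pl",
        "-f", list_file,
        "-t", str(safe_thread_count(threads)),
        "-o", prefix,
    ]


print(safe_thread_count(8, available=4))
print(spine_command("genome_files.txt", threads=1))
```

The resulting list can be handed to `subprocess.run` from the directory containing spine.pl.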
-v
or --version
Print version information and quit.
Nucmer Options
Advanced use only. Little reason to change defaults in most situations.
See MUMmer documentation for more information.
--breaklen
Integer (default: 200)
--mincluster
Integer (default: 65)
--diagdiff
Integer (default: 5)
--diagfactor
Float (default: 0.12)
--minmatch
Integer (default: 20)
--nosimplify
(default: simplify)
statistics.txt
First line shows the current software version used.
Second line shows the input parameters given to the software.
Column headers and descriptions:
coords.txt
Coordinates of genome sequences.
".accessory_coords.txt": Accessory genome sequences for the indicated strain
".core_coords.txt": Core genome sequences for the indicated strain
"backbone_coords.txt": Core genome sequences for the group of strains
"pangenome_coords.txt" (if requested): Pangenome sequences for the group of strains
Column headers and descriptions:
*.fasta
Nucleotide sequences of the genome segments output by Spine. Original sources of the sequences can be determined by cross-referencing the sequence IDs with the corresponding coords.txt file.
loci.txt (if annotated genbank file was provided for one or more genomes)
List of coding sequences found in the core genome.
".accessory_loci.txt": Accessory genome coding sequences for the indicated strain
".core_loci.txt": Core genome coding sequences for the indicated strain
"backbone_loci.txt": Core genome coding sequences for the group of strains
"pangenome_loci.txt" (if requested): Pangenome coding sequences for the group of strains
Column headers and descriptions:
position_counts.txt
This file should not be needed for routine use. It is meant to be used as input for core_and_pangenome.pl to calculate core-, pan-, and new genome sizes at permutations of the input genomic sequences.
Column headers and descriptions:
Spine Copyright (C) 2016-2018 Egon A. Ozer
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. See LICENSE.txt
Contact Egon Ozer with questions or comments.