TSTanabe / HMSS2

HMSS2
GNU General Public License v3.0

HMSS2

HMSS2: a tool for the identification of sulfur metabolism-related genes and the analysis of operon structures in genome and metagenome assemblies. The tool searches fasta files for sulfur metabolism-associated proteins using hidden Markov models (HMMs) and defined threshold scores. The genes of the detected proteins are then analyzed for their position in the genome. A detected gene cluster is named with a keyword if it matches a known gene cluster pattern. HMSS2 can also be extended with other compatible HMMs.

Installing HMSS2 on Linux

You can install HMSS2 by downloading it directly from GitHub in compiled or non-compiled form.

  1. Download the latest release from GitHub

  2. In a terminal, 'cd' to the downloaded package

  3. Extract the files

  4. Test that it runs with './HMSS2/HMSSS -h' for the precompiled version or 'python HMSS2/HMSSS.py -h' for the uncompiled version

  5. Install the external programs HMSS2 depends on:

    5.1 Prodigal, for example with 'sudo apt-get install -y prodigal', for the translation of nucleotide fasta files

    5.2 HMMER3, for example with 'sudo apt-get install -y hmmer', for detection and annotation

    5.3 Biopython with 'pip install biopython', only if you are using the uncompiled version

  6. That's it! You can now run HMSS2 on a directory of protein sequence fasta files with gff files or nucleotide fasta files

Running HMSS2

To run HMSS2 on your own data, type in the command line:

./HMSSS -f Directory

To extend an already existing database, specify it with the -db option.

./HMSSS -f Directory -db /path/to/Database.db

Replace "Directory" with the directory containing your input fasta files, with one file per species. File names are later used as genome identifiers, so each file name should match the identifier of its genome. When using protein fasta files with gff files, the names of the related files should be the same except for the file extension. HMSS2 will look for input fasta files with any of the following filename extensions:

If any file with the .fna extension is present, HMSS2 will try to translate it to protein fasta via Prodigal. Annotation and gene cluster prediction are then performed automatically.
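The naming rule above (one genome identifier per file name, with related files differing only in extension) can be checked before a run. The following is a minimal sketch, not part of HMSS2: the helper names are hypothetical, and the ".faa" and ".gff" extensions are assumptions used for illustration, since only ".fna" is named in this text.

```python
from collections import defaultdict
from pathlib import Path


def group_by_genome(filenames):
    """Map each genome identifier (file name without its last extension)
    to the set of extensions found for it."""
    groups = defaultdict(set)
    for name in filenames:
        path = Path(name)
        groups[path.stem].add(path.suffix)
    return dict(groups)


def unpaired_proteins(groups, protein_ext=".faa", gff_ext=".gff"):
    """Return genome identifiers that have a protein fasta file but no gff file.

    The default extensions are assumptions for this sketch.
    """
    return sorted(
        genome
        for genome, extensions in groups.items()
        if protein_ext in extensions and gff_ext not in extensions
    )


if __name__ == "__main__":
    files = ["genomeA.faa", "genomeA.gff", "genomeB.faa", "genomeC.fna"]
    print(unpaired_proteins(group_by_genome(files)))
```

Genomes reported by unpaired_proteins would need a matching gff file (or a nucleotide fasta instead) before they can be processed as described above.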

HMSS2 process overview. External programs are Prodigal and HMMER3. The input can consist of either assemblies with nucleotide sequences or protein sequences in fasta format with corresponding GFF3 files. New features of HMSS2 are marked in yellow.

Fasta and gff file format examples

If protein fasta files and corresponding .gff files are provided, the translation via Prodigal is skipped. The header up to the first blank is used to identify the protein sequence. In the following example of a protein fasta file generated by Prodigal, this would be the identifier NZ_CABKUE010000001.1_1.

>NZ_CABKUE010000001.1_1 # 3 # 320 # -1 # ID=1_1;partial=10;start_type=ATG;rbs_motif=AGGAGG;rbs_spacer=5-10bp;gc_cont=0.553
MMVLEWIRFLAGAVCMAAGLVFYIIQFIGVFRFKYVLNRMHAAAIGDTLGCGLMLLGAVV
FNGFTFPSVKILFLIVFLWMTSPVAAHMVVKLEVLSREKMEDCCPV
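The identifier rule described above (everything in the header up to the first blank) can be sketched in a few lines. This is an illustration only; the function name is hypothetical and it is not taken from the HMSS2 source.

```python
def protein_identifier(header: str) -> str:
    """Return the text between the leading '>' and the first whitespace."""
    if header.startswith(">"):
        header = header[1:]
    parts = header.split(None, 1)  # split on the first run of whitespace
    if not parts:
        raise ValueError("fasta header contains no identifier")
    return parts[0]


if __name__ == "__main__":
    example = ">NZ_CABKUE010000001.1_1 # 3 # 320 # -1 # ID=1_1;partial=10"
    print(protein_identifier(example))
```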

The .gff file should have the same gff3 format as generated by Prodigal and used by NCBI. The first column indicates the contig on which the gene is located. The protein identifier is given in the ninth column as ID=cds-[...] and is separated from other attributes by the ; character. In the following example of a gff file generated by Prodigal, the identifier would be 1_1.

NZ_CABKUE010000001.1    Prodigal_v2.6.3 CDS 3   320 60.2    -   0   ID=1_1;partial=10;start_type=ATG;rbs_motif=AGGAGG;rbs_spacer=5-10bp;gc_cont=0.553;conf=100.00;score=60.24;cscore=37.26;sscore=22.98;rscore=15.37;uscore=3.41;tscore=3.01;

Since the identifiers that Prodigal writes to the protein sequence fasta file and to the .gff3 file do not match by default, it may be necessary to adjust them. If Prodigal is used via HMSS2, this is done automatically after the translation by Prodigal. An example is shown below:

NZ_CABKUE010000001.1    Prodigal_v2.6.3 CDS 3   320 60.2    -   0   ID=NZ_CABKUE010000001.1_1;

If the line ends directly after the identifier, the trailing ; character is not required.
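When Prodigal was run outside of HMSS2, the adjustment has to be done by hand. The sketch below shows one way to do it, based on the two examples above: the fasta identifier is the contig name from the first gff column plus the gene number that follows the last underscore of the Prodigal ID. The function name is hypothetical and this is not the code HMSS2 itself uses.

```python
def match_gff_id_to_fasta(line: str) -> str:
    """Rewrite the ID attribute of a Prodigal gff3 feature line so that it
    equals the identifier used in the Prodigal protein fasta file."""
    line = line.rstrip("\n")
    columns = line.split("\t")
    if len(columns) != 9:
        return line  # comment, directive or other non-feature line

    contig = columns[0]
    attributes = columns[8].split(";")
    for index, attribute in enumerate(attributes):
        if attribute.startswith("ID="):
            # Prodigal writes ID=<contig ordinal>_<gene number>
            gene_number = attribute[3:].rsplit("_", 1)[-1]
            attributes[index] = f"ID={contig}_{gene_number}"
            break
    columns[8] = ";".join(attributes)
    return "\t".join(columns)


if __name__ == "__main__":
    feature = "\t".join(
        ["NZ_CABKUE010000001.1", "Prodigal_v2.6.3", "CDS", "3", "320",
         "60.2", "-", "0", "ID=1_1;partial=10;start_type=ATG;"]
    )
    print(match_gff_id_to_fasta(feature))
```

Note that the columns of a gff3 file must be separated by tabs; lines that do not have nine tab-separated columns are returned unchanged.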

Retrieve sequences from database examples

Protein sequences detected by the hidden Markov models can be retrieved with -fd followed by the names of the desired proteins. All proteins of a named gene cluster can be retrieved with the -fk option followed by the keywords of the desired gene cluster(s). Combining both options retrieves all sequences matching both the given keyword(s) and the given domain(s). All of these options require the -db option to specify the local database.

If sequences of one or more proteins are to be output from the local database, the options -db and -fd are required. By default, the local database is located in the results folder of HMSS2. The corresponding path is also printed in the terminal during the initial search. The desired protein types are written after the -fd option, separated by blanks. The names can also be incomplete. An input of Dsr would therefore output all proteins that have Dsr in their designation (oxDsrA, redDsrA, oxDsrB ...).

./HMSSS -db /path/to/database.db -fd protein_1 protein_2 protein_3
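The effect of an incomplete name can be pictured as a substring match against the protein designations stored in the database. The sketch below only illustrates that idea; the function name is hypothetical, and whether HMSS2 matches case-sensitively is an assumption not stated in this text.

```python
def match_protein_names(query_terms, known_names):
    """Return every known name that contains at least one query term.

    Assumption: plain, case-sensitive substring matching.
    """
    return [
        name
        for name in known_names
        if any(term in name for term in query_terms)
    ]


if __name__ == "__main__":
    names = ["oxDsrA", "redDsrA", "oxDsrB", "SoxB", "AprM"]
    print(match_protein_names(["Dsr"], names))
```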

Analogously, all protein sequences of a certain named gene cluster type can be output. This is specified via -fk.

./HMSSS -db /path/to/database.db -fk cluster_type1 cluster_type2

Result files

The results of each search are stored in the local database located at scripts/results, in separate project folders. For each search, the storage location of the HMSS2 database is printed in the terminal. Output from a database specified with the -db option is generated on request via the -fd or -fk options (see also Command options). Each request generates a new folder with a unique timestamp that includes all generated files:

Command options

HMSSS also comes with several options, which are described in the help accessed by -h.

Define HMM library, gene cluster patterns file and other run options (optional)

These options allow you to change the hidden Markov models used and set up custom libraries, thresholds, and reference gene clusters. By default these are set to the HMSS2 library and thresholds.

Result files and output (optional)

Results are stored in a local database, which can be accessed to retrieve different results of interest. The local database can also be extended by later searches. If no database is defined before the search, HMSS2 will create a new local database for each run.

Sequence FASTA File output

Output from a database requires the -db option to define the database from which the desired output is taken. As the output is normally a set of sequences of a certain protein, possibly with a defined genomic vicinity or from a specified taxonomic group, there are several options to limit the number of retrieved sequences. A new folder is created in the result directory for each attempt to retrieve sequences.

Limiting output to certain genomes:

Sequences for proteins can be retrieved and written to fasta files with the following options. The -fd and -fk options will also automatically generate iTOL datasets and presence/absence matrices, identical to those of the dataset generating options:

Each file contains only one type of protein sequence. The output includes several files with reports of the written sequences and some subsets for proteins. Each file starts with a short summary of the given command, followed by the name of the protein whose sequences were written to the file. Proteins with more than one domain previously detected by HMSSS are written to separate files, which are named after all detected domains. Furthermore, two subsets are prepared:

Dataset output without sequences (optional)

Information about the presence of given proteins and/or keywords in a taxon or species can be retrieved and written to tab separated files. This also includes iTOL dataset compatible files, but sequences will not be retrieved:

The output contains the number of occurrences of the desired proteins/keywords at each taxonomy level as absolute and relative values, the latter normalized to the number of genomes in the given taxonomy level. An iTOL binary dataset is also output, with the specified names consisting of the genome identifiers and the taxonomic lineage.

Processing result files (optional)

Protein sequences are written to files with identifiers retrieved from the local database. These FASTA formatted files can be directly used or are the basis for further file generation. HMSS2 comes with further tools to create additional files based on the initial output:

Extending the HMM library and gene cluster patterns

The gene patterns, the HMM library and the cutoff score file can be found in the src directory. New hidden Markov models (HMMs) can be generated with the hmmbuild program from the HMMER3 package. In this case the score cutoffs have to be set manually. Alternatively, predefined HMMs can be downloaded from databases like Pfam or TIGRFAMs and added to a library by simply concatenating the files of each HMM. Cutoff scores are commonly available on the corresponding website.

The cutoff scores are listed in the Thresholds file. This file is a four column tab separated file. The first column contains the name of the HMM, followed by the standard cutoff score, the noise cutoff and the trusted cutoff. The noise cutoff corresponds to the score of the most dissimilar sequence, that is, the true positive with the lowest score. Using this score results in a very sensitive but imprecise detection. In contrast, the trusted cutoff corresponds to the score of the last true positive sequence before a false positive hit occurs. Using this score results in a precise but insensitive detection. The standard cutoff is the cutoff score with the best tradeoff between sensitivity and precision.

If custom HMMs are required, new lines in the same format can be appended to the Thresholds file:

AprM    40  338.2   35.2
AsrA    237 328.8   162.2
New_HMM 218 220 58.1
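Because a malformed line is easy to introduce when appending by hand, it can help to check the file shape before a run. The following sketch only verifies what is stated above (four tab separated columns, a name followed by three numeric scores); it does not interpret which score is which. The function name is hypothetical and not part of HMSS2.

```python
import csv
import io


def read_thresholds(handle):
    """Return {hmm_name: (score_2, score_3, score_4)} from a Thresholds file.

    Raises ValueError for lines that do not have exactly four
    tab-separated columns or whose scores are not numeric.
    """
    thresholds = {}
    reader = csv.reader(handle, delimiter="\t")
    for line_number, row in enumerate(reader, start=1):
        if not row:
            continue  # skip empty lines
        if len(row) != 4:
            raise ValueError(
                f"line {line_number}: expected 4 tab-separated columns, got {len(row)}"
            )
        name, *scores = row
        thresholds[name] = tuple(float(score) for score in scores)
    return thresholds


if __name__ == "__main__":
    example = "AprM\t40\t338.2\t35.2\nNew_HMM\t218\t220\t58.1\n"
    print(read_thresholds(io.StringIO(example)))
```

A line that was typed with spaces instead of tabs is reported as having a single column, which is a common cause of silently ignored custom thresholds in tab separated formats.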

To run HMSS2 with additional custom HMMs or HMMs from public databases, concatenate all desired HMMs into a single library file and specify this library with the -l option. The score thresholds of all HMMs have to be written to a single tab separated file. The threshold file can be set with the -t option followed by the path to the corresponding threshold file.
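HMMER3 profile files are plain text, so building a library amounts to writing the files one after another. A minimal sketch of that step is shown below; the function name and file names are hypothetical, and the same result can be obtained with any tool that concatenates files.

```python
from pathlib import Path


def concatenate_hmms(hmm_paths, library_path):
    """Write all given HMM files, in order, into one library file."""
    with open(library_path, "wb") as library:
        for hmm_path in hmm_paths:
            data = Path(hmm_path).read_bytes()
            library.write(data)
            # Keep records on separate lines if a file lacks a final newline.
            if data and not data.endswith(b"\n"):
                library.write(b"\n")


if __name__ == "__main__":
    import sys

    # Usage: python concat_hmms.py library.hmm first.hmm second.hmm ...
    concatenate_hmms(sys.argv[2:], sys.argv[1])
```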

All gene cluster patterns to be recognized have to be stored in a tab separated file. The first column defines the keyword, which is assigned to gene clusters matching the pattern. The following columns define the names of the HMMs that must be present in a gene cluster for it to be recognized. All columns have to be tab separated. Keywords are not required to be unique, and the same keyword can match different patterns:

Sox SoxX    SoxY    SoxZ    SoxA    SoxB    SoxC    SoxD
oxDsr   oxDsrA  oxDsrB  oxDsrC  DsrE    DsrF    DsrH
sHdr    sHdrC1  sHdrB1  sHdrA   sHdrH   sHdrC2  sHdrB2
sHdr    sHdrC1  sHdrB1  sHdrA   sHdrH   sEtfA   sEtfB
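Since a keyword may appear on several lines, a reader of this file has to keep a list of patterns per keyword rather than a single pattern. The sketch below illustrates that data structure for the format described above; the function name is hypothetical and this is not the parser used by HMSS2.

```python
from collections import defaultdict


def read_patterns(lines):
    """Return {keyword: [pattern, ...]}, where each pattern is the list of
    HMM names from one tab-separated line."""
    patterns = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or not fields[0]:
            continue  # skip empty lines and lines without HMM names
        keyword, *hmm_names = fields
        patterns[keyword].append(hmm_names)
    return dict(patterns)


if __name__ == "__main__":
    example = [
        "oxDsr\toxDsrA\toxDsrB\toxDsrC\tDsrE\tDsrF\tDsrH\n",
        "sHdr\tsHdrC1\tsHdrB1\tsHdrA\tsHdrH\tsHdrC2\tsHdrB2\n",
        "sHdr\tsHdrC1\tsHdrB1\tsHdrA\tsHdrH\tsEtfA\tsEtfB\n",
    ]
    for keyword, keyword_patterns in read_patterns(example).items():
        print(keyword, len(keyword_patterns))
```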

The path to a custom gene cluster patterns file can be set with the -p option followed by the corresponding path.

Taxonomy assignment

Taxonomic assignments from the GTDB taxonomy can be performed via the -gtdb option, specifying the respective tab separated file. Alternatively, a custom tab separated file including the following column headers can be used with the -cutax option:

genomeID superkingdom clade phylum class order family genus species strain taxid biosample bioproject genbank refseq completeness contamination typestrain

Only the first column must contain a value; all other columns may be left empty. The typestrain column must contain either the value 0 or 1 to be evaluable. The contamination and completeness columns should contain values between 0 and 100.
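The rules above can be checked row by row before passing a custom file to HMSS2. The following is a minimal sketch under the constraints stated here; the function name is hypothetical, and treating empty optional cells as valid is an assumption based on the statement that only the first column must contain a value.

```python
COLUMNS = (
    "genomeID superkingdom clade phylum class order family genus species "
    "strain taxid biosample bioproject genbank refseq completeness "
    "contamination typestrain"
).split()


def check_taxonomy_row(row):
    """Return a list of problems found in one row (a dict keyed by COLUMNS)."""
    problems = []
    if not row.get("genomeID"):
        problems.append("genomeID is empty")

    typestrain = row.get("typestrain") or ""
    if typestrain not in ("", "0", "1"):
        problems.append(f"typestrain must be 0 or 1, got {typestrain!r}")

    for column in ("completeness", "contamination"):
        value = row.get(column) or ""
        if not value:
            continue  # empty cells are allowed
        try:
            in_range = 0 <= float(value) <= 100
        except ValueError:
            in_range = False
        if not in_range:
            problems.append(f"{column} must be between 0 and 100, got {value!r}")
    return problems


if __name__ == "__main__":
    row = dict.fromkeys(COLUMNS, "")
    row.update(genomeID="genomeA", completeness="98.5", typestrain="1")
    print("\t".join(COLUMNS))  # header line of the custom file
    print(check_taxonomy_row(row))
```

Rows read with csv.DictReader(handle, delimiter="\t") from the Python standard library have the dictionary shape this function expects.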