biocore / microprot

structural annotation pipeline for microbial genomes and metagenomes
BSD 3-Clause "New" or "Revised" License

microprot

microprot is written in Python 3.x.

Introduction

microprot clusters and annotates protein sequences from microbial genomes and metagenomes, with the ultimate goal of predicting the 3-dimensional structure and function of these proteins.

Install

Requirements

Some of the tools and databases we use were developed externally and cannot be installed automatically. Please download and install them yourself, then update the corresponding paths in paths.yml.

dbs

tools

Tools requiring manual installation are listed and linked below:

Naming conventions

Filenames

All filenames have the form GenomeID_GeneID_ResiduesFrom-ResiduesTo, and the files contain amino acid sequences.
For example, CP003179.1_3319 means gene 3319 from genome CP003179.1 (Sulfobacillus acidophilus DSM 10332), and CP003179.1_3319_1-60 means amino acids 1 to 60 of that gene.
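
The naming convention can be illustrated with a small parser. This is only a sketch and not part of microprot: the `FragmentName` type and `parse_name` function are hypothetical names, and the pattern assumes that gene IDs contain no underscores and that the residue range is optional, as in the two examples above.

```python
import re
from typing import NamedTuple, Optional


class FragmentName(NamedTuple):
    """Parts of a GenomeID_GeneID[_From-To] name (hypothetical helper type)."""
    genome_id: str
    gene_id: str
    start: Optional[int]  # first residue, 1-based; None if no range is given
    end: Optional[int]    # last residue; None if no range is given


# Assumption: the gene ID has no underscores; the genome ID may contain them.
_NAME_RE = re.compile(
    r'^(?P<genome>.+?)_(?P<gene>[^_]+?)(?:_(?P<start>\d+)-(?P<end>\d+))?$'
)


def parse_name(name: str) -> FragmentName:
    """Split a sequence name into genome ID, gene ID and optional range."""
    m = _NAME_RE.match(name)
    if m is None:
        raise ValueError(f"not a GenomeID_GeneID[_From-To] name: {name!r}")
    start = int(m.group('start')) if m.group('start') else None
    end = int(m.group('end')) if m.group('end') else None
    return FragmentName(m.group('genome'), m.group('gene'), start, end)


if __name__ == '__main__':
    print(parse_name('CP003179.1_3319'))
    print(parse_name('CP003179.1_3319_1-60'))
```

File extensions should be stripped before calling `parse_name`, since the pattern treats everything after the last underscore-separated field as part of the name.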

File extensions

Example

Gene CP00000.0_1 (CP00000.0_1.fasta), which has 100 residues, is searched with HHsearch, which returns two outputs: CP00000.0_1.out and CP00000.0_1.a3m. The sequence split parameters are:

min_prob: 90.0
min_fragment_length: 10

and the hit list portion of CP00000.0_1.out is:

[7 lines of input parameters summary]

No Hit                             Prob E-value P-value  Score    SS Cols Query HMM  Template HMM
 1 1ABC_A Uncharacterized protein  91.5   0.001   0.001   24.3   0.0   20   10-30    211-231 (260)
 2 1BCD_A Uncharacterized protein  90.3   0.001   0.001   26.4   0.0   55   33-88    28-83  (149)
 3 1CDE_A Uncharacterized protein  85.3     0.2   0.001   26.4   0.0   55   43-98    28-83  (149)
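
For illustration, the probability and Query HMM range of each hit can be pulled out of lines shaped like the ones above. The sketch below is not microprot's own parser: `HIT_RE` and `parse_hit_list` are hypothetical names, and the pattern assumes whitespace-separated columns exactly as in this example, which may not hold for every real HHsearch output file.

```python
import re

# Assumption: columns are separated by whitespace, as in the example hit list.
HIT_RE = re.compile(
    r'^\s*(?P<no>\d+)\s+(?P<hit>\S+)\s+(?P<desc>.*?)\s+'
    r'(?P<prob>\d+(?:\.\d+)?)\s+(?P<evalue>\S+)\s+(?P<pvalue>\S+)\s+'
    r'(?P<score>-?\d+(?:\.\d+)?)\s+(?P<ss>-?\d+(?:\.\d+)?)\s+(?P<cols>\d+)\s+'
    r'(?P<qstart>\d+)-(?P<qend>\d+)\s+(?P<tstart>\d+)-(?P<tend>\d+)\s*'
    r'\((?P<tlen>\d+)\)\s*$'
)

EXAMPLE = """\
No Hit                             Prob E-value P-value  Score    SS Cols Query HMM  Template HMM
 1 1ABC_A Uncharacterized protein  91.5   0.001   0.001   24.3   0.0   20   10-30    211-231 (260)
 2 1BCD_A Uncharacterized protein  90.3   0.001   0.001   26.4   0.0   55   33-88    28-83  (149)
 3 1CDE_A Uncharacterized protein  85.3     0.2   0.001   26.4   0.0   55   43-98    28-83  (149)
"""


def parse_hit_list(text):
    """Return (probability, query_start, query_end) for every hit line.

    Lines that do not look like hit rows (header, blank lines) are skipped.
    """
    hits = []
    for line in text.splitlines():
        m = HIT_RE.match(line)
        if m:
            hits.append(
                (float(m.group('prob')),
                 int(m.group('qstart')),
                 int(m.group('qend')))
            )
    return hits


if __name__ == '__main__':
    for hit in parse_hit_list(EXAMPLE):
        print(hit)
```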

According to our criteria, hits 1 and 2 are matches (probability >= 90.0 and fragment length, taken from the Query HMM column, >= 10). Hit 3 is rejected because its probability (85.3) is below min_prob.
So the CP00000.0_1.match file will contain the sequences:

>CP00000.0_1_10-30
EXAMPLEEXAMPLEEXAMPL
>CP00000.0_1_33-88
EXAMPLEEXAMPLEEXAMPLEEXAMPLEEXAMPLEEXAMPLEEXAMPLEEXAMPL

and CP00000.0_1.non_match will contain the sequence:

>CP00000.0_1_89-100
EXAMPLEEXAMP

Sub-sequences CP00000.0_1_1-9 and CP00000.0_1_31-32 will be dropped from subsequent analyses because they are shorter than the minimum fragment length of 10 residues.
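
The splitting rule in this example can be sketched in a few lines of Python. This is an illustration of the criteria described above, not microprot's actual implementation: `split_sequence` is a hypothetical function, hits are given as (probability, query start, query end) tuples, and fragment length is assumed to be end - start + 1.

```python
def split_sequence(seq_len, hits, min_prob=90.0, min_fragment_length=10):
    """Partition residues 1..seq_len into match, non_match and dropped ranges.

    hits: iterable of (probability, query_start, query_end), 1-based inclusive.
    Returns three lists of (start, end) tuples.
    """
    # A hit is a match if it is both confident enough and long enough.
    matches = sorted(
        (start, end)
        for prob, start, end in hits
        if prob >= min_prob and end - start + 1 >= min_fragment_length
    )

    non_matches, dropped = [], []
    cursor = 1  # first residue not yet covered by a match
    # The sentinel range just past the sequence end flushes the trailing gap.
    for start, end in matches + [(seq_len + 1, seq_len + 1)]:
        if start > cursor:
            gap = (cursor, start - 1)
            if gap[1] - gap[0] + 1 >= min_fragment_length:
                non_matches.append(gap)
            else:
                dropped.append(gap)  # too short to analyse further
        cursor = max(cursor, end + 1)
    return matches, non_matches, dropped


if __name__ == '__main__':
    example_hits = [(91.5, 10, 30), (90.3, 33, 88), (85.3, 43, 98)]
    matches, non_matches, dropped = split_sequence(100, example_hits)
    print('match:    ', matches)
    print('non_match:', non_matches)
    print('dropped:  ', dropped)
```

With the three hits from the example and a 100-residue gene, the sketch reports 10-30 and 33-88 as matches and 89-100 as the only non_match range; uncovered stretches shorter than min_fragment_length end up in the dropped list.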