ctmrbio / BACTpipe

BACTpipe: An assembly and annotation pipeline for bacterial genomics
https://bactpipe.readthedocs.org
MIT License

InterProScan 5 #94

Closed: boulund closed this issue 3 years ago

boulund commented 6 years ago

Would it be interesting to run InterProScan 5 on the protein sequences produced by Prokka?

It is a fairly sizeable download (almost 9 GB), but it comes with the following databases out of the box:

               TIGRFAM (15.0) : TIGRFAMs are protein families based on Hidden Markov Models or HMMs
                     SFLD (4) : SFLDs are protein families based on Hidden Markov Models or HMMs
           SUPERFAMILY (1.75) : SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.
               Gene3D (4.2.0) : Structural assignment for whole genes and genomes using the CATH domain structure database
              Hamap (2018_03) : High-quality Automated and Manual Annotation of Microbial Proteomes
                Coils (2.2.1) : Prediction of Coiled Coil Regions in Proteins
    ProSiteProfiles (2018_02) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them
                  SMART (7.1) : SMART allows the identification and analysis of domain architectures based on Hidden Markov Models or HMMs
                   CDD (3.16) : Prediction of CDD domains in Proteins
                PRINTS (42.0) : A fingerprint is a group of conserved motifs used to characterise a protein family
    ProSitePatterns (2018_02) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them
                  Pfam (31.0) : A large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs)
              ProDom (2006.1) : ProDom is a comprehensive set of protein domain families automatically generated from the UniProt Knowledge Database.
             MobiDBLite (2.0) : Prediction of disordered domains Regions in Proteins
                 PIRSF (3.02) : The PIRSF concept is being used as a guiding principle to provide comprehensive and non-overlapping clustering of UniProtKB sequences into a hierarchical order to reflect their evolutionary relationships.

I just thought of it this morning, after hearing from Jonatan that he wanted to search for some PROSITE patterns in his data.

We could consider adding an optional step at the end of the workflow that detects whether InterProScan is installed and, if so, runs it with default settings on the protein sequences that Prokka produces for each sample.
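To make the idea concrete, here is a rough Python sketch of such an optional step. It is only an illustration, not how BACTpipe implements anything: the function names, the `dry_run` switch, and the output layout are hypothetical. It assumes the InterProScan launcher is called `interproscan.sh` and is on `PATH`, and that each sample has a Prokka protein FASTA (`.faa`) file. The `-i`, `-f`, `-d`, and `-appl` options are the standard InterProScan 5 command-line options for input file, output format, output directory, and analysis selection.

```python
import shutil
import subprocess
from pathlib import Path


def build_interproscan_command(executable, faa_path, output_dir, applications=None):
    """Return the argument list for one InterProScan run on a protein FASTA file."""
    cmd = [
        str(executable),
        "-i", str(faa_path),    # protein FASTA, e.g. the .faa file written by Prokka
        "-f", "tsv",            # tab-separated output
        "-d", str(output_dir),  # directory that receives the result files
    ]
    if applications:
        # Optionally restrict the run to selected analyses, e.g. ["ProSitePatterns"]
        cmd += ["-appl", ",".join(applications)]
    return cmd


def run_interproscan_if_available(faa_paths, output_dir,
                                  executable="interproscan.sh",
                                  applications=None, dry_run=False):
    """Run InterProScan once per sample, or skip if it is not installed.

    Returns the commands that were executed (or would be, with dry_run=True).
    """
    resolved = shutil.which(executable)
    if resolved is None:
        # Optional step: InterProScan was not found on PATH, so do nothing.
        return []

    commands = []
    for faa in faa_paths:
        cmd = build_interproscan_command(resolved, faa, output_dir, applications)
        commands.append(cmd)
        if not dry_run:
            Path(output_dir).mkdir(parents=True, exist_ok=True)
            subprocess.run(cmd, check=True)
    return commands
```

Calling `run_interproscan_if_available(["sample1.faa"], "interproscan_out")` would run the default set of analyses when InterProScan is installed and return an empty list otherwise. Passing `applications=["ProSitePatterns"]` would limit the run to PROSITE patterns.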

boulund commented 6 years ago

One thing that could be annoying is that it would probably increase the total runtime of BACTpipe dramatically. I think InterProScan 5 will take quite a while to process a large number of protein sequences.

boulund commented 3 years ago

I think running InterProScan is outside the scope of BACTpipe, so I'm closing this. We can reopen if we want to discuss this more in the future.