Micromeda / pygenprop

A python library for programmatic usage of EBI InterPro Genome Properties.
http://pygenprop.rtfd.io/
Apache License 2.0

Pygenprop

Pygenprop is a Python library for programmatic exploration and usage of the EBI Genome Properties database.

Features

At its core, the library contains five major components:

Installation

Pygenprop is compatible with Python 3.6 or higher (3.5 may work, but it is not tested). Requirements can be found in environment.yaml.

To install from PyPI

pip install pygenprop

To install from Conda

conda install -c conda-forge -c lbergstrand pygenprop

To install from source (for development)

cd /path/to/pygenprop_source_dir
pip install .

Acquiring Genome Properties Data

Before Pygenprop can assign genome properties to an organism, it needs information from the Genome Properties database. The easiest way to get this information is to parse a Genome Properties database release file, genomeProperties.txt, which is located in the flatfiles folder of the EBI Genome Properties GitHub repository. For each release of Genome Properties, a genomeProperties.txt file is generated from the description files of all public genome properties.

Acquiring Release Files

genomeProperties.txt files can be downloaded from the URLs in the compatibility section below using a web browser or UNIX commands such as wget or curl. They can also be streamed directly into Jupyter notebooks using the requests Python library. Code for streaming the database into a Jupyter notebook can be found here.

Compatibility

Pygenprop will be continually updated to take into account changes in the schema of the Genome Properties database. Below is a compatibility table that maps between Genome Properties and Pygenprop releases.

Genome Properties Release | genomeProperties.txt URL | Compatible Pygenprop Release
1.1 | https://raw.githubusercontent.com/ebi-pf-team/genome-properties/rel1.1/flatfiles/genomeProperties.txt | >= 0.6
2.0 | https://raw.githubusercontent.com/ebi-pf-team/genome-properties/rel2.0/flatfiles/genomeProperties.txt | >= 0.6
Latest | https://raw.githubusercontent.com/ebi-pf-team/genome-properties/master/flatfiles/genomeProperties.txt | >= 0.6
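
The URLs in the compatibility table follow a single pattern, so a release file can also be fetched from a script. The following is a minimal sketch using only the Python standard library; the helper names release_file_url and download_release_file are hypothetical and are not part of Pygenprop.

```python
from urllib.request import urlopen

# URL pattern taken from the compatibility table. The helper names below
# are hypothetical and are not part of Pygenprop.
RELEASE_URL = ('https://raw.githubusercontent.com/ebi-pf-team/genome-properties/'
               '{ref}/flatfiles/genomeProperties.txt')

# Git reference used in the URL of each release listed in the table.
RELEASE_REFS = {'1.1': 'rel1.1', '2.0': 'rel2.0', 'latest': 'master'}


def release_file_url(release='latest'):
    """Return the genomeProperties.txt URL for a release listed in the table."""
    return RELEASE_URL.format(ref=RELEASE_REFS[release.lower()])


def download_release_file(release='latest', out_path='genomeProperties.txt'):
    """Download the release file to out_path (requires network access)."""
    with urlopen(release_file_url(release)) as response, \
            open(out_path, 'wb') as out_file:
        out_file.write(response.read())
    return out_path
```

Calling download_release_file('2.0') would save the release 2.0 file as genomeProperties.txt in the working directory, where it can then be opened and parsed.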

Accessing Non-public Properties

The ./data folder of the EBI Genome Properties GitHub repository contains a series of folders with information about both public and non-public genome properties. Each folder contains a description (DESC) file and a status (status) file. The status file records whether a property is public (public: 0 means that the property is not public), so these files can be used to find non-public properties. The description files for non-public properties can be parsed with the same parser that is used for genomeProperties.txt. Each genome property object that results from parsing a description file has an attribute called public, which can be set to True or False to designate the property as public or not.

property_one.public = False 
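
Finding the non-public properties means reading every status file in the data folder. Below is a rough sketch of that scan using only the standard library. It assumes each status file holds simple "key: value" lines, one of which is the public entry described above; the function names are hypothetical and not part of Pygenprop.

```python
from pathlib import Path


def is_public(status_path):
    """Return False if the status file contains a 'public' entry set to 0."""
    for line in Path(status_path).read_text().splitlines():
        key, _, value = line.partition(':')
        if key.strip().lower() == 'public':
            return value.strip() != '0'
    return True  # Assumption: a file without a 'public' entry counts as public.


def find_non_public_properties(data_dir):
    """Return the names of property folders whose status file marks them non-public."""
    return sorted(status.parent.name
                  for status in Path(data_dir).glob('*/status')
                  if not is_public(status))
```

The DESC files in the folders returned by find_non_public_properties('./data') are then the ones to parse.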

Acquiring Annotation Data

Pygenprop can assign genome properties to an organism from InterProScan annotation TSV files, Genome Properties long-form assignment files (created by the Genome Properties Perl library) or a list of InterPro consortium signature accessions downloaded into a Jupyter Notebook. Pre-calculated InterProScan results for UniProt proteomes and taxonomies can be downloaded (in signature accession list format) from the beta version of the InterPro website.

Example Data

Running InterProScan

InterProScan generates InterProScan annotation TSV files via domain annotation of an organism's proteins. Details and install instructions for InterProScan5 can be found here. For convenience, a Docker container for installing and running InterProScan5 can be found here.
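
To get a feel for what an InterProScan TSV file contains before handing it to Pygenprop, the accessions can be pulled out with the standard library. This sketch assumes the standard InterProScan 5 TSV column order, in which column 5 holds the member database signature accession and column 12, when present, holds the InterPro accession. The function read_accessions is hypothetical and is not how Pygenprop parses these files.

```python
import csv


def read_accessions(tsv_handle):
    """Collect signature and InterPro accessions from an InterProScan TSV handle."""
    signatures, interpro = set(), set()
    # QUOTE_NONE keeps quote characters inside descriptions from confusing the reader.
    for row in csv.reader(tsv_handle, delimiter='\t', quoting=csv.QUOTE_NONE):
        if len(row) < 5:
            continue  # Skip blank or truncated lines.
        signatures.add(row[4])  # Column 5: signature accession.
        if len(row) > 11 and row[11] not in ('', '-'):
            interpro.add(row[11])  # Column 12: InterPro accession, if any.
    return signatures, interpro
```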

Integrating Protein Sequences

Pygenprop can be used to extract protein sequences that provide evidence for an organism possessing a genome property. To use this feature, the organism's proteome FASTA files that were annotated by InterProScan must be opened and passed to Pygenprop. See the workflow below for more details on using this feature.

Micromeda Files

Pygenprop can generate Micromeda files, a new SQLite3-based pathway annotation storage format that allows the Genome Properties assignments and supporting information of multiple organisms to be transferred together. Examples of supporting information include the InterProScan annotations and protein sequences that support assignments. These files allow complete Genome Properties datasets to be transferred between researchers and software applications.
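
Because Micromeda files are SQLite3-based, they can be opened with any SQLite client. The sketch below lists the tables in such a file and counts their rows using Python's built-in sqlite3 module. It makes no assumption about the table names Pygenprop uses, and the function summarize_sqlite_file is hypothetical.

```python
import sqlite3


def summarize_sqlite_file(path):
    """Return a {table_name: row_count} mapping for every table in an SQLite file."""
    connection = sqlite3.connect(path)
    try:
        tables = [row[0] for row in connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
        return {table: connection.execute(
                    'SELECT COUNT(*) FROM "{}"'.format(table)).fetchone()[0]
                for table in tables}
    finally:
        connection.close()
```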

Usage

Programmatic Property Comparison With Jupyter Notebooks

A typical use case for Pygenprop involves a researcher who wants to compute and compare Genome Properties between organisms of interest. For example, a researcher who has discovered a novel bacterium may want to compare its functional capabilities to those of other bacteria within the same genus. The researcher could start the analysis by opening a Jupyter Notebook and directly importing pre-calculated InterProScan annotations for the novel and reference genomes. Below is example code for comparing virulence genome properties of E. coli K12 and O157:H7.

An interactive Jupyter Notebook with an extended version of this workflow, with outputs for each step, can be found here. Full API documentation is available here.

from sqlalchemy import create_engine

from pygenprop.results import GenomePropertiesResults, GenomePropertiesResultsWithMatches, \
    load_assignment_caches_from_database, load_assignment_caches_from_database_with_matches
from pygenprop.database_file_parser import parse_genome_properties_flat_file
from pygenprop.assignment_file_parser import parse_interproscan_file, \
    parse_interproscan_file_and_fasta_file

# Compare Properties and Steps Across Organisms 
# =============================================

# Parse the flatfile database
with open('genomeProperties.txt') as file:
    tree = parse_genome_properties_flat_file(file)

# Parse InterProScan files
with open('E_coli_K12.tsv') as ipr5_file_one:
    cache_1 = parse_interproscan_file(ipr5_file_one)

with open('E_coli_O157_H7.tsv') as ipr5_file_two:
    cache_2 = parse_interproscan_file(ipr5_file_two)

# Create results comparison object
results = GenomePropertiesResults(cache_1, cache_2, 
                                  properties_tree=tree)

# Get properties with differing assignments
differing_results = results.differing_property_results

# Get property by identifier
virulence = tree['GenProp0074']

# Iterate to get the identifiers of 
# child properties of virulence
types_of_vir = [genprop.id for genprop in virulence.children]

# Get assignments for virulence properties
virulence_assignments = results.get_results(*types_of_vir, 
                                            steps=False)

# Get percentages of virulence steps assigned 
# YES, NO, and PARTIAL per organism
virulence_summary = results.get_results_summary(*types_of_vir, 
                                                steps=True, 
                                                normalize=True)

# Analyze InterProScan Annotations and Protein Sequences
# That Support Genome Properties Across Organisms
# ==================================================

# Parse InterProScan files and FASTA files
with open('./E_coli_K12.tsv') as ipr5_file_one:
    with open('./E_coli_K12.faa') as fasta_file_one:
        extended_cache_one = parse_interproscan_file_and_fasta_file(ipr5_file_one, fasta_file_one)

with open('./E_coli_O157_H7.tsv') as ipr5_file_two:
    with open('./E_coli_O157_H7.faa') as fasta_file_two:
        extended_cache_two = parse_interproscan_file_and_fasta_file(ipr5_file_two, fasta_file_two)

# Create results comparison object with InterProScan match information 
# and protein sequences
extended_results = GenomePropertiesResultsWithMatches(extended_cache_one,
                                                      extended_cache_two,
                                                      properties_tree=tree)

# Get lowest E-value matches for each Type III Secretion System component for E_coli_O157_H7.
extended_results.get_property_matches('GenProp0052', sample='E_coli_O157_H7', top=True)

# Get all matches for step 22 of Type III Secretion for E. coli K12. 
extended_results.get_step_matches('GenProp0052', 22, top=False, sample='E_coli_K12')

# Write FASTA file containing the sequences of the lowest E-value matches for 
# Type III Secretion System component 22 across both organisms.
with open('type_3_step_22_top.faa', 'w') as out_put_fasta_file:
    extended_results.write_supporting_proteins_for_step_fasta(out_put_fasta_file, 
                                                              'GenProp0052', 
                                                              22, top=True)

# Create a SQLAlchemy engine object for writing a Micromeda file.  
engine_proteins = create_engine('sqlite:///ecoli_compare.micro')
# Write the results to the file.
extended_results.to_assignment_database(engine_proteins)

# Load results from a Micromeda file with protein sequences.
assignment_caches_with_proteins = load_assignment_caches_from_database_with_matches(engine_proteins)
results_reconstituted_with_proteins = GenomePropertiesResultsWithMatches(*assignment_caches_with_proteins, 
                                                                         properties_tree=tree)

Command-line Interface (CLI)

The command-line interface of Pygenprop is used primarily for generating and working with Micromeda files. It provides four sub-commands and is installed when Pygenprop is installed.

usage: pygenprop [-h] {build,merge,info,preprocess} ...

A command-line interface for generating and manipulating Micromeda pathway annotation files.

positional arguments:
  {build,merge,info,preprocess}
                        Available Sub-commands
    build               Generate a Micromeda file containing pathway annotations for one or more genomes. Supporting InterProScan and protein sequence information can also be optionally incorporated.
    merge               Merge multiple Micromeda files into a single output Micromeda file.
    info                Summarize the contents of a Micromeda file.
    preprocess          Replace FASTA header accessions with a numeric identifiers.

optional arguments:
  -h, --help            show this help message and exit

The build command is used to generate Micromeda files. It requires a copy of genomeProperties.txt. InterProScan TSV files are used as input.

pygenprop build -d ./genomeProperties.txt -i *.tsv -o ecoli_genomes_properties.micro

The build command has a -p flag that is used to add protein sequences to the output Micromeda file. With this flag active, Pygenprop searches the FASTA files that were scanned by InterProScan for proteins that support genome property steps and adds them to the output Micromeda file. The FASTA files must be in the same directory as the InterProScan files and share the same basename (i.e., the filename without its file extension).

data/
├── ecoli_one.faa
├── ecoli_one.tsv
├── ecoli_two.faa
└── ecoli_two.tsv

For the above directory structure the following shell command would be used to generate a Micromeda file that integrates protein sequences:

pygenprop build -d ./genomeProperties.txt -i *.tsv -o ecoli_genomes_properties.micro -p
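
Since the -p flag relies on each InterProScan file having a FASTA file with the same basename beside it, it can help to check the pairing before running the build. This is a small sketch using only the standard library; the function find_unpaired_tsv_files is hypothetical, and the .faa extension is assumed from the directory listing above.

```python
from pathlib import Path


def find_unpaired_tsv_files(directory, fasta_extension='.faa'):
    """Return the TSV files in directory that lack a FASTA file with the same basename."""
    return sorted(tsv.name for tsv in Path(directory).glob('*.tsv')
                  if not tsv.with_suffix(fasta_extension).exists())
```

An empty list from find_unpaired_tsv_files('data') means every TSV file has a matching FASTA file.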

The merge command is used to merge multiple Micromeda files into a single output Micromeda file. It also requires a copy of genomeProperties.txt.

pygenprop merge -d ./genomeProperties.txt -i *.micro -o merged_ecoli_genomes_properties.micro

The info command is used to get a summary of a Micromeda file's contents.

pygenprop info -i merged_ecoli_genomes_properties.micro

    The Micromeda file contains the following:

    Samples: 2
    Property Assignments: 2572
    Step Assignments: 4644
    InterProScan Matches: 2843
    Protein Sequences: 1887

Documentation

Documentation can be found on Read the Docs.

Troubleshooting

Please report issues to the issues page.

Licence

Apache License 2.0

Current Contributors

Lee Bergstrand

Past Contributors

N/A