Project Overview
Python modules to generate BEL resource documents.
See the wiki for information about dataset objects and how to add new datasets.
Resource Generator
NOTE: Further details on resource generation are available in this Guide.
To run:
'./gp_baseline.py -n [dir]'
'[dir]' is the working directory to which data is downloaded and in which the new files are generated.
gp_baseline.py runs in several phases:
1. data download
2. data parse (and save as pickled objects)
3. build '.belns' (namespace) files
4. (phase removed)
5. build '.beleq' (equivalence) files
The pipeline can be started and stopped at any phase using the '-b' and '-e' options. This enables re-running the pipeline on stored data downloads and pickled data objects.
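For example, assuming '-b' and '-e' take phase numbers as arguments, a command like './gp_baseline.py -n [dir] -b 3 -e 5' would skip the download and parse phases and rebuild only the '.belns' and '.beleq' files from previously pickled data objects.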
- gp_baseline.py - acts as the driver for the resource-generator.
- configuration.py - configures the datasets to be included in the resource-generation pipeline, including initialization of the dataset objects, specification of a download URL, and association with a parser.
- parsers.py - contains parsers for each dataset (a generic sketch of the parser pattern follows this list).
- parsed.py - acts as a storage module. Takes the data handed to it by the parser and stores it in a DataObject. Currently all of the data used in this module is kept in memory. See the bug tracker about a possible solution to this memory constraint.
- datasets.py - each DataObject class is defined in this module. See the wiki for information about DataObject classes, methods, and attributes.
- equiv.py - takes a DataObject as a parameter and uses that object's defined functions to generate the new '.beleq' files.
- common.py - defines some common functions used throughout the program, namely a download() function and a function that opens and reads a gzipped file (sketched after this list).
- constants.py - any constants used throughout the program are defined in this module.
- rdf.py - loads each pickled dataset object generated by phase 2 of gp_baseline and generates triples for each namespace 'concept', including id, preferred label, synonyms, concept type, and equivalences (see the rdflib sketch after this list).
- belanno.py - generates 'belanno' files outside of the main gp_baseline pipeline (gp_baseline does, however, download and create the pickled data objects for the annotation data sets).
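A hypothetical illustration of the parser pattern described above. The class name, input format, and field names are invented for the example; see parsers.py for the real implementations:

    import csv

    class ExampleTSVParser:
        """Reads a downloaded tab-separated file and yields one dict per record."""

        def __init__(self, filename):
            self.filename = filename

        def parse(self):
            with open(self.filename, newline='', encoding='utf-8') as f:
                for row in csv.DictReader(f, delimiter='\t'):
                    # each record is handed to parsed.py as a plain dictionary
                    yield {'id': row['ID'], 'name': row['NAME']}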
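A minimal sketch of the two helpers common.py is described as providing; the names and signatures here are illustrative assumptions, not the module's actual API:

    import gzip
    import urllib.request

    def download(url, path):
        # fetch a remote resource and store it in the working directory
        urllib.request.urlretrieve(url, path)

    def gzip_lines(path):
        # open a gzipped file and yield its contents line by line as text
        with gzip.open(path, 'rb') as f:
            for line in f:
                yield line.decode('utf-8')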
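Illustrative only: how a namespace concept might be expressed as RDF triples with rdflib. The vocabulary URIs and property choices below are assumptions made for the example, not necessarily those used by rdf.py:

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF

    SKOS = Namespace('http://www.w3.org/2004/02/skos/core#')
    BELV = Namespace('http://www.openbel.org/vocabulary/')

    g = Graph()
    concept = URIRef('http://www.openbel.org/bel/namespace/example/12345')
    g.add((concept, RDF.type, BELV.GeneConcept))                 # concept type
    g.add((concept, SKOS.prefLabel, Literal('example label')))   # preferred label
    g.add((concept, SKOS.altLabel, Literal('example synonym')))  # synonym
    g.add((concept, SKOS.exactMatch,
           URIRef('http://www.openbel.org/bel/namespace/other/67890')))  # equivalence
    print(g.serialize(format='nt'))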
Change-Log
- change_log.py - a separate module from gp_baseline. This module uses two sets ('old' and 'new') of pickled data objects generated by gp_baseline.py and outputs a JSON dictionary mapping old terms to either their replacement terms or the string 'withdrawn'. This dictionary can be consumed by an update script to resolve lost terms in older versioned BEL documents.
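A hypothetical sketch of how an update script might consume that dictionary; the file name and lookup logic are invented for the example:

    import json

    # mapping produced by change_log.py: old term -> replacement term or 'withdrawn'
    with open('change_log.json') as f:
        change_log = json.load(f)

    def resolve(term):
        replacement = change_log.get(term, term)  # terms not in the log are unchanged
        if replacement == 'withdrawn':
            return None  # the term no longer exists; flag it for manual review
        return replacement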
Resource files
These scripts are used to generate additional resource files - see openbel-framework-resources
- orthology.py - creates the gene-orthology.bel file; requires the pickled data objects from the gp_baseline run.
- gene_scaffolding.py - creates the gene_scaffolding_document_9606_10090_10116.bel; requires HGNC, MGI, and RGD '.belns' files generated from the gp_baseline run.
- go_complexes_to_bel.py - creates a '.bel' file with statements mapping Gene Ontology (GO) complexes to their human, mouse, and rat complex components based on data from GO. Uses the 'testing' version of the GOCC complexes '.belns' file and the current gene association files from GO. Output not currently used for openbel-framework-resources.
Dependencies
To run these Python scripts, the following software must be installed:
- Python 3.x - modules are written in Python 3.2.3
- lxml - used to parse various XML documents.
- rdflib - used by rdf.py
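Assuming a standard Python environment, the third-party packages above can typically be installed with pip, e.g. 'pip install lxml rdflib'.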