Database framework with a RESTful API for aggregating genomic, structural, and functional data for target protein families.
The code is built using Flask, a Python web framework, and SQLAlchemy, an object-relational mapper that maps between Python objects and SQL databases.
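For orientation, here is a minimal sketch of how a model and a REST endpoint fit together, assuming the Flask-SQLAlchemy extension; the ProteinEntry model, its columns, and the /entries route are hypothetical and not the project's actual schema.

```python
# Minimal sketch of a Flask + SQLAlchemy REST endpoint (hypothetical model and
# route, not the actual TargetExplorer schema), using the Flask-SQLAlchemy extension.
from flask import Flask, jsonify
from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)
app.config["SQLALCHEMY_DATABASE_URI"] = "sqlite:///example.db"
db = SQLAlchemy(app)

class ProteinEntry(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    uniprot_ac = db.Column(db.String(20), unique=True, nullable=False)
    family = db.Column(db.String(100))

@app.route("/entries/<uniprot_ac>")
def get_entry(uniprot_ac):
    # Look up a single entry by UniProt accession and return it as JSON.
    entry = ProteinEntry.query.filter_by(uniprot_ac=uniprot_ac).first_or_404()
    return jsonify(uniprot_ac=entry.uniprot_ac, family=entry.family)

if __name__ == "__main__":
    with app.app_context():
        db.create_all()
    app.run()
```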
The database is generated by a series of scripts that gather data from various public web resources. The first script to run is DoraInit.py, which initializes the files and directory structure for a new database. It should be followed by DoraGatherUniProt.py, which retrieves a set of UniProt entries matching a given search term (see the sketch below). Subsequent scripts add data from other databases such as the PDB, NCBI Gene, cBioPortal, and BindingDB. Finally, DoraCommit.py checks that every gather script has been run since the last commit; if this check passes, the new data is committed and becomes available through the API.
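As a rough illustration of the kind of request a gather script like DoraGatherUniProt.py performs, the sketch below queries the public UniProt REST search endpoint for entries matching a search term; the endpoint URL, parameters, and example search term are assumptions for illustration, not the script's actual implementation.

```python
# Illustrative only: retrieve UniProt entries matching a search term.
# The endpoint, parameters, and example query are assumptions, not the actual
# code used by DoraGatherUniProt.py.
import requests

def fetch_uniprot_entries(search_term, fmt="xml"):
    """Fetch UniProt entries matching `search_term` from the UniProt REST API."""
    url = "https://rest.uniprot.org/uniprotkb/search"
    params = {"query": search_term, "format": fmt, "size": 500}
    response = requests.get(url, params=params, timeout=60)
    response.raise_for_status()
    return response.text

if __name__ == "__main__":
    # Hypothetical search term: reviewed human kinase entries.
    xml_data = fetch_uniprot_entries("kinase AND organism_id:9606 AND reviewed:true")
    print(len(xml_data), "bytes retrieved")
```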
A frontend web client is currently in development.
First install Anaconda - a free and awesome Python distribution for scientific and data-intensive applications. Then add the choderalab channel and install the targetexplorer package:
conda config --add channels http://anaconda.org/choderalab
conda install targetexplorer
A "crawl number" is iteratively assigned for each pass through the database generation process, from DoraGatherUniProt.py to DoraCommit.py. If the process is completed successfully, DoraCommit.py will update the "safe crawl number", which tells the API to work with only the data corresponding to that crawl number. The number of crawls to store in the database can be defined by the user, and is set by default to 5. Older crawls are deleted by DoraCommit.py.