EBISPOT / OLS

Ontology Lookup Service from SPOT at EBI
http://www.ebi.ac.uk/ols
Apache License 2.0

Creating a Neo4j instance of an ontology #104

Closed dhimmel closed 8 years ago

dhimmel commented 8 years ago

Introduction

I've always found it a real pain for computational biologists to interact with ontologies in Python. The ontology community seems focused on Java products, and the file formats (OBO/OWL) are super complex. I think in terms of networks and am most comfortable reasoning over ontologies using NetworkX or Neo4j. So when I stumbled upon this presentation, I was super excited about easily loading ontologies into Neo4j.

Results

I was amazed how easy it was to get the Gene Ontology up and running in Neo4j using this codebase and Docker:

git clone git@github.com:EBISPOT/OLS.git 
cd OLS
export OLS_HOME=`pwd`/ols-home
mvn clean install
java -Xmx10g -jar -Dspring.profiles.active=go ols-apps/ols-neo4j-app/target/ols-neo4j-app.jar
docker pull neo4j:3.0.3
docker run \
  --publish=7474:7474 \
  --volume=$OLS_HOME/neo4j:/data/databases/graph.db \
  --env=NEO4J_AUTH=none \
  --env=NEO4J_dbms_allowFormatMigration=true \
  neo4j:3.0.3

Subsequently, the Neo4j server was up and running at http://localhost:7474. I was able to run basic Cypher queries such as finding all subterms of regulation of protein transport (GO:0051223):

MATCH path = (n:GO)<-[:SUBCLASSOF*..]-()
WHERE n.obo_id = 'GO:0051223'
WITH nodes(path) AS nodes
UNWIND nodes AS node
RETURN DISTINCT node.obo_id, node.label
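
For anyone who prefers the command line to the Neo4j Browser, the query above can also be sent over HTTP. This is a minimal sketch, not something from the OLS codebase: it assumes the container above is still running with authentication disabled and that Neo4j's transactional HTTP endpoint is reachable at /db/data/transaction/commit on port 7474. NEO4J_ENDPOINT and RUN_QUERY are hypothetical variable names introduced here; unless RUN_QUERY=1 is set, the script only prints the JSON payload and never contacts a server.

```shell
set -eu

# Assumption: transactional HTTP endpoint of the local Neo4j container started above.
endpoint="${NEO4J_ENDPOINT:-http://localhost:7474/db/data/transaction/commit}"

# The subterm query from above, wrapped in the JSON envelope the endpoint expects.
payload=$(cat <<'EOF'
{"statements": [{"statement": "MATCH path = (n:GO)<-[:SUBCLASSOF*..]-() WHERE n.obo_id = 'GO:0051223' WITH nodes(path) AS nodes UNWIND nodes AS node RETURN DISTINCT node.obo_id, node.label"}]}
EOF
)
echo "$payload"

# Only talk to the server when explicitly asked to (RUN_QUERY is a made-up switch).
if [ "${RUN_QUERY:-0}" = "1" ]; then
  curl --silent --show-error \
    --header 'Content-Type: application/json' \
    --header 'Accept: application/json' \
    --data "$payload" \
    "$endpoint"
fi
```

Running it as `RUN_QUERY=1 sh query.sh` (the file name is arbitrary) should return the matching rows as JSON if the server is up.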

Just wanted to thank the OLS team for this amazing development!

Do you allow public access to the neo4j instances hosted by EBI? For example, we host a public instance of our network for drug repurposing called Hetionet at https://neo4j.het.io.

dhimmel commented 8 years ago

If you want to use Neo4j 2.3, change the Docker commands to:

docker pull neo4j:2.3.5
docker run \
  --publish=7474:7474 \
  --volume=$OLS_HOME/neo4j:/data/graph.db \
  --env=NEO4J_AUTH=none \
  --env=NEO4J_ALLOW_STORE_UPGRADE=true \
  neo4j:2.3.5

dhimmel commented 8 years ago

Here's a version of the subterm query that reports, for each subterm,

  1. the minimum number of relationships (min_depth) and
  2. the total number of paths (n_paths)

to the specified node:

MATCH path = (n:GO)<-[:SUBCLASSOF*..]-()
WHERE n.obo_id = 'GO:0051223'
WITH nodes(path) AS nodes, n
UNWIND nodes AS node
WITH DISTINCT node AS node, n
RETURN 
  node.obo_id AS identifier,
  node.label AS name,
  length(shortestPath((n)<-[:SUBCLASSOF*..]-(node))) AS min_depth,
  size((n)<-[:SUBCLASSOF*..]-(node)) AS n_paths
ORDER BY min_depth, name

Very cool!

dhimmel commented 8 years ago

Announcing the Hetontology Project

I created a repository to import several ontologies into a compressed Neo4j database. The goal is to allow individuals an easy path to deploy an ultimate Neo4j ontology instance. I hope to add configurations for more ontologies going forward.

I submitted an initial pull request at https://github.com/greenelab/hetontology/pull/1. We'd be delighted if anyone from this repository would be willing to review the pull request and provide feedback or general advice.

Also let us know if there are any specific ways you'd like to be referenced for your great contribution.

LLTommy commented 8 years ago

Hi, thank you for your post!

No, we don't allow direct access to the Neo4j database. However, we hope to expose all necessary data through the API. I put in some work to create a python client for OLS, which is not finished or perfect yet - but should allow people to use OLS with python. However, this client uses the OLS API, so I am not sure if that is what you are looking for.

For people who want more functionality, we hope that the open source code and the documentation are enough to get you started setting up a local instance. Obviously, you managed to do that.

Just from reading your last post (not the code), I am not sure I understand what the goal of your new project is. Isn't OLS (or parts of it) a Neo4j ontology instance? So you want to make it easier for people to set up Neo4j using the OLS ontology 'import'?

dhimmel commented 8 years ago

> I put in some work to create a python client for OLS, which is not finished or perfect yet - but should allow people to use OLS with python

@LLTommy, I think this will be valuable and will be great for people who know python but not cypher. Let me know when it's available! What I like about the public Neo4j instance is the versatility of Cypher (you can perform most hetnet queries efficiently), the diverse language support, and the Neo4j Browser for immediate access.

> I am not sure I understand what the goal of your new project is.

I want to provide the following utility, by building on top of the ols-neo4j-app:

  1. A public read-only Neo4j instance of ontologies. Our experience hosting https://neo4j.het.io should come in handy.
  2. Neo4j Browser guides to help users learn the data model and Cypher language. As examples, see the Hetionet startup guides and drug repurposing guides (go here and click run).
  3. Inclusion of more ontologies — perhaps all active OBO Foundry ontologies.
  4. Easy local deployment of the Neo4j database, enabled through Docker and a prebuilt database archive (hetontology.db.tar.xz).
  5. Potential inclusion of Gene Ontology annotations (may be out-of-scope).
  6. Continuous integration to keep the database up-to-date.

I think these utilities will serve a growing user base as Cypher and hetnets become major technologies in bioinformatics. I anticipate the project will not take too long, but would really benefit from involvement of any interested OLS contributors -- even if just for code review, feedback, and answering ontological questions. Recognizing the contributions of everyone involved will be a priority of the project.

simonjupp commented 8 years ago

This sounds cool. I can also make the full OLS neo4j db available on our ftp. This is updated nightly and includes the full set of OBO library ontologies. I also like the Neo4j browser guide idea! Keep us posted on any developments.

dhimmel commented 8 years ago

> I can also make the full OLS neo4j db available on our ftp. This is updated nightly and includes the full set of OBO library ontologies.

That would be awesome and prevent any duplication of effort. I didn't realize you were including ontologies beyond the ones with configuration files.

> Keep us posted on any developments.

Will do!

simonjupp commented 8 years ago

Yes, the config files provided are just some examples for those wanting to experiment.

The full OLS system is also able to read in the OBO library YAML config file (http://www.obofoundry.org/registry/ontologies.yml), so we use this to pull in a whole bunch of ontologies on the live site http://www.ebi.ac.uk/ols/ontologies

dhimmel commented 8 years ago

@simonjupp nice setup. Being able to access the compressed database on the FTP site would be a HUGE convenience!

simonjupp commented 8 years ago

There's a copy of the neo4j db here for you to try. Can you let me know how you get on?

ftp://ftp.ebi.ac.uk/pub/databases/spot/ols/neo4j/

dhimmel commented 8 years ago

@simonjupp 🎆 !

I'm downloading the database using a shell script:

URL=ftp://ftp.ebi.ac.uk/pub/databases/spot/ols/neo4j/ols-neo4j-29-07-16.tar.gz
mkdir ols-neo4j.db
curl $URL | tar --extract --gzip --strip-components=1 --directory=ols-neo4j.db
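
To see what the tar flags in that pipeline do without downloading several gigabytes, here is a small local sketch. It assumes the real archive wraps the database in a single top-level folder; the folder and file names below (ols-export, neostore) are made up for illustration. An explicit `--file=-` is added so tar reads from standard input regardless of its compiled-in default.

```shell
set -eu

# Stand-in for the downloaded archive: a tarball with one top-level folder.
tmp=$(mktemp -d)
mkdir -p "$tmp/src/ols-export"
echo "dummy" > "$tmp/src/ols-export/neostore"
tar --create --gzip --file="$tmp/demo.tar.gz" --directory="$tmp/src" ols-export

# Stream the archive into tar, as the curl pipe does, dropping the first path component.
mkdir "$tmp/ols-neo4j.db"
cat "$tmp/demo.tar.gz" \
  | tar --extract --gzip --file=- --strip-components=1 --directory="$tmp/ols-neo4j.db"

# The store file now sits directly inside ols-neo4j.db rather than in ols-export/.
extracted=$(ls "$tmp/ols-neo4j.db")
echo "$extracted"
```

Because the folder name is stripped, the target directory can be mounted as the graph.db volume without an extra nesting level.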

I'm on slow wifi, so I haven't completed the download, but I was surprised by the large file size (7.4 GB). When I created an xz-compressed database containing 5 ontologies, the file was only 31 MB. I think your large file size is due to log files that aren't essential for copying the database. For example, I see a lot of neostore.transaction.db files, which can be deleted if the server isn't running.

When you create the gzip file, would it be possible to exclude files whose names start with neostore.transaction.db, as well as messages.log? If you're using the tar command line utility, there should be an easy way to ignore files that match given patterns. I think that would make the file size considerably smaller.
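
As a rough sketch of what that could look like with tar's `--exclude` option: the script below builds a throwaway directory with dummy files named like the ones mentioned above, archives it while skipping the log files, and lists the result. The directory layout is an assumption for illustration only and does not reflect how the EBI tarball is actually produced.

```shell
set -eu

# Throwaway directory that mimics a Neo4j store (empty files, illustrative names).
workdir=$(mktemp -d)
mkdir -p "$workdir/graph.db"
touch "$workdir/graph.db/neostore" \
      "$workdir/graph.db/neostore.nodestore.db" \
      "$workdir/graph.db/neostore.transaction.db.0" \
      "$workdir/graph.db/neostore.transaction.db.1" \
      "$workdir/graph.db/messages.log"

# Archive the store, skipping the transaction logs and the message log.
tar --create --gzip \
  --exclude='neostore.transaction.db*' \
  --exclude='messages.log' \
  --file="$workdir/graph.db.tar.gz" \
  --directory="$workdir" graph.db

# List what ended up in the archive.
archived=$(tar --list --gzip --file="$workdir/graph.db.tar.gz")
echo "$archived"
```

The exclude patterns are quoted so the shell passes them to tar unexpanded, and they are placed before the directory argument so they apply to it.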

LLTommy commented 8 years ago

Good point, we'll have to look into this - and we will.

simonjupp commented 8 years ago

Rebuilt tarball without the transaction files, it's still 5.7GB. This is over 150 ontologies and some of them are quite hefty.

dhimmel commented 8 years ago

I'm not sure how much you are constrained by EBI convention for FTP files. My recommendations are to:

dhimmel commented 8 years ago

Thanks everyone for the help. I'm going to close this issue to keep the issues list clean.

However, we'll make sure to update this issue with Hetontology progress.