Sage-Bionetworks / prov-service

lightweight implementation of the Synapse Activity services, based on the PROV spec

Identify database technology #1

Closed tschaffter closed 5 years ago

tschaffter commented 5 years ago

Here are two open-source graph database technologies that would be suitable to represent provenance graphs:

More, which I had never heard of before, are listed here: https://en.wikipedia.org/wiki/Graph_database

GraphQL and Neo4j often appear at the top of results like "the 10 best graph databases". The page below compares the Node implementations of the two: https://npmcompare.com/compare/graphql,neo4j

GraphQL would definitely be the one to pick between the two (289,982 daily downloads vs. 130, updated every 17 days on average vs. every 3 months, etc.).

First prototype of PHC Collaboration Portal

@jaeddy Unless you have experience with GraphQL, an alternative is to use any database we are familiar with for the mid-April prototype. For a second or third prototype, it would definitely bring value to show that users can search the graph using patterns like "Dataset in format X > State generated by Tool Y > Output in format Z".

tschaffter commented 5 years ago

https://medium.freecodecamp.org/how-to-set-up-a-graphql-server-using-node-js-express-mongodb-52421b73f474

jaeddy commented 5 years ago

@tschaffter I made some changes to the application yesterday, now using mongoengine instead of a more native MongoDB client (pymongo). This makes it more difficult to use the pre-built model classes generated from the OpenAPI yaml — not too bad, just a little more work to make sure everything stays in sync. Fortunately the interface for defining/schematizing documents with mongoengine is pretty friendly.

I've gotten a basic create_activity method working, but the next step will be to look at the graphene-mongo library to potentially hook up a graph API endpoint.

We can sync up next week on how to get the two systems talking to each other. Feel free to clone/fork and play around with the code as well (though I know you're busy with other things)!

tschaffter commented 5 years ago

We have decided to go for GraphQL, as the project is much more active than Neo4j (see https://npmcompare.com/compare/graphql,neo4j).

jaeddy commented 5 years ago

Comments from @Pawel-Madej:

1. MongoDB + graphene-mongo (Python) + GraphQL

Mongo works perfectly for storing data in JSON format, but relations between data cannot be reflected. I used a simple model (Nobel Prize data, available here: http://api.nobelprize.org/v1/) to stress test this approach. The model covers Prize, Laureate and Country entities. The relations between entities are: Laureates -> Prizes -> Countries; Prizes -> Laureates -> Countries.

Observation & Comments:

  • the model looks perfect in MongoDB, but when I tried to get the same view of the data through GraphQL, I got an error. It works well for data nested one level deep, i.e. Laureates -> Prizes, but didn't work for deeper levels (I got a list error that I wasn't able to solve). So, in my case, the second-level entity (Countries) could only be represented as a StringField().
  • another problem I ran into was counts: in GraphiQL I wasn't able to get a simple count of, e.g., the number of Nobel Prizes in a single year. What's more, filtering was not working for IntField() values; it worked only for StringField().

My takeaway so far is that MongoDB works perfectly for storing JSON/XML data, but it's really hard to perform any kind of activity (analysis / filtering / consuming) on the stored data. Aggregations are almost impossible. What's more, hierarchical data (two or more levels deep) caused problems on my side that I was not able to solve in Python. Finally, relationships between data entities are not supported in this approach - to be determined whether we need those.

  • in this approach you do have to represent the data model in the Python code (i.e. schema, model). Any change on the Mongo side needs to be reflected in the Python code, on the application side.
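To make the nesting issue concrete, here is the document shape in question (field names are approximated from the Nobel Prize API and should be treated as assumptions): one level of nesting (Laureates -> Prizes) resolved fine through GraphQL, while the second level (Countries) had to be flattened to a string.

```python
# Illustration of the nesting depth discussed above, using the document
# shape of the Nobel Prize API (field names approximated from
# http://api.nobelprize.org/v1/ -- treat them as assumptions).
laureate = {
    "firstname": "Marie",
    "surname": "Curie",
    "prizes": [                      # level 1: Laureates -> Prizes
        {
            "year": "1903",
            "category": "physics",
            "country": {             # level 2: Prizes -> Countries
                "name": "France",
                "code": "FR",
            },
        }
    ],
}

# One level of nesting (Laureates -> Prizes) resolved fine via GraphQL...
first_prize = laureate["prizes"][0]

# ...but the second level is what ended up flattened to a StringField():
country_as_string = str(first_prize["country"])
```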

2. Neo4j + Cypher + py2neo + APOC

Neo4j is a graph database that stores data in its own closed format. I used the movies DB (see B.2.1 Basic example -> https://neo4j.com/docs/operations-manual/current/tutorial/import-tool/) to stress test this approach.

Observation & Comments:

  • the dataset I used in this experiment is tiny, so it's really hard to say anything about the performance of this solution.
  • I was not able to open the web-based application for interacting with Neo4j: I ran the database in an AWS environment, and port 7474 (which exposes the Neo4j browser) is blocked on RSI. In practice this meant I had to use the command line for every action, and Neo4j turned out to be very limited there; MongoDB was much more powerful.
  • the information model looks as follows: MOVIES -> [roles] <- ACTORS.
  • Neo4j keeps data in its own format, so you need to import any dataset you want to interact with. Without the web-based app this is not so easy; I used the neo4j-admin command line procedure with CSV files.
  • in my opinion, the biggest advantage of this solution is how Neo4j handles data relationships. It's simple and works smoothly. What was impossible in Mongo is standard (i.e. by design), out-of-the-box functionality here.
  • any kind of querying, aggregation, matching, filtering, etc. is efficient, easy, and powerful. I even found an article describing how data scientists can use it for analysis: https://towardsdatascience.com/link-prediction-with-neo4j-part-2-predicting-co-authors-using-scikit-learn-78b42356b44c
  • it is also easy to visualize the data using, e.g., the d3.js library. I will share sample code on GitHub.
  • the most efficient way to use Neo4j and Cypher is the APOC library (Awesome Procedures On Cypher), a set of 500+ functions for Cypher and Neo4j, including import functions such as apoc.load.xls, apoc.load.csv, and apoc.load.json.
  • to integrate it with Python you need the py2neo framework, and this is where the problems start to appear 🙂. The framework doesn't cover APOC 1:1, which means that not all of APOC's 500 functions will work from py2neo. I wasn't able to establish the exact difference, so a standard APOC Cypher call that works in the browser might fail when executed from Python.
  • to run my Python app I used the Bottle web server, listening on port 8080.

My Neo4j sample app works here: http://10.158.43.79:8080/ (works only from RCN)

  • one of the advantages is that you don't have to represent the data model in the Python / application code. You just run queries, which are executed directly on the database side.
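To make the py2neo/APOC caveat concrete, here is a minimal sketch, assuming py2neo is installed and a Neo4j instance with APOC is reachable (the URL, label, and data shape are hypothetical). Whether a given APOC procedure actually executes through py2neo is exactly the open question raised above.

```python
# Hypothetical APOC-based import: apoc.load.json streams a JSON document,
# and standard Cypher merges the records into the graph. The $url value
# and the Movie label are illustrative only.
APOC_IMPORT_QUERY = """
CALL apoc.load.json($url) YIELD value
UNWIND value.movies AS m
MERGE (movie:Movie {title: m.title})
"""

def import_movies(graph, url):
    # graph would be a py2neo handle, e.g.:
    #   from py2neo import Graph
    #   graph = Graph("bolt://localhost:7687", auth=("neo4j", "secret"))
    return graph.run(APOC_IMPORT_QUERY, url=url)
```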
tschaffter commented 5 years ago

The Neo4j sample app works in Chrome but not in Firefox (66.0.4, 64-bit).


This page compares different aspects of GraphQL and Neo4j (Disclaimer: I don't have expertise in graph databases or graph query languages): https://npmcompare.com/compare/graphql,neo4j

The comparison strongly suggests that GraphQL has a much more active community and better support than Neo4j.

@Pawel-Madej You reported issues with GraphQL/MongoDB that seem to originate from MongoDB. Is MongoDB the database usually used to store objects queried through GraphQL?

jaeddy commented 5 years ago

@tschaffter — I think comparing Neo4j to GraphQL is a bit more "apples vs. oranges" than one might expect. From this post:

What’s the relation between GraphQL and graph databases? Not much, really, GraphQL doesn’t have anything to do with graph databases like Neo4j. The “graph” part comes from the idea of crawling across your API graph by using fields and subfields; while “QL” stands for “query language”.

So, while GraphQL can be a powerful tool to more efficiently retrieve specific pieces of data across multiple APIs/paths, Neo4j is a framework specifically designed to store and serve data that is relationship-centric. Combined with (what seems to be) a more expressive and functional query syntax in Cypher, Neo4j seems like the optimal choice to me.
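As an illustration of that expressiveness, the provenance pattern mentioned earlier in the thread ("Dataset in format X > State generated by Tool Y > Output in format Z") maps almost directly onto a Cypher path match. This is only a sketch: the labels, relationship types, and properties are hypothetical, loosely echoing PROV's entity/activity vocabulary.

```python
# Hypothetical Cypher for the multi-hop provenance pattern: an activity
# USED an input entity and GENERATED an output entity.
PROVENANCE_PATTERN = """
MATCH (input:Entity {format: $in_format})
      <-[:USED]-(a:Activity {tool: $tool})
      -[:GENERATED]->(output:Entity {format: $out_format})
RETURN input, a, output
"""

# Parameters a client might bind when running the query:
params = {"in_format": "CSV", "tool": "ToolY", "out_format": "JSON"}
```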

In terms of browser compatibility, we should be able to find more lightweight libraries for visualization (or, alternatively, just use a Neo4j client library to execute queries and then visualize with D3.js or some other library).

tschaffter commented 5 years ago

@jaeddy @Pawel-Madej @lukasz-rakoczy What about using Amazon Neptune as the graph database? https://aws.amazon.com/neptune/

lukasz-rakoczy commented 5 years ago

Although I usually prefer to use managed services that fit into the cloud ecosystem where the application is hosted, in this case I don't think Neptune is the best option for us.

Here are some downsides of Neptune:

  1. There seem to be no easy options to visualize data stored in Neptune. From what I've been able to find, there are some commercial solutions provided by AWS partners, but it might not be easy to evaluate them.
  2. Neptune's pricing works out to roughly $300 per month to run a single basic instance of the DB.
  3. The solution would be locked into AWS, because Neptune can't be deployed on other clouds.

Now that I know more about what we are trying to achieve, I think a standalone Neo4j instance (even the Community edition) would be a good fit for the provenance service (even without the custom "adapter" service we currently have written in Python):

  1. It exposes an HTTP API that can be used to feed the database with data from the Collaboration Portal service.
  2. It can be used directly from the D3 library to visualize graphs and query the data.
  3. We can run multiple instances for just the cost of the EC2 machines running them, and also run the DB locally during development.
  4. There are libraries for accessing Neo4j in almost all modern programming languages.
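As a rough sketch of point 1, a client can feed the database by POSTing Cypher statements to Neo4j's transactional HTTP endpoint. The /db/data/transaction/commit path shown is the Neo4j 3.x form, and the host, port, label, and credentials handling are assumptions.

```python
import json
from urllib import request

def cypher_payload(statement, parameters=None):
    """Build the JSON body the transactional HTTP API expects."""
    return {"statements": [{"statement": statement,
                            "parameters": parameters or {}}]}

payload = cypher_payload("CREATE (e:Entity {name: $name})",
                         {"name": "dataset-1"})
body = json.dumps(payload).encode("utf-8")

def post_to_neo4j(url="http://localhost:7474/db/data/transaction/commit"):
    # Requires a running Neo4j instance (and an Authorization header if
    # auth is enabled); shown here only to illustrate the call shape.
    req = request.Request(url, data=body,
                          headers={"Content-Type": "application/json"})
    return request.urlopen(req)
```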

There are also some limitations of Neo4J we need to consider:

  1. Neo4j Enterprise is expensive. I'm not sure exactly how much, but whenever I search for this information I get numbers close to those listed here: https://blog.igovsol.com/2018/01/10/Neo4j-Commercial-Prices.html
  2. Neo4j instances will need to be maintained by us. There are providers offering Neo4j as a managed service, but I think they use the Enterprise edition, so you pay the original licence plus the vendor cost.
  3. The Neo4j Community security model is limited; we need to determine whether it is enough for our needs or whether custom development is required.

Maybe it would be worth trying to run a Neo4j instance, feed it with data through its HTTP API, and then see how the data can be visualized with D3? @tschaffter @jaeddy @Pawel-Madej what do you think?

tschaffter commented 5 years ago

@lukasz-rakoczy Thanks for this insight. This is great!

System Properties Comparison Amazon Neptune vs. Neo4j: https://db-engines.com/en/system/Amazon+Neptune%3BNeo4j

@lukasz-rakoczy wrote:

It can be directly use from the D3 library to visualize graphs and query the data. There are libraries for almost all modern programming languages for accessing Neo4J.

The fact that Neo4j is more mature and has a larger community than Amazon Neptune (released in 2017) makes me more comfortable using it, as several components that we need are probably already available somewhere.

Addressing #6 should help us implement the different components related to provenance.

tschaffter commented 5 years ago

@jaeddy @Pawel-Madej Have we settled on adopting Neo4j + Cypher + py2neo + APOC?

jaeddy commented 5 years ago

@tschaffter I think yes to Neo4j and Cypher. I'm using py2neo to get a proof-of-concept working, but I'm not sure if Python is the ideal long term solution. I'm hoping to share more representative examples with @lukasz-rakoczy and @Pawel-Madej within the next week to get their feedback on the best path forward.

Pawel-Madej commented 5 years ago

> @tschaffter I think yes to Neo4j and Cypher. I'm using py2neo to get a proof-of-concept working, but I'm not sure if Python is the ideal long term solution. I'm hoping to share more representative examples with @lukasz-rakoczy and @Pawel-Madej within the next week to get their feedback on the best path forward.

@jaeddy I was able to create a demo example (I've used the movie concept, sorry for that) available here: http://10.158.43.79:8080/ using Neo4j+Cypher, Python and Bottle (python web framework).

I will share shortly code and description of the basic functionality, but works great for both: collecting data (and feeding the graph database) and sharing the data via API.

The biggest advantage from my side (in my experience) was the flexibility of the data model on the Neo4j side: you don't have to hard-code it on the graph database or application side; just implement the request handlers for POST and GET, and it works :)
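A minimal sketch of that flexibility, with the Bottle route wiring and the py2neo connection omitted (the Movie label and request body are hypothetical): because the JSON body's keys become node properties directly, a change to the data model needs no application-code change.

```python
import json

def create_node_query(label, properties):
    """Build a parameterized Cypher CREATE from an arbitrary JSON body."""
    # In the Bottle POST handler this would receive the request's JSON
    # body and then be passed to a py2neo Graph.run() call.
    return f"CREATE (n:{label} $props) RETURN n", {"props": properties}

# e.g. handling POST /movies with body {"title": "Heat", "year": 1995}:
body = json.loads('{"title": "Heat", "year": 1995}')
query, params = create_node_query("Movie", body)
```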

btw. I used the PHC-RSI infrastructure for this, so you need to be logged into RCN to access this environment and the code repository

btw2. here is link to code repository: https://github.roche.com/madejp/neo4j-python

Pawel-Madej commented 5 years ago

> @jaeddy @Pawel-Madej Have we settled on adopting Neo4j + Cypher + py2neo + APOC?

Yes, I did. I've shared my experience in the post above.

tschaffter commented 5 years ago

Closing this ticket since we have settled on the technology to use.

jaeddy commented 5 years ago

Thanks, @Pawel-Madej! I haven't encountered Bottle before — it's nice that you were able to create a basic front end UI for the app. I haven't quite figured out a good option for doing so with Flask yet (though it might make more sense to let the portal handle UI/viz stuff with JavaScript).

I've been able to get a (mostly) working version of the Provenance service using Neo4j and py2neo: https://github.com/Sage-Bionetworks/prov-service/tree/neo4j-endpoint

When you clone, install, and start the app (the synprov command should start the Flask server), it does some initial population of the database. The Cypher querying (through the Neo4j browser or other clients) works as expected.

And at least some operations work through the REST interface as well. For example, I can retrieve all entities that were USED by a particular activity.
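For reference, a query of roughly this shape could back that endpoint; the labels and id property are assumptions, with only the USED relationship type taken from the example above.

```python
# Hypothetical Cypher behind "all entities USED by a particular activity";
# the Activity/Entity labels and the id property are illustrative.
USED_BY_ACTIVITY = """
MATCH (a:Activity {id: $activity_id})-[:USED]->(e:Entity)
RETURN e
"""

def entities_used_by(graph, activity_id):
    # graph would be a py2neo Graph handle to the Neo4j back end
    return graph.run(USED_BY_ACTIVITY, activity_id=activity_id).data()
```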

The format/structure of the responses will probably need to be modified based on whatever visualization library we select. I'm still not sure if we ultimately want a REST endpoint, or if it makes more sense for the visualization component to query the Neo4j back end directly... However, I'm somewhat in favor of keeping a lot of the Cypher logic on the side of the provenance server. I think we can identify a lot of the most common queries and abstract these to simpler paths, allowing the client (the portal, in this case) to perform more 'user friendly' operations.

In terms of next steps/thoughts...

Pawel-Madej commented 5 years ago

Hi @jaeddy,

As mentioned earlier, our current experience around visualization is built on: