CulturePlex / Sylva

A Relaxed Schema Graph Database Management System
sylvadb.com

Data stored in the Neo4j database #23

Open brady77 opened 9 years ago

brady77 commented 9 years ago

Hi,

SylvaDB is a great system. It fulfills my needs almost perfectly. Nevertheless, I have some issues...

You describe SylvaDB on GitHub as "... a Relaxed-Schema Graph Database Management System." The main problem with this definition is that there is a difference between the data stored inside the SylvaDB system and the data stored in the Neo4j database. I'll try to explain my issue using an example:

In the 'Schema' I create a new 'Type' with the name 'Person' and in the 'Properties' section I fill in a 'Key' called 'Full name'. In the same way I add a new type 'Movie' with a key 'Title'. Next I create new 'Allowed Relationship' using 'Person' as 'Source', 'Movie' as 'Destination' and 'ACTS_IN' as the 'Name' of the relationship.

Then I put some data into the SylvaDB. First I add 'New Person' with a key 'Full name' holding the value 'Keanu Reeves'. Then I use 'New Movie' to create a node with the key 'Title' and the value 'Matrix'. At the same time I fill in the field '<- ACTS_IN' by searching for the value 'Keanu Reeves'.

I made this detailed description just to show you what I would expect to be created in the Neo4j database (following your statement that SylvaDB is a GraphDB Management System):

  1. Node A with a 'label' = 'Person' and with a property key 'Full name' = 'Keanu Reeves'
  2. Node B with a 'label' = 'Movie' and with a property key 'Title' = 'Matrix'
  3. Relationship of 'type' = 'ACTS_IN' between node A and node B (even if your Neo4j version does not support labels, you are using the '_label' key to mimic this function, which is fine).

If I look into the Neo4j database with Cypher, I get the following data:

    Node[3]{_id:3,_label:"8",_graph:"2",Full name:"Keanu Reeves"}
    Node[4]{_id:4,_label:"9",_graph:"2",Title:"Matrix"}
    :8[1]{_id:1,_label:"8",_graph:"2"}

The first problem here is the substitution of the node '_label' value with some SylvaDB internal ID. The second problem is the substitution of the relationship type with another internal ID. I'm afraid I don't know the logic of assigning those IDs, and even if I knew it, it would be really hard for a human to translate those IDs into meaningful queries and results.
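To make the contrast concrete, here is a rough sketch of the Cypher one would like to write versus what the current storage scheme forces. The first query assumes Neo4j 2.x label syntax; the IDs '8' and '9' and the relationship type `8` are taken from the output above:

```cypher
// What one would expect to be able to write (Neo4j 2.x style):
MATCH (p:Person)-[:ACTS_IN]->(m:Movie)
WHERE p.`Full name` = 'Keanu Reeves'
RETURN m.Title;

// What the current storage scheme requires instead,
// using the internal IDs shown in the output above:
MATCH (p)-[:`8`]->(m)
WHERE p._graph = '2' AND p._label = '8' AND m._label = '9'
  AND p.`Full name` = 'Keanu Reeves'
RETURN m.Title;
```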

The need for internal IDs within the Neo4j database is obvious: SylvaDB works with objects, and those objects have to be identified somehow. But why do those IDs replace important values and/or types? From my point of view, internal IDs should be an addition to the data, not a replacement. If your intention was to make user changes simple (like changing a relationship type name, which is quite complicated within Neo4j, but not impossible), the cost is too high. Data stored in Neo4j are useless without the SylvaDB frontend. This comes to the surface as soon as the 'Queries' interface is not sufficient for certain types of queries (like getting the shortest path between nodes, etc.).

I can imagine how complicated a rework of the system could be to allow for real values/types. Nevertheless, would you consider changing the system accordingly? There is no better system to work with the Neo4j graph database than SylvaDB. It is simple, intuitive, flexible, powerful and user-friendly. And it can be even better...

Thanks for your opinion.

Petr

versae commented 9 years ago

Hi Petr,

First of all, thanks for using SylvaDB and for the kind words about it. Messages like this are why we write software :D

You are completely right. The main reason for using numeric IDs instead of the Neo4j-supported labels and types is that SylvaDB has been around since long before Neo4j 2.0, when labels were introduced and Cypher became usable. So we needed to figure out a good way to make SylvaDB feature-complete and future-proof. Back then, we felt that node types were a feature we needed, so we decided to store the SylvaDB schema type IDs as a reserved property on the nodes. To be consistent, we did the same thing for relationships and ignored the Neo4j relationship types.

On the bright side, this design allows us to plug in any other graph backend, since all we need is a property-value store on the nodes and relationships, avoiding the need to be tied to Neo4j--the industry changes faster than we can code. Furthermore, it allows node and relationship types to be changed at no cost, while in Neo4j that is still not possible.

Unfortunately, as a side effect, this made it practically impossible to use the Neo4j backend without the SylvaDB frontend.

That being said, we understand that the expressive power of our query builder is still pretty limited. We are thinking of ways to integrate algorithms, like shortestPath, into the query builder as well. But that is more a visual design challenge than a technical one.

In summary, I see three choices:

  1. Alongside our internal node types, we could store Neo4j labels if the version is 2.0+. But the cost of changing Neo4j labels, and especially relationship types, is too high. To avoid a mismatch between the internal and consistent SylvaDB _label and the Neo4j labels of a node, every time a user decided to change a type we would have to remove the old label from all the nodes and add the new label to all of them. For relationship types it is even more expensive, as we would have to recreate all the affected relationships and delete the old ones. I think we could add a SUPPORT_FOR_LABELS option in the GRAPHDATABASES setting to allow this behaviour without penalizing the advanced users that don't need it.
  2. We could provide a downloadable Cypher dump file (.cql) to restore your backend into another Neo4j instance. That dump could include the current types and allowed relationships, and embed them to properly create all the Neo4j node labels and relationship types. This, on the other hand, would be separate from the current state of the graph, but it would allow you to use the whole power of Cypher.
  3. We could allow the user to enter Cypher queries in the browser, but we'd need to parse them to inject the types and allowed relationships when needed. Parsing is not fun; plus, we would also need to avoid modifying the graph, as that could create an inconsistent state for SylvaDB. Only read-only queries would be allowed.
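For illustration, a dump file as described in option 2 might contain statements along these lines. This is only a sketch, reusing the Person/Movie example from this thread, with the internal type IDs translated back into real labels and relationship types:

```cypher
// Hypothetical excerpt of a generated .cql dump:
CREATE (p:Person {`Full name`: 'Keanu Reeves'});
CREATE (m:Movie {Title: 'Matrix'});
MATCH (p:Person {`Full name`: 'Keanu Reeves'}),
      (m:Movie {Title: 'Matrix'})
CREATE (p)-[:ACTS_IN]->(m);
```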

Well, as you see, every option has advantages and drawbacks, and it is hard to make decisions that we will not regret in the future :) We believe that option 2 is the most reasonable one in terms of implementation time. But option 3 would make a nice feature to have, especially taking into account the future SylvaDB API that we are planning.

Please, let me know your thoughts.

brady77 commented 9 years ago

Thank you for the insight, Javier.

  1. You probably know Structr. Even if the purpose of that system is completely different from SylvaDB, the way they store data is probably closer to this point. Unfortunately, I don't know how they deal with changes in the schema. Structr is much more of a "Neo4j only" system, so there is no need for it to be "DB independent"--in contrast to your strategic goals.

    I'm neither a programmer nor do I understand the architecture of information systems, so sorry if I am completely wrong, but... I came across TinkerPop Blueprints recently, which is probably something to overcome your concerns about independence. This model supports various database systems, so it may be a viable way to keep your application consistent regardless of the current DB. Just a hint...

    Well, the first option (where nodes are labeled and relationships are typed) is the closest to my needs, but probably the most complicated from your point of view.

  2. If you are able to create the dump file with labels and typed relationships, I am quite happy with it. It is actually what I do now with my funny scripting in Ruby: exporting .csv files from SylvaDB, parsing the files, extracting labels, reltypes and properties, and generating Cypher commands (LOAD CSV WITH HEADERS FROM). So this option would spare me the scripting. Nevertheless, it is still a little bit inconvenient. The better option is certainly to talk to the same database from different frontends.
  3. Cypher within SylvaDB is a nice idea. It would provide quick insights into the stored data with unlimited query complexity. The question is what the best format for the results would be (so they can be reused elsewhere).

Let me suggest another option:

  1. You keep the current approach of labeling nodes and rel types with internal IDs, but you also add new 'system properties' for a 'named node label' and a 'named rel type'. Those properties would be updated from the current schema naming. The advantage is that you avoid any difficulties with deleting/recreating relationships when a user changes the schema (you still rely on your IDs--nothing changes here), and at the same time you provide real label names and relationship type names for external Cypher queries.
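With such properties in place, an external read-only Cypher query could use real names directly. A sketch, where `_label_neo4j` and `_type_neo4j` are only suggested property names, not anything SylvaDB currently stores:

```cypher
// Select by human-readable names stored as system properties,
// while SylvaDB keeps relying on its internal IDs:
MATCH (p)-[r]->(m)
WHERE p._graph = '2'
  AND p._label_neo4j = 'Person'
  AND r._type_neo4j  = 'ACTS_IN'
  AND m._label_neo4j = 'Movie'
RETURN p.`Full name`, m.Title;
```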

It is my pleasure to discuss this with you, Javier.

Petr.

PS: Just a note regarding the Neo4j version: now I see why you are still on 1.x... you don't need the node labels, so why migrate, right?

versae commented 9 years ago

Sure. Thank you, Petr, for the healthy discussion :+1:

  1. I know Structr, although I don't know if they manage schema migration/evolution. TinkerPop is a well-known player as well. In fact, SylvaDB already has support for the Blueprints API, so in theory we could plug any Blueprints-compliant backend into SylvaDB. With one exception: queries, which would need to generate Gremlin instead of Cypher. Still, the schema migration problem persists. Furthermore, the TinkerPop people released their version 3 recently and made backward-incompatible changes. Bummer.
  2. I agree that the ideal solution is to be able to query the backend directly. But restricting queries to be read-only is not a built-in Neo4j feature yet--write queries would produce inconsistencies in SylvaDB. Therefore, I think that for now we will work on creating a downloadable dump option that includes Neo4j labels and relationship types.
  3. We will investigate this further. A member of the team, @davebshow, already has a functional parser for a Cypher-like DSL that we would like to extend to translate user-generated Cypher into SylvaDB's restricted Cypher.
  4. I see your point, but I'm not sure it avoids the deleting/recreating problem. Let's say I have an allowed relationship Has. That will create the internal SylvaDB _label storing the ID, plus, for example, a _label_neo4j storing Has. Tomorrow I decide to rename Has to Makes; _label will stay the same, and for new relationships SylvaDB will store _label_neo4j with the value Makes. But older relationships will still have _label_neo4j containing Has, producing inconsistencies among relationships of the same type. It is true that setting a property on a set of nodes should be less problematic than deleting/creating relationships. I might be wrong, but that still does not solve the problem of not having proper Neo4j labels and types available in Cypher.
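If the rename did go the property route, bringing the older data back in sync would at least be a plain bulk update rather than a delete/recreate. A sketch in Neo4j 2.x syntax, using the hypothetical `_label_neo4j` property from above; the internal relationship ID '7' is only illustrative:

```cypher
// After renaming Has to Makes, catch up older relationships of the
// same internal type with a single bulk SET:
MATCH (a)-[r]->(b)
WHERE r._label = '7' AND r._graph = '2'
SET r._label_neo4j = 'Makes';
```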

So I think that our approach will be option 2 first, and then 3. I will keep thinking of ways to improve this. Maybe we could add an advanced option to synchronize SylvaDB types and allowed relationships with Neo4j labels and types. That will take some time, but if the user is the one who initiates the action, they know what they are doing. This synchronization process could be executed at any time, but there would be a warning about how long it could take. It would run as a task and make the specific graph unavailable while the transaction is being executed. How does that sound?

brady77 commented 9 years ago

Just a quick comment on point 4: yes, this was exactly what I meant by suggesting those new 'system properties'. The whole idea behind my proposal was to make 'direct Cypher DB querying' possible using real names (even if a 'property query' will take longer to finish compared to a 'native query' using node labels and relationship types). At the same time, it should be cheaper for you to change a property on the affected nodes and (especially) relationships than to change labels and/or types.

But you are now proposing something much better: this 'user-initiated on-demand syncing' of labels and relationship types is actually one of the best solutions to the problem (at least from my point of view). It will provide better query performance and easier Cypher query building. A precondition is a 2.x Neo4j version, indeed.

As a matter of fact, I just wanted to make my problem clear, and I feel you grasped my thoughts entirely and immediately. All my subsequent posts are about how to do it rather than what I need. This is your home, and I am sure you will come up with an optimal solution. Should you need me for testing or anything else, let me know. I'm ready to help.

Thank you for listening.

versae commented 9 years ago

Thanks Petr.

I'm still thinking of ways to implement this. While my last proposal sounds like the best approach in your case, unfortunately there are some nuances. We can sync the current SylvaDB types and allowed relationships to Neo4j labels and types. But the problem is what happens when two or more graphs, from the same or different users, use the same name for a type. We keep track of those cases in the schema by creating a unique slug per type. Let's say that in graph A the type Person internally becomes the slug person, and in graph B the same type Person becomes person-2 to avoid overlapping the name. If we assign Person as a label in the Neo4j backend for graph A, and we do the same for graph B, Neo4j will have nodes labelled Person that actually belong to two different graphs, causing problems with queries. That is due to the lack of multitenancy in Neo4j.

However, if you are OK with using slugs, although a bit more verbose, we can proceed and plan the feature :)

brady77 commented 9 years ago

Oh, I see this obstacle...

Well, considering all of this, it seems to me that point no. 4 (using special named properties) is now the easiest solution to implement. I would be able to limit my Cypher query with the _graph property and select nodes and/or relationships by the _label_neo4j property value. You would have a cheap way of updating those system properties in all the relevant places within the subgraph (no deleting/recreating relationships). And as a bonus, you don't have to take special care about the uniqueness of the labels/reltypes. The update can be triggered: a) automatically, after the user changes the schema; b) manually (on demand), by the user from the menu. This can be valid until Neo4j comes with multitenancy support. I think that is not far away, because the last milestone already introduces some form of basic authentication.

No. 5 (syncing labels and types into the running database) is also perfectly valid. But the cost is higher: a) the missing multitenancy forces you to guarantee the uniqueness of label/reltype names across the whole system; b) updating relationships means deleting and recreating them. Regarding label/reltype uniqueness: if the distinguisher is always the same for all node labels/relationship types, then it will be quite straightforward to strip it off while parsing results. Let's say all the node labels are suffixed with the same ID that is used in the _graph property. The resulting label will be Disease code-2 or disease-code-2. If uniqueness is achieved differently (e.g. by using slugs), autostripping becomes almost impossible.

There is another point to mention: the query performance will probably be the same for both no. 4 and no. 5, because for every node and every relationship the `_graph` property has to be consulted, which significantly slows down the process. I'm not sure if indexing can be of any help.

I still hold the view that no. 4 will be sufficient and easy. And if you combine it with an option to dump the database in the 2.x version format (i.e. generating labels for nodes and relationships based on the real user data), it is perfect.

No. 5 is OK, too. But uniqueness should be achieved as mentioned above to allow for stripping.

Thank you, Javier, for paying so much attention to this issue. I'm just afraid that my highly demanding comments and opinions may put you off it. At the same time, I don't want to push you somewhere you don't want to go. Should you feel uncomfortable with any proposal, just go your own way. I know you will come up with a great solution based purely on the knowledge of my needs (as you have done with SylvaDB before).

versae commented 9 years ago

You make a good point. Maybe instead of using slugs, just suffixing the type with the internal schema ID is enough. I'll think about it.

Before labels existed in Neo4j, what are now called legacy indices were the only way to speed up queries. Therefore, adding that information, in a START clause or via the index support in Cypher, should improve performance over just checking the _graph property.
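For instance, with a legacy node index in place, a query could anchor on an index lookup instead of filtering every node. A sketch in Neo4j 1.x style, reusing the internal IDs from the example earlier in this thread; the index name "nodes" and the Lucene query are assumptions:

```cypher
// Start from a legacy index lookup ("nodes" is a hypothetical
// index name) rather than scanning _graph on every node:
START p = node:nodes('_label:8 AND _graph:2')
MATCH p-[r]->m
WHERE m._label = '9'
RETURN p, m;
```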

I will leave this thread open and discuss it with the team. An implementation of option 4 would probably be our first approach. But you know, it is not one of the priorities right now, so apologies in advance for delays in delivering the feature.

And thank you very much for your insights. It is only with real user input that we can build a cool platform.