Open brady77 opened 9 years ago
Hi Petr,
First of all, thanks for using SylvaDB and for the kind words about it. Messages like this are why we write software :D
You are completely right. The main reason for using numeric ids instead of the labels or types that Neo4j supports is that SylvaDB has been around since well before Neo4j 2.0, when labels were introduced and Cypher became usable. So we needed to figure out a good way to make SylvaDB feature-complete and future-proof. Back then, we felt that node types were a feature we needed, and so we decided to store the SylvaDB schema type ids as a reserved property on the nodes. To be consistent, we did the same thing for relationships and ignored the Neo4j relationship types.
On the bright side, this design allows us to plug in any other graph backend, since all we need is a property-value store on the nodes and relationships. That avoids being tied to Neo4j, which matters because the industry changes faster than we can code. Furthermore, it allows node and relationship types to be changed at no cost, a feature that still does not exist in Neo4j.
Unfortunately, as a side effect, this made it practically impossible to use the Neo4j backend without the SylvaDB frontend.
That being said, we understand that the expressive power of our query builder is still pretty limited. We are thinking about ways to integrate algorithms, like shortestPaths, into the query builder as well. But it is more of a visual design challenge than a technical one.
In summary, I see three choices:
_label and the Neo4j labels of a node, every time a user decides to change a type, we would have to remove the labels associated with all the nodes and add the new label to all of them. For relationship types it is even more expensive, as we would have to recreate all the affected relationships and delete the old ones. I think that we could add a SUPPORT_FOR_LABELS option in the GRAPHDATABASES setting to allow this behaviour without penalizing the advanced users that don't need it.
Well, as you see, every option has advantages and drawbacks, and it is hard to make decisions that we will not regret in the future :) We believe that option 2 is the most reasonable one in terms of the time needed to implement it. But option 3 would make a nice feature to have, even more so taking into account the future SylvaDB API that we are planning.
Please, let me know your thoughts.
Thank you for the insight, Javier.
You probably know Structr. Even if the purpose of that system is completely different from SylvaDB, the way they store data is probably closer to what I describe. Unfortunately, I don't know how they deal with changes in the schema. Structr is much more of a "Neo4j only" system, so there is no need to be "DB independent", in contrast to your strategic goals.
I'm not a programmer, nor do I understand the architecture of information systems, so sorry if I am completely wrong, but... I came across Tinkerpop Blueprints recently, which is probably something that could address your concerns about independence. This model supports various database systems, so it may be a viable way to keep your application consistent regardless of the current DB. Just a hint...
Well, the first option (where nodes are labeled and relationships are typed) is closest to my needs, but from your point of view the most complicated, probably.
Let me suggest another option:
Summary of what I think:
It is my pleasure to discuss this with you, Javier.
Petr.
PS: Just a note regarding the Neo4j version: now I see why you are still at 1.x... you don't need the node labels, so why migrate, right?
Sure. Thank you, Petr, for the healthy discussion :+1:
Has. That will create the internal SylvaDB _label storing the id, plus, for example, a _label_neo4j storing Has. Tomorrow I decide to rename Has to Makes; _label will stay the same, and for new nodes SylvaDB will store _label_neo4j with the value Makes. But older nodes will still have _label_neo4j containing Has, producing inconsistencies among the nodes of the same type. It is true that setting a property on a set of nodes should be less problematic than deleting and creating relationships. I might be wrong, but that does not solve the problem of not having proper Neo4j labels and types available in Cypher.
So I think that our approach will be 2 first, and then 3. I will keep thinking about ways to improve this. Maybe we could add an advanced option to synchronize SylvaDB types and allowed relationships with Neo4j labels and types. That will take some time, but if the user is the one who initiates the action, they know what they are doing. This synchronization process could be executed at any time, but there would be a warning about how much time it could take. It would execute in a task and make the specific graph unavailable while the transaction is being executed. How does that sound?
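To make the rename problem concrete, here is a minimal Python sketch using plain dictionaries as stand-ins for stored nodes. The helper names (make_node, sync_names) are hypothetical and are not SylvaDB code; only the _label and _label_neo4j property names and the Has/Makes values come from the discussion above. It shows the stale names left behind by a rename, and the bulk property update that would bring them back in line.

```python
# Hypothetical in-memory stand-in for nodes stored in the backend.
# "_label" holds the SylvaDB schema id; "_label_neo4j" holds the readable name.
def make_node(type_id, type_name):
    return {"_label": type_id, "_label_neo4j": type_name}


def sync_names(nodes, type_id, current_name):
    """Rewrite the readable name on every stale node of one type.

    Returns the number of nodes that had to be updated.
    """
    updated = 0
    for node in nodes:
        if node["_label"] == type_id and node["_label_neo4j"] != current_name:
            node["_label_neo4j"] = current_name
            updated += 1
    return updated


# Two nodes are created while the type is still called "Has".
nodes = [make_node("8", "Has"), make_node("8", "Has")]

# The type is renamed to "Makes"; only nodes created afterwards get the new name.
nodes.append(make_node("8", "Makes"))
names_before = {n["_label_neo4j"] for n in nodes}

# A sync pass over the graph removes the inconsistency.
changed = sync_names(nodes, "8", "Makes")
names_after = {n["_label_neo4j"] for n in nodes}
```

Before the sync, nodes of the same type carry two different readable names; afterwards they carry one. The cost of the pass grows with the number of nodes of that type, which is why it would make sense to run it as an explicit, user-initiated task.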
Just a quick comment on point 4: Yes, this was exactly what I meant by suggesting those new 'system properties'. The whole idea behind my proposal was to make 'direct cypher DB querying' possible using real names (even if the 'property query' takes longer to finish than the 'native query' using node labels and relationship types). At the same time, it should be cheaper for you to change a property on the affected nodes and (especially) relationships than to change labels and/or types.
But you are now proposing something much better: this 'user-initiated on-demand syncing' of labels and relationship types is actually one of the best solutions to the problem (at least from my point of view). This will provide better query performance and easier cypher query building. A precondition is a 2.x Neo4j version, indeed.
As a matter of fact, I just wanted to make my problem clear, and I feel you grasped my thoughts entirely and immediately. All my subsequent posts are about how to do it rather than about what I need. This is your home and I am sure you will come up with an optimal solution. Should you need me for testing or anything else, let me know. I'm ready to help you.
Thank you for listening.
Thanks Petr.
I'm still thinking about ways to implement this. While my last proposal sounds like the best approach in your case, unfortunately there are some nuisances. We can sync current SylvaDB types and allowed relationships to Neo4j labels and types. But the problem is what happens when there are two or more graphs, from the same or different users, using the same name for a type. We keep track of those cases in the schema by creating a unique slug per type. Let's say then that in graph A the type Person internally becomes the slug person, and in graph B, the same type Person becomes person-2 to avoid a name clash. If we assign Person as a label in the Neo4j backend for graph A, and we do the same for graph B, Neo4j will have nodes labelled as Person that actually belong to two different graphs, causing problems with queries. That is due to a lack of multitenancy in Neo4j.
However, if you are still OK with using slugs, although a bit more verbose, we can proceed and plan the feature :)
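As an illustration of the slug behaviour described above, here is a small Python sketch. The function unique_slug and its numbering rule are assumptions made for the example, not the actual SylvaDB implementation; the only facts taken from the thread are that Person becomes person in graph A and person-2 in graph B.

```python
def unique_slug(name, taken):
    """Return a slug for `name` that is not yet in `taken`, and reserve it.

    Hypothetical rule: lowercase the name, replace spaces with hyphens,
    and append -2, -3, ... until the slug is unused.
    """
    base = name.lower().replace(" ", "-")
    slug = base
    counter = 1
    while slug in taken:
        counter += 1
        slug = f"{base}-{counter}"
    taken.add(slug)
    return slug


# One shared set of slugs for the whole installation, because the
# backend has no per-graph namespace.
taken_slugs = set()
slug_graph_a = unique_slug("Person", taken_slugs)  # type Person in graph A
slug_graph_b = unique_slug("Person", taken_slugs)  # type Person in graph B
```

Used as Neo4j labels, the two slugs keep the graphs apart, at the price of labels that no longer match the names the users typed.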
Oh, I see this obstacle...
Well, if I consider all of this - it seems to me that the point No.4 (using special named properties) is the easiest solution to implement, now.
I would be able to limit my cypher query with the _graph property and select nodes and/or relationships by the _label_neo4j property value.
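A query of that kind could look roughly like the following Python sketch, which only builds the query text and its parameters. The function name is hypothetical, the _label_neo4j property is the one proposed in this thread and does not exist yet, and the query is written in Neo4j 2.x MATCH syntax with the {name} parameter style of that era; on a 1.x server a START clause would be needed instead.

```python
def nodes_by_type_name(graph_id, type_name):
    """Build a Cypher query selecting the nodes of one named type in one graph.

    Returns the query text and a parameter dictionary; nothing is executed.
    """
    query = (
        "MATCH (n) "
        "WHERE n._graph = {graph} AND n._label_neo4j = {label} "
        "RETURN n"
    )
    params = {"graph": graph_id, "label": type_name}
    return query, params


# Graph id "2" is the value that appears in the data stored for my graph.
query, params = nodes_by_type_name("2", "Person")
```

Passing the values as parameters instead of pasting them into the query string avoids quoting problems with type names that contain spaces or quotes.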
You will have a cheap way of updating those system properties in all relevant places within the subgraph (no deleting / recreating relationships). And as a bonus you don't have to take special care about the uniqueness of the labels/reltypes.
The update can be triggered:
a) automatically after user changes the schema
b) manually (on demand) by the user from the menu
This can remain valid until Neo4j adds multitenancy support. I think that is not far away, because the last milestone already introduces some form of basic authentication.
No.5 (syncing labels and types into running database) is also perfectly valid. But the cost is higher:
a) the missing multitenancy forces you to guarantee uniqueness of label/reltype names across the whole system
b) updating a relationship means deleting and recreating it
Regarding the label / reltype uniqueness: if the distinguisher is always the same for all the node labels / relationship types, then it will be quite straightforward to strip it off while parsing results. Let's say that all the node labels are suffixed with the same id that is used in the _graph property. The resulting label will be Disease code-2 or disease-code-2. If uniqueness is accomplished differently (e.g. by using slugs), this will make autostripping almost impossible.
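The stripping step could be as simple as the following Python sketch. The function name is hypothetical, and it assumes the suffix is always a hyphen followed by the value of the _graph property, as in the disease-code-2 example.

```python
def strip_graph_suffix(label, graph_id):
    """Remove a trailing '-<graph_id>' from a label, if it is present."""
    suffix = f"-{graph_id}"
    if label.endswith(suffix):
        return label[: -len(suffix)]
    return label


# With a uniform suffix the original name is always recoverable.
stripped_slug = strip_graph_suffix("disease-code-2", "2")
stripped_name = strip_graph_suffix("Disease code-2", "2")

# With counter-based slugs the number is unrelated to the graph id,
# so a client that only knows the _graph value cannot strip it.
unstripped = strip_graph_suffix("person-2", "7")
```

The last call shows the autostripping problem: a slug such as person-2 living in a graph whose id is 7 comes back unchanged, because the client has no way to know which number was appended.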
There is another point to mention: the query performance will probably be the same for both No.4 and No.5, because for every node and every relationship the _graph property has to be consulted, which significantly slows the process. I'm not sure whether indexing can be of any help.
I still hold the view that No.4 will be sufficient and easy. And if you combine it with an option to dump the database in the 2.x version format (i.e. generating labels for nodes and relationships based on real user data), it is perfect.
No.5 is OK, too. But uniqueness should be achieved as mentioned above to allow for stripping.
Thank you, Javier, for paying so much attention to this issue. I'm just afraid that my highly demanding comments and opinions may put you off it. At the same time I don't want to push you somewhere you don't want to go. Should you feel uncomfortable with any proposal, just go your own way. I know you will come up with a great solution based just on the knowledge of my needs (as you always have with SylvaDB).
You made a good point. Maybe instead of using slugs, just suffixing the type with the internal schema id is enough. I'll think about it.
Before labels existed in Neo4j, what are now legacy indices were the only way to speed up queries. Therefore adding that information, in a START clause or using the indices support in Cypher, should improve performance over just checking the _graph property.
I will leave this thread open and discuss it with the team. An implementation of 4 would probably be our first approach. But you know, it is not one of the priorities right now, so sorry in advance for delays in delivering the feature.
And thank you very much for your insights. It is only with real users' input that we can build a cool platform.
Hi,
SylvaDB is a great system. It fulfills my needs almost perfectly. Nevertheless, I have some issues...
You describe SylvaDB on GitHub as "... a Relaxed-Schema Graph Database Management System." The main problem with this definition is that there is a difference between data stored inside the SylvaDB system and data stored in Neo4j database. I'll try to explain my issue using an example:
In the 'Schema' I create a new 'Type' with the name 'Person' and in the 'Properties' section I fill in a 'Key' called 'Full name'. In the same way I add a new type 'Movie' with a key 'Title'. Next I create new 'Allowed Relationship' using 'Person' as 'Source', 'Movie' as 'Destination' and 'ACTS_IN' as the 'Name' of the relationship.
Then I put some data into the SylvaDB. First I add 'New Person' with a key 'Full name' holding the value 'Keanu Reeves'. Then I use 'New Movie' to create a node with the key 'Title' and the value 'Matrix'. At the same time I fill in the field '<- ACTS_IN' by searching for the value 'Keanu Reeves'.
I made this detailed description just to show you what I would expect to be created in the Neo4j database (following your statement that SylvaDB is a GraphDB Management System): 1) Node A with a 'label' = 'Person' and with a property key 'FullName' = 'Keanu Reeves' 2) Node B with a 'label' = 'Movie' and with a property key 'Title' = 'Matrix' 3) Relationship of 'type' = 'ACTS_IN' between node A and node B (even if your Neo4j version does not support labels, you are using the '_label' key to mimic this function, which is fine).
If I look into the Neo4j database with Cypher, I get the following data:
Node[3]{_id:3,_label:"8",_graph:"2",Full name:"Keanu Reeves"}
Node[4]{_id:4,_label:"9",_graph:"2",Title:"Matrix"}
:8[1] {_id:1,_label:"8",_graph:"2"}
The first problem here is the substitution of node '_label' value by some SylvaDB internal IDs. The second problem is the substitution of relationship type by another internal ID. I'm afraid I don't know the logic of assigning those IDs and even if I knew it, it would be really hard for a human to translate those IDs into some meaningful queries and results.
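To show what that translation involves, here is a minimal Python sketch. The two lookup tables are hypothetical: the id-to-name pairs are inferred from the three records above and from the schema I created, and in reality this information lives in the SylvaDB schema, outside Neo4j. The function name is also made up for the example.

```python
# Assumed lookup tables, inferred from the example data above.
# Node types and relationship types are numbered separately,
# which is why the id "8" appears in both.
NODE_TYPE_NAMES = {"8": "Person", "9": "Movie"}
RELATIONSHIP_TYPE_NAMES = {"8": "ACTS_IN"}


def with_readable_label(record, names):
    """Return a copy of a record with an extra '_label_name' entry."""
    readable = dict(record)
    readable["_label_name"] = names.get(record["_label"], "<unknown>")
    return readable


keanu = {"_id": 3, "_label": "8", "_graph": "2", "Full name": "Keanu Reeves"}
acts_in = {"_id": 1, "_label": "8", "_graph": "2"}

node_view = with_readable_label(keanu, NODE_TYPE_NAMES)
relationship_view = with_readable_label(acts_in, RELATIONSHIP_TYPE_NAMES)
```

The same id resolves to Person for the node and to ACTS_IN for the relationship, so without the schema tables at hand the stored values cannot be interpreted.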
The need for internal IDs within the Neo4j database is obvious: SylvaDB works with objects, and those objects have to be identified somehow. But why do those IDs replace important values and/or types? From my point of view, internal IDs should be an addition to the data, not a replacement. If your intention was to make user changes simple (like changing the relationship type name, which is quite complicated within Neo4j, but not impossible), the cost is too high. Data stored in Neo4j are useless without the SylvaDB frontend. This comes to the surface as soon as the 'Queries' interface is not sufficient for a certain type of query (like getting the shortest path between nodes, etc.).
I can imagine how complicated a rework of the system to allow for real values / types could be. Nevertheless, would you consider changing the system accordingly? There is no better system for working with the Neo4j graph database than SylvaDB. It is simple, intuitive, flexible, powerful and user-friendly. And it can be even better...
Thanks for your opinion.
Petr