ArangoDB-Community / arangodb-tinkerpop-provider

An implementation of the Tinkerpop OLTP Provider for ArangoDB
Apache License 2.0
84 stars 16 forks source link

Feature Request: support more and 2048 node/edge labels #64

Open kuzeko opened 3 years ago

kuzeko commented 3 years ago

Currently ArangoDB cannot support more than 2048 node/edge labels (in total?) since all labels are mapped to collections.

See: https://github.com/ArangoDB-Community/arangodb-tinkerpop-provider/issues/63

Some comments:

1 - A new feature would be required to decide whether to store 1-label-per-collection or labels as properties.

2 - This will need to take into account also how indexing/traversal is handled for queries like:

g.V().out('label1').out('label2')...

3 - Currently the documentation should mention upfront this limitation

MartinBrugnara commented 3 years ago

@arcanefoam IIRC, the adapter for tinkerpop 2.x worked around this problem by puting all nodes in a collection V and all edges in a collection E. But this solution would surely have performance implications...

arcanefoam commented 3 years ago

Hi, as stated on #54 the provider architecture and the use of ArangoDB's named graph API makes this change a difficult one.

However, this particular one adds the extra burden of not using collections for labels. This would probably result in not being able to use the Graph syntax of AQL thus a complete re-write of the AQL queries used by the current implementation. Thus, implementation wise we would need to change the architecture to add something like a "AQL translator" and then we could provide different implementations, one with Graph AQL, one with plain AQL (plus the integrity checks for #54).

kuzeko commented 3 years ago

Sorry, but does this means that, in the end, the real problem is that Arango graphs and AQL itself have a physical limit to number of node/edge labels? So we are trying to fix with a hack in the gremlin driver a real limitation in the underling system?

arcanefoam commented 3 years ago

In the current implementation each label is mapped to a collection, as @dothebart mentioned, there is a limit of 2048 collections and as a result a limit to the number of labels supported by the arango tinkerpop provider.

The decision to model labels as collections was an architectural one driven by the decision to rely on the ArangoDB AQL support for graphs (see #58). If labels are modelled as vertex/edge properties then we could have two collections (Vertex, Edges), but we would need to use "normal" AQL and implement all the graph logic in the provider. It is possible, but hard.

Does this clarify your question?

kuzeko commented 3 years ago

Not completely, does Arango DB natively have a concept of node/edge labels? If yes, I understand that in AQL a node label is the same of a node collection, ditto for edges. If this is also correct, then, trying to have this driver support more node/edge labels, is the equivalent of trying to have the driver overcome a limitation of the underlying system. Here is then my question: are we trying to achieve this ?

arcanefoam commented 3 years ago

No, ArangoDB does not have a concept of label, label is a Tinkerpop concept.

So, for implementation I had two choices: 1. Map labels to collections, i.e. a collection contains all nodes/edges with the same label; 2. Map labels to document properties, i.e. each document will have a property called "label" with a value of the corresponding node/edge label.

  1. Good: because it matches the behaviour of ArangoDB graphs, meaning we can translate most tinkerpop expressions to AQL's graph syntax and we get graph integrity for free. Users can use tinkerpop to analyse their existing grpahs. Bad: it forces users to define the graph "schema" beforehand, i.e. node and edge labels must be known a priori. Limited to 2048 labels.

  2. Good: No need for schema. No 2048 label limit. Bad: We can't use AQL's graph syntax. We don't get grapgh integrity for free. Adding the label as a property means we would need to "corrupt" existing graphs by adding additional label property. Possible performance overhead for the additional filter by property vs select from collection (until we benchmark it is impossible to know the actual hit).

I went with 1. To support more than 2048 labels we would need to use 2. So for 2048 we would need to overcome a limitation of the tinkerpop provider implementation, as in change how it works.

Does this explanation makes it easier to understand?

On Mon, 24 Aug 2020, 18:43 M. Lissandrini, notifications@github.com wrote:

Not completely, does Arango DB natively have a concept of node/edge labels? If yes, I understand that in AQL a node label is the same of a node collection, ditto for edges. If this is also correct, then, trying to have this driver support more node/edge labels, is the equivalent of trying to have the driver overcome a limitation of the underlying system. Here is then my question: are we trying to achieve this ?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ArangoDB-Community/arangodb-tinkerpop-provider/issues/64#issuecomment-679270372, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAQOU3M3KJJGBSKJ5LWJSH3SCKRDDANCNFSM4QHFUDQQ .

kuzeko commented 3 years ago

Thanks, this is all clear.

But I still believe we are trying to overcome Arango's limitation: not having labels (which is worse than only allowing 2048 labels)

Anyway, for reference to others coming to the issue, here I found the relevant bit: https://www.arangodb.com/docs/stable/graphs.html#multiple-edge-collections-vs-filters-on-edge-document-attributes

If you want to only traverse edges of a specific type, there are two ways to achieve this. The first would be an attribute in the edge document - i.e. type, where you specify a differentiator for the edge - i.e. "friends", "family", "married" or "workmates", so you can later FILTER e.type = "friends" if you only want to follow the friend edges.

Another way, which may be more efficient in some cases, is to use different edge collections for different types of edges, so you have friend_edges, family_edges, married_edges and workmate_edges as collection names. You can then configure several named graphs including a subset of the available edge and vertex collections - or you use anonymous graph queries, where you specify a list of edge collections to take into account in that query. To only follow friend edges, you would specify friend_edges as sole edge collection

The multiple edge collections approach is limited by the number of collections that can be used simultaneously in one query. Every collection used in a query requires some resources inside of ArangoDB and the number is therefore limited to cap the resource requirements. You may also have constraints on other edge attributes, such as a hash index with a unique constraint, which requires the documents to be in a single collection for the uniqueness guarantee, and it may thus not be possible to store the different types of edges in multiple edge collections.

So, if your edges have about a dozen different types, it’s okay to choose the collection approach, otherwise the FILTER approach is preferred. You can still use FILTER operations on edges of course. You can get rid of a FILTER on the type with the former approach, everything else can stay the same.

dothebart commented 3 years ago

ArangoDB doesn't have an as strict data model to how edges and vertices have to look like as other solutions available.

For ArangoDB all that sets appart edges from regular documents is, that they live in special edge collections which demand (and index) the availability of the _from and _to keys. I.e. in the arangodb web interface, you can choose any attribute to be shown as an edge or vertex label.

https://www.arangodb.com/docs/stable/aql/graphs-traversals.html#filtering-edges-on-the-path demonstrates how to classify edges by an additional property.

kuzeko commented 3 years ago

Is not a "strict" data model, it is the property graph model. As far as I know, also in other solutions you can always filter for properties on edges, yet , when searching for labels you benefit from an additional speedup because this is using some special data-structure or index. The way I see it, in ArangoDB the only thing that can mimic that and offers similar advantages are "collections", which are limited though.

dothebart commented 3 years ago

Hi, In ArangoDB you can use vertex centric indices for that. They index one of _from and _to and the Attribute you want to use as label. Hence if you want to traverse the graph in forward and backward direction (in AQL IN/OUT|ANY), you need to create two indices.

So its probably a question of the count of your label, whether the cost of maintaining the additional edge collections, or the different indices are higher. If the count is low, (and index selectivity bad for that matter) going the multiple edge collection way is probably better.

kuzeko commented 3 years ago

@dothebart the edge collection is not an options, because i have some 4 thousands edge types

dothebart commented 3 years ago

Yes, for your usecase vertex centric indices are definitely the way to go. I just wanted to mention, that both ways are viable options, and one should choose by ones own usecase.

So maybe @arcanefoam (or you?) can create the option to choose between vertex centric indices and collections.