TREEcg / specification

RDF vocabulary and hypermedia specification to publish your Linked Data using search trees
https://w3id.org/tree/specification

How to best handle large numbers of `tree:member`s within the same `tree:Node`? #49

Closed ekulno closed 3 years ago

ekulno commented 3 years ago

I'm developing a TREE API (for Triply) and notice performance issues when fetching tree:Nodes for ranges which contain a large number of tree:members. I'm wondering what the best practice is for handling this.

I'm currently using a granularity level on a time predicate (prov:generatedAtTime) to determine which entities to return for each tree:Node. For instance, with the granularity set to one hour, the URL containing 2020-01-01T00:00 would return data for all entities whose prov:generatedAtTime value lies between 2020-01-01T00:00 and 2020-01-01T01:00. This makes the size of each tree:Node data-dependent. Simply making the granularity smaller would not resolve the issue, since any number of entities can have the exact same value for prov:generatedAtTime.
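A minimal sketch of this bucketing, to illustrate why node size depends on the data. The helper names `bucket_start` and `node_url`, the `my:api/` base, and the sample timestamps are assumptions made for illustration, not part of the actual API.

```python
from collections import Counter
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def bucket_start(ts, granularity):
    """Round a naive timestamp down to the start of its bucket."""
    whole_buckets = (ts - EPOCH) // granularity  # integer number of buckets
    return EPOCH + whole_buckets * granularity

def node_url(ts, granularity, base="my:api/"):
    """Hypothetical URL scheme: base path plus the bucket start time."""
    return base + bucket_start(ts, granularity).strftime("%Y-%m-%dT%H:%M")

# Hypothetical data: three entities share one timestamp, a fourth is later.
stamps = [datetime(2020, 1, 1, 0, 15)] * 3 + [datetime(2020, 1, 1, 1, 5)]

# Number of members per node at one-hour granularity.
node_sizes = Counter(node_url(ts, timedelta(hours=1)) for ts in stamps)

# Even at the finest granularity, identical timestamps share a bucket.
finest_sizes = Counter(bucket_start(ts, timedelta(microseconds=1)) for ts in stamps)
```

Here `node_sizes` groups the three identical timestamps under `my:api/2020-01-01T00:00`, and `finest_sizes` shows that they remain in one bucket no matter how small the granularity gets.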

Some possible solutions

Use traditional pagination args in tree:Node URLs

Let's say my:api/2020-01-01T00:00 returns 200 entities, and has tree:relation [a tree:GreaterThanOrEqualToRelation; tree:node <my:api/2020-01-01T10:00>]. I could change this so that my:api/2020-01-01T00:00 returns only the first, say, 100 entities and links to <my:api/2020-01-01T00:00?page=2>, which would return the next 100 entities. <my:api/2020-01-01T00:00?page=2> would in turn link to <my:api/2020-01-01T10:00>, as there is no further page within the time range.
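The page chaining described here could be sketched as follows. The function `paginate_node`, the dictionary layout, and the placeholder member IRIs are hypothetical; which tree:Relation type the page links should use is left open, as it is in the description above.

```python
def paginate_node(node_url, members, next_node_url, page_size=100):
    """Split one node's members into pages.

    Each page links to the next page of the same time range; the last page
    links to the next time-range node instead.
    """
    chunks = [members[i:i + page_size]
              for i in range(0, len(members), page_size)] or [[]]
    pages = []
    for index, chunk in enumerate(chunks, start=1):
        # The first page keeps the original node URL, later pages add ?page=N.
        url = node_url if index == 1 else f"{node_url}?page={index}"
        if index < len(chunks):
            next_url = f"{node_url}?page={index + 1}"
        else:
            next_url = next_node_url
        pages.append({"url": url, "members": chunk, "next": next_url})
    return pages

# Hypothetical data matching the example: 200 entities in one time range.
entities = [f"r:{n}" for n in range(1, 201)]
pages = paginate_node("my:api/2020-01-01T00:00", entities,
                      "my:api/2020-01-01T10:00")
```

With these inputs `pages` holds two entries of 100 members each: the first links to `my:api/2020-01-01T00:00?page=2`, and the second links on to `my:api/2020-01-01T10:00`.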

Use tree:import to separate the navigation and entity data

Instead of returning the entity data together with the navigation data, the entity data could be made available under a different API path, which I would reference with a tree:import statement. For example: <my:api/2020-01-01T00:00> tree:import <my:api/2020-01-01T00:00/entities>. With this solution navigation over the nodes would not be slowed down by fetching the entity data, but performance issues would still occur when a client follows these tree:import links.

Use a tree:import for each tree:member

A variation of the above approach is for the tree:Node paths to still return IRIs for all entities which belong to that tree:Node, but to put additional data about the entities behind separate imports. For example, a certain node would return <my:collection> a tree:Collection; tree:member <r:1>, <r:2>. <r:1> tree:import <my:api/describe/r:1>. <r:2> tree:import <my:api/describe/r:2>. This would still cause issues for tree:Nodes containing very big sets of entities, but in my particular case it's the expanded descriptions of the resources that cause issues.
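A rough sketch of how a server might render such a node response, only to make the shape of this option concrete. The function `render_node` and the `my:api/describe/` path are assumptions, and the output presumes the `tree:` prefix is declared elsewhere in the document.

```python
def render_node(collection, members, describe_base="my:api/describe/"):
    """Render member IRIs plus one tree:import per member as Turtle-like text.

    Only the IRIs are listed in the node itself; the full description of
    each entity stays behind its own import URL.
    """
    if not members:
        return f"<{collection}> a tree:Collection."
    member_list = ", ".join(f"<{m}>" for m in members)
    lines = [f"<{collection}> a tree:Collection; tree:member {member_list}."]
    for m in members:
        # Hypothetical describe endpoint, one per member.
        lines.append(f"<{m}> tree:import <{describe_base}{m}>.")
    return "\n".join(lines)

node_body = render_node("my:collection", ["r:1", "r:2"])
```

The response still grows by two statements per member, so this only helps when, as here, the cost lies in the expanded descriptions rather than in the number of members.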

pietercolpaert commented 3 years ago

I think a tree:import should always be a last resort (we’ll make sure that’s reflected in the spec as well soon). I would definitely go for option 1!

ekulno commented 3 years ago

I'll go with 1 then, thanks!