TREEcg / extract-cbd-shape

Get all RDF triples/quads related to an entity based on CBD and a SHACL shape
https://treecg.github.io/extract-cbd-shape/
MIT License

Performance increase idea: from a graphs-to-ignore list to a graphs-to-look-into list #27

Open pietercolpaert opened 6 months ago

pietercolpaert commented 6 months ago

Once we have all the graphs to ignore, the first step is making sure that the CBD and Shape Templates algorithm only looks further into the graphs it doesn’t need to ignore.

pietercolpaert commented 6 months ago

At this moment we do, for example, this in Path.ts:

    let quads = (
      inverse
        ? store.getQuads(null, this.predicate, focusNode, null)
        : store.getQuads(focusNode, this.predicate, null, null)
    ).filter((q) => !graphsToIgnore.includes(q.graph.value));

If, however, we pass the graphs to look into instead of graphsToIgnore, we can change this piece of code into:

    let quads = [];
    for (const graph of graphs) {
      quads = quads.concat(
        inverse
          ? store.getQuads(null, this.predicate, focusNode, graph)
          : store.getQuads(focusNode, this.predicate, null, graph),
      );
    }

Which I believe could go a lot faster thanks to the graph index in the RDFStore.

Getting to the list of graphs is easy: we take all graphs that are in use and remove the graphs that we currently put in the ignore list (the other member IRIs).
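As a minimal sketch of that last step, assuming graph names are handled as plain strings and the list of all graphs in use is already available: the helper name `graphsToLookInto`, the example IRIs, and the use of `""` for the default graph are hypothetical and not part of extract-cbd-shape.

```typescript
// Hypothetical helper: all graphs in use, minus the graphs on the ignore list.
function graphsToLookInto(allGraphs: string[], graphsToIgnore: string[]): string[] {
  // A Set makes each membership check O(1) instead of scanning the ignore list.
  const ignore = new Set(graphsToIgnore);
  return allGraphs.filter((g) => !ignore.has(g));
}

// Assumed example data: "" stands for the default graph, the others are member IRIs.
const allGraphs = ["", "http://example.org/member1", "http://example.org/member2"];
const graphs = graphsToLookInto(allGraphs, ["http://example.org/member2"]);
console.log(graphs);
```

The resulting list could then drive the per-graph loop shown earlier, so that only the graphs of interest are queried.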

pietercolpaert commented 5 months ago

Status update:

We forked rdf-stores and added a termCardinalities index. We opened a pull request upstream, but it is not going to be accepted: https://github.com/rubensworks/rdf-stores.js/pull/8

For the time being we could use our experimental fork. Alternatively, instead of forking, we could provide an extended RDF store with termCardinality support by creating our own interface with precisely the functionality we need. If an RDF store is provided nonetheless, we can then still add the index we need to make it work for our use case.

    interface RDFConnectQuadStore {
      getSubjects(predicate, object, graph); // gets a set of subjects
      getObjects(subject, predicate, graph); // gets a set of objects
      getGraphs(); // gets a set of graphs
      store: RdfStore; // exposes the main store itself, to be used as-is
    }

ajuvercr commented 5 months ago

I would like to find a path where end users don't have to worry about which store they pass along. So maybe even change RdfStore to RDF/JS: Dataset and try to build efficient things around that.

The main problem is that detecting which graphs are present in the datastore is an O(n) operation if it is not supported by the store implementation, and the result cannot be cached because that store instance might change between extract operations.
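To make that cost concrete, here is a minimal sketch of the fallback for a store without a graph index, assuming only that the dataset can be iterated quad by quad (as an RDF/JS DatasetCore can). The `QuadLike` type and the `collectGraphs` name are hypothetical stand-ins, not an existing API.

```typescript
// Hypothetical, reduced view of a quad: only the graph term's value is needed here.
interface QuadLike {
  graph: { value: string };
}

// Fallback without a graph index: every quad has to be visited once, hence O(n).
function collectGraphs(dataset: Iterable<QuadLike>): Set<string> {
  const graphs = new Set<string>();
  for (const quad of dataset) {
    graphs.add(quad.graph.value);
  }
  return graphs;
}

// Assumed example data; a plain array is iterable, just like a dataset would be.
const exampleQuads: QuadLike[] = [
  { graph: { value: "http://example.org/member1" } },
  { graph: { value: "http://example.org/member1" } },
  { graph: { value: "http://example.org/member2" } },
];
const foundGraphs = collectGraphs(exampleQuads);
console.log(foundGraphs.size);
```

Because the result depends on the current contents of the dataset, this scan would have to be repeated for every extract operation unless the store itself keeps the set of graphs up to date.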

There are of course options in the darker parts of JavaScript, but let's not sacrifice our sanity in the pursuit of performance.

pietercolpaert commented 5 months ago

Yeah, I also don’t see it as optimal, but I do think it works better for now than using an experimental rdf-stores fork.

pietercolpaert commented 4 months ago

I think it makes sense to make this store something reusable within RDF Connect, and also to export a store interface for members. That would allow for sorting on timestampPath, for example?