bbcarchdev / spindle

RES Linked Open Data aggregation engine
https://bbcarchdev.github.io/spindle/
Apache License 2.0

Rule base can not process indirect properties #96

Open cgueret opened 7 years ago

cgueret commented 7 years ago

From a mail conversation:

Fundamentally the problem is that we need to tell spindle-generate that the source of some property is not the entity we are generating the proxy for, but one which is connected to it. There may be chains of these relationships. We can't embed arbitrary cross-graph SPARQL queries because it prevents us from moving away from a SPARQL-capable store in the future (which is more than simply a hypothetical).

Therefore, we need to tell the proxy-generation code that the source data for some property is reachable via some other property, and ensure that the triggers mechanism is used appropriately.

The simplest way to do this is to expand the rulebase vocabulary (and accompanying processing) such that it allows us to express this relationship.

The most obvious way to do this is to add a property to the current property descriptions. We can do this by introducing two new terms: spindle:via and spindle:viaInverse.

Let's say for example that we encounter a common pattern where mrss:player is expressed on an entity which has a ex:mediaObject relationship with the proxy we're generating. We can add the following:

mrss:player spindle:property [ olo:index 0 ; spindle:expressedAs mrss:player ; spindle:via ex:mediaObject ] .

In a related example, let's say the mrss:player is expressed on a creative work whose relationship to the proxy we're generating is a foaf:topic relationship (i.e., the creative work has a topic of the entity we're generating a proxy for). We can therefore say:

mrss:player spindle:property [ olo:index 0 ; spindle:expressedAs mrss:player ; spindle:viaInverse foaf:topic ] .
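To make the intended semantics of the two terms concrete, here is a minimal sketch over an in-memory set of triples. It assumes spindle:via means "follow the property from the entity to the source" and spindle:viaInverse means "follow it from the source to the entity"; the function name, the example identifiers and the tuple representation are all hypothetical and not part of spindle.

```python
# Toy triple store: a set of (subject, predicate, object) tuples.
# Every identifier here is illustrative, not real data.
TRIPLES = {
    ("ex:programme", "ex:mediaObject", "ex:clip"),
    ("ex:clip", "mrss:player", "http://example.org/player/1"),
    ("ex:review", "foaf:topic", "ex:programme"),
    ("ex:review", "mrss:player", "http://example.org/player/2"),
}


def relayed_values(triples, entity, prop, via=None, via_inverse=None):
    """Return (source, value) pairs for `prop` found one hop away from `entity`."""
    sources = set()
    if via is not None:
        # Assumed reading of spindle:via: entity --via--> source
        sources |= {o for s, p, o in triples if s == entity and p == via}
    if via_inverse is not None:
        # Assumed reading of spindle:viaInverse: source --via_inverse--> entity
        sources |= {s for s, p, o in triples if o == entity and p == via_inverse}
    # Collect the requested property from each source entity.
    return {(s, o) for s, p, o in triples if s in sources and p == prop}


forward = relayed_values(TRIPLES, "ex:programme", "mrss:player", via="ex:mediaObject")
inverse = relayed_values(TRIPLES, "ex:programme", "mrss:player", via_inverse="foaf:topic")
```

Returning the source alongside each value keeps enough information to register a trigger against the right entity afterwards.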

Once the rulebase parsing code has been updated to understand these kinds of expressions, the property-generation code will need to be updated to ignore them.

Next, we'll need to add a pass to the property-generation code to actually process these relationships and add the relayed properties to the proxy model. Initially this pass can generate simple SPARQL queries to locate the source entities; because we're limiting this to one degree of separation at a time, we can swap those SPARQL queries for SQL in future.
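As a rough illustration of how simple such a generated query could be, the sketch below builds a one-hop SPARQL SELECT as a string. The function name and the query shape are assumptions made for illustration; it ignores named graphs and does no URI escaping, which real code would need.

```python
def via_query(entity_uri, via_uri, prop_uri, inverse=False):
    """Build a one-hop SPARQL query for a via / viaInverse rule (illustrative only)."""
    if inverse:
        # source --via--> entity
        link = f"?source <{via_uri}> <{entity_uri}> ."
    else:
        # entity --via--> source
        link = f"<{entity_uri}> <{via_uri}> ?source ."
    return (
        "SELECT ?source ?value WHERE { "
        f"{link} "
        f"?source <{prop_uri}> ?value . "
        "}"
    )


query = via_query(
    "http://example.org/thing",
    "http://example.org/mediaObject",
    "http://search.yahoo.com/mrss/player",
)
```

Because the pattern is a single join on ?source, the same lookup maps directly onto a two-table SQL join if the store changes later.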

Then, at least in an initial implementation, we can add a TK_PROXY trigger between the entity we're generating and the entity the source property came from, taking care to ensure the relationship is expressed the correct way around!

At that point, we have a basic functional implementation of what we need in a couple of hundred lines of code.

We can then trigger this particular kind of second-degree update on a new flag, TK_PROXY_VIA, so that we don't bother to rebuild the whole proxy when this update occurs. The actual generation still needs to happen at more or less the same time as TK_PROXY, though, so TK_PROXY_VIA processing should happen immediately after TK_PROXY; we're just splitting the flags for more granular processing.
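The flag split could look something like the sketch below. The bit values, the class name and the pass names are placeholders rather than spindle's actual constants, and it assumes a full TK_PROXY rebuild must also re-run the via pass so that relayed properties are not lost.

```python
from enum import IntFlag


class TriggerKind(IntFlag):
    # Placeholder bit values; the real TK_* constants are defined in spindle itself.
    TK_PROXY = 1
    TK_PROXY_VIA = 2


def passes_for(flags):
    """Return the generation passes to run, in order, for a set of trigger flags."""
    passes = []
    if TriggerKind.TK_PROXY in flags:
        # Full rebuild of the proxy from its own source data.
        passes.append("rebuild_proxy")
    if flags & (TriggerKind.TK_PROXY | TriggerKind.TK_PROXY_VIA):
        # The via pass always runs after any rebuild, never before it.
        passes.append("relay_via_properties")
    return passes


only_via = passes_for(TriggerKind.TK_PROXY_VIA)
both = passes_for(TriggerKind.TK_PROXY | TriggerKind.TK_PROXY_VIA)
```

With only TK_PROXY_VIA set, the expensive rebuild is skipped and just the relayed properties are refreshed.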

Ideally, the named graph we use when populating the proxy model for this data will be the source entity's, but that's a nice-to-have.

The challenge, however, is detecting dependency loops, and I've not yet figured out how best to do that.
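One possible starting point, offered only as a sketch and not as a decided design, is a standard depth-first search over the trigger graph that reports the first cycle it finds. The `depends_on` mapping (entity to the entities whose updates trigger it) is a hypothetical structure; spindle's real trigger storage may look quite different.

```python
def find_cycle(depends_on):
    """Return one dependency cycle as a list of nodes, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    colour = {}
    path = []

    def visit(node):
        colour[node] = GREY
        path.append(node)
        for nxt in depends_on.get(node, ()):
            state = colour.get(nxt, WHITE)
            if state == GREY:
                # nxt is already on the current path, so we have closed a loop.
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        colour[node] = BLACK
        return None

    for node in list(depends_on):
        if colour.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


loop = find_cycle({"a": ["b"], "b": ["c"], "c": ["a"]})
no_loop = find_cycle({"a": ["b"], "b": []})
```

This version is recursive, so very long chains would need an explicit stack instead; it also checks the whole graph at once, whereas a real implementation would more likely check incrementally when a trigger is added.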

Implementing this will make it possible to process data modelled with EDM.

townxelliot commented 7 years ago

While a single-step spindle:via rule could work for some cases, there are others where we may have to cope with arbitrarily long chains. For example:

:Work ex:hasEditions :Editions .
:Editions ex:hasEdition :Edition .
:Edition ex:hasCover :Cover .
:Cover ex:mediaObject :Media .
:Media ex:hasFormat <http://foo.bar/image.jpg> .
:Media ex:hasFormat <http://foo.bar/image.png> .

What is the target RDF we want from this? Perhaps:

:MediaProxy foaf:topic :WorkProxy .
:MediaProxy mrss:content <http://foo.bar/image.jpg> .
:MediaProxy mrss:content <http://foo.bar/image.png> .

(This would enable someone searching for the work with media=image to get all of the images relating to all of its editions.)

NB Deciding what the proxy RDF should look like is also part of the problem, as we have to make a judgement about the data we eventually want in the RES index. Should someone be able to get all the cover images via the work? Because a work is considered a top-level interesting kind of thing by Acropolis, presumably the work should be the thing which gets the mrss:content statement. Or should a user have to drill down through the editions collection, to an individual edition, then to the cover for that edition, then to the abstract media relating to that cover, and finally to embeddable files for that media? This seems a lot of work to get a book cover, so directly relating the work to its embeddable media seems more useful.

So how do we describe the chain between :MediaProxy and :WorkProxy using spindle:via rules? In this case, I think we need a structure which allows an arbitrarily-long, ordered series of steps between resources.

Do we need to consider rules which use spindle:via to point to an ordered list of transitions between resources via specific properties? (Though this has its own problems, as the rules may become domain-specific very quickly.)
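To show what an ordered list of transitions would mean in practice, here is a small sketch that walks the example data above one property at a time. The `follow_chain` helper and the tuple-based store are hypothetical; how such a path would actually be written in the rulebase is still the open question.

```python
# The example data from above as (subject, predicate, object) tuples.
TRIPLES = {
    (":Work", "ex:hasEditions", ":Editions"),
    (":Editions", "ex:hasEdition", ":Edition"),
    (":Edition", "ex:hasCover", ":Cover"),
    (":Cover", "ex:mediaObject", ":Media"),
    (":Media", "ex:hasFormat", "http://foo.bar/image.jpg"),
    (":Media", "ex:hasFormat", "http://foo.bar/image.png"),
}


def follow_chain(triples, start, path):
    """Follow an ordered list of properties from `start`; return what is reached."""
    frontier = {start}
    for prop in path:
        # Each step is a single one-hop lookup, so it stays expressible without SPARQL.
        frontier = {o for s, p, o in triples if s in frontier and p == prop}
    return frontier


to_media = ["ex:hasEditions", "ex:hasEdition", "ex:hasCover", "ex:mediaObject"]
media = follow_chain(TRIPLES, ":Work", to_media)
images = follow_chain(TRIPLES, ":Work", to_media + ["ex:hasFormat"])
```

Each step is the same one-degree lookup as a single via rule, so a chain is just repeated application; the cost is that the path itself encodes knowledge of one particular data model.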