bbcarchdev / spindle

RES Linked Open Data aggregation engine
https://bbcarchdev.github.io/spindle/
Apache License 2.0
2 stars 1 forks source link

Loops in triggers #94

Open cgueret opened 7 years ago

cgueret commented 7 years ago

When ingesting a data using a resource as the named graph a loop is created in the triggers. Example: the following file sample.txt generates:

screen shot 2016-10-14 at 10 07 51
cgueret commented 7 years ago

Patch proposed in https://github.com/bbcarchdev/spindle/tree/SPINDLE-94

CygnusAlpha commented 7 years ago

Stage https://github.com/bbcarchdev/spindle/commit/2a2606b88cbc7f45e62ae5e6988cf4c9a75b9a33

cgueret commented 7 years ago

The main issue here is the named graph in the sample set as "http://example.org/works/as_you_like_it#id". If in the sample the named graph is updated to "http://example.org/works/as_you_like_it" there is no issue.

The triple "http://example.org/works/as_you_like_it http://xmlns.com/foaf/0.1/primaryTopic http://example.org/works/as_you_like_it#id" creates a membership trigger (because http://example.org/works/as_you_like_it#id partOf http://example.org/works/as_you_like_it). Both "http://example.org/works/as_you_like_it" and "http://example.org/works/as_you_like_it#id" also get triggers from the named graph. If that named graph is "http://example.org/works/as_you_like_it#id" we get "http://example.org/works/as_you_like_it" partOf "http://example.org/works/as_you_like_it#id" : a loop.

cgueret commented 7 years ago

Rather than preventing the creation of loops we can implement a system to make the processing stop at some point. Capturing some mail discussion:

The fix for the ones you're seeing now is pretty trivial, it's just not the fix you've proposed - there'll be activity on that later on.

The issue we'll see brought about by the below could theoretically happen already but is made more likely by the below.

The fix for those is to break down what a normal cyclic dependency check would do into stages which can execute in parallel and with big gaps. Thus, what would be a loop becomes the normal processing flow; what would be a variable becomes a column in the database tables.

Given that, we need to do two things: perform some reference counting and checking (trivial), and determine when the reference-count should be decremented versus when it should be left alone (because recursion is occurring, and less trivial).

On the first, this is simply the case of adding a reference count to the state table, skipping (without aborting) processing if the value is already above a certain threshold.

In a normal single-thread circular dependency algorithm you track which nodes you've 'unresolved' (currently evaluating) and which you've 'resolved' (have evaluated successfully); you iterate each node in turn, adding it to the 'unresolved' list when you begin evaluating; for each of its outbound edges (directly-connected nodes), if the edge isn't 'resolved', but is 'unresolved', then you throw a circular reference error. If all of the outbound edges have been resolved, or have not been seen, you remove the node being processed from the 'unresolved' list and add it to the 'resolved' list.

We don't have the ability to do that because the graph can be infinitely large. Thus, we need to instead be selective about where and under what conditions the reference count is decremented.

There are variants of the above algorithm - for example, make will use reference-counting and a threshold instead of a binary flag (for similar reasons to us), but the principle is generally the same.

Therefore, we need to figure out what conditions the reference-count on an entity being processed should NOT be decremented. I would propose that the following occurs:

  1. when processing begins, check the reference count in the state table and semi-silently skip processing and further triggering when the count reaches our threshold (which should be configurable, and have a default value greater than one). I say semi-silently because we should log something, but not fail.
  2. increment the reference count of the proxy by the count of the triggers which would be activated as a result of our processing (which may be slightly more than the number of triggers which are actually activated due to existing states meaning the queue update is a no-op)
  3. when processing is completed (whether skipped or not), decrement the count of each inbound trigger's originating proxy by one.

This would also solve the issue we've got in flight at the moment, but in a sledgehammer-to-crack-a-nut sort of way which masks the actual problem.

cgueret commented 7 years ago

The revert https://github.com/bbcarchdev/spindle/commit/df10c5ab5e6007a2e200e04988f7f3d2d1dd7129 actually removed a necessary fix and now the loops are back. The fix https://github.com/bbcarchdev/spindle/commit/5280c2dec6b69d06e555d8118b769244301807d4 is a good thing but this does not solve the issue at hand :-(

nevali commented 7 years ago

The proposed fix in df10c5a breaks certain kinds of trigger relationship (where there's a codependence but the flags are different).

Can you elaborate on why 5280c2d doesn't solve the issue at hand? Elliot did an ingest on Thursday with this version of the code which was previously looping.

cgueret commented 7 years ago

The issue is explained in length earlier in this thread. The problem is that because every resource becomes a part of the information resource describing it, and because in the case of the test sample that information resource is actually the non information one, this creates a loop. Here is another screenshot of what I currently have in the triggers:

screen shot 2016-11-21 at 10 11 52

See in particular the lines ee4 -> 8c6 and 8c6 -> ee4.

nevali commented 7 years ago

Ah, yes.

Okay, two moderately quick fixes needed — one is to apply the fix from df10c5a to membership, not to triggers

The other is the named graph fragment fix (which will need a bit of care and attention to the database to resolve).

Once that's done, then a proper implementation of refcounting circular dependency tracking can be implemented.

rjpwork commented 7 years ago

Is there a document or specification anywhere about how the "refcounting circular dependency tracking" will work?

nevali commented 7 years ago

It's described above

townxelliot commented 7 years ago

I did an ingest of some Shakespeare data from Richard Light last week, which was OK. But I think this is because I created the nquads myself, and did them in such a way that I stripped /#.+$/ from resource URIs to get the named graph URIs. This prevents the situation Christophe describes, as information resources (without a #) will end up in the same graph as non-information resources (with #id), instead of in a circular relationship.

I've since tried to ingest some Shakesepare data from the dataset-sampler project, which comes from shakespeare.acropolis.org.uk. This data does appear to have triples in a relationship which causes the loop, as described by Christophe.

The attached file is a minimal case which demonstrates the problem (attempting to import these two triples never terminates): loop_creator.nq.txt

rjpwork commented 7 years ago

Oh, yes, I was hoping for more of a formal / pseudocode kind of specification + some examples of how it would work. The description above is a bit vague and difficult to visualise without a worked example showing what happens when and how it prevents loops.

nevali commented 7 years ago

It might be difficult to visualise, but the vagueness is troubling me, can you elaborate?

rjpwork commented 7 years ago

"when processing begins, check the reference count in the state table" - doesn't specify which reference count (assumption is that it's the URI you're about to create triggers from?)

"skip processing and further triggering when the count reaches our threshold" - what happens if the threshold is 4 and we need to trigger 5 things from this URI? Is the count evaulated on every potential trigger or just at the start of processing a URI?

"which may be slightly more than the number of triggers which are actually activated" - if the reference count goes up by 5 but only 3 triggers are activated, how does the extra 2 get accounted for?

"when processing is completed (whether skipped or not)" - how can you complete skipped processing?

"decrement the count of each inbound trigger's originating proxy by one" - if X triggers A, B, C, then presumably processing A, B, C, will lead to a decrement on X because A->X, B->X, C->X, etc? Or is this suggesting that X triggering A leads to a decrement on A?

A worked example would help to clarify these a lot.

nevali commented 7 years ago

@rjpwork - happy to clarify

well, the processing cycle in spindle-generate is always focussed around entities to which we've assigned proxy UUIDs, and and those are are what's tracked by the state table (this was the context in which I said “add a reference count”)

as this is in the context of generate a proxy, "skip processing" expands to "skip the contents of spindle_generate()"

"which may be slightly more than…” —the only reason you'd skip applying a trigger is because it's already been triggered

"when processing is completed (whether skipped or not)" means even if you skip the bulk of spindle_generate(), the step needs to still occur (i.e., equivalent to a finally block).

because triggers are always directional (i.e., proxy X triggers updates on A, B, and C), inbound trigger implies the reverse; thus at the end of processing of each of A, B and C would be decremented.

cgueret commented 7 years ago

With respect to this test example it turns out it did not came from the Shakespeare data dump but was generated, with errors, from another script. So the Shakespeare data is actually fine and there is nothing to fix there...

rjpwork commented 7 years ago

@nevali Ta. Just a couple more.

"thus at the end of processing of each of A, B and C would be decremented" - presumably this is decrementing X at the end of processing each of A, B, C.

What happens if we increment X by 4 (because we'd activate 4 triggers) but only 3 triggers are actually activated? Can this happen?

Does this algorithm work in isolation or does it need to include things like "if we've already got A:X as DIRTY in the state table, ignore another A:X trigger"? Or even if A:X is COMPLETE? Or both?

Apologies for the questions but without a state diagram or worked example, it's difficult to visualise how it works and how it prevents loops.

nevali commented 7 years ago

@rjpwork yes, sorry, decrementing X.

mismatched increments and decrements shouldn't happen; the exception is if we decide a trigger is actually invalid (e.g., it's an item triggering itself), at which point we should just ignore it.

This particular algorithm should work in isolation — it certainly ought not to be status-dependent.

Don't apologise for the questions — although tracing the flow on paper to aid visualisation might be a good way to quickly get up to speed with what's going on generally!

rjpwork commented 7 years ago

@nevali Is it possible that processing proxy A can queue up a trigger (or set of triggers) which then starts processing the next proxy B before the full set is read from proxy A? ie. the reference count can be "partial"?

rjpwork commented 7 years ago

Right, I'm going to work through this graph and see if I can follow this algorithm.

+-----+           +-----+
|     |           |     |
|  E  +----------->  A  |
|     |           |     |
+--^--+           +--+--+
   |                 |
   |     +-----------------------+
   |     |           |           |
   |  +--v--+     +--v--+     +--v--+
   |  |     |     |     |     |     |
   +--+  B  <-----+  C  |     |  D  |
      |     |     |     |     |     |
      +-----+     +-----+     +-----+
townxelliot commented 7 years ago

There is still a loop when ingesting certain types of data, e.g. where there is a non-information resource like:

<http://example.com/foo>

and an associated information resource like:

<http://example.com/foo#data>
  foaf:primaryTopic <http://example.com/foo>

I've attached a set of n-quads in this kind of relationship, which causes a loop to occur. Ingesting this with twine on the command-line should demonstrate the issue.

12night.original.nq.zip

cgueret commented 7 years ago

In those quads having http://richardlight.org.uk/Plays/shakespeare/id/12night as a named graph surely does not help. But yes otherwise Acropolis can not process information resources using a fragment. It will strip the fragment out and use the base as the resource name, potentially then creating a loop when assuming the subjects of the named graph are partOf that named graph.

nevali commented 7 years ago

@townxelliot okay, there'a GIGO issue in that case which we should prevent from cascading (that actual triple is nonsense and we should arguably special-case triples matching that general pattern and ignore them).

rjpwork commented 7 years ago

For the web-semantically challenged amongst us (ie me), why is that triple nonsense? (Admittedly, this ticket isn't the best place for this discussion but since it came up here...)

nevali commented 7 years ago

because it's back to front; it states that something that can only be a non-information resource (i.e., a conceptual thing of some kind) has as its primary topic something which we know must be an information resource — the latter is in principle valid in the general case, the former isn't. It's saying "my chair is about this invoice" rather than "this invoice is about my chair".

townxelliot commented 7 years ago

I'd argue that the triple isn't nonsense: there are plenty of instances where non-information resources have non-hash URIs, and information resources have hash URIs, whether or not that's correct according to semantic web conventions (where it's usually non-information resources which have hash URIs).

(Note that http://example.com/foo#data is an information resource, so it can have the non-information resource http://example.com/foo as its primaryTopic. It's just the URIs which are "wrong".)

For example, I'm not sure of the status of terms in an ontology (whether they are information or non-information resources; I'd say non-information), but I've seen both non-hash URIs (http://xmlns.com/foaf/0.1/Agent) and hash URIs (http://purl.org/NET/c4dm/event.owl#Event) used to define them. They can't both be "correct" by your definition.

Also, there are bound to be cases in the wild where the two are mixed up or used incorrectly (according to the convention), so some guard against this seems advisable.

nevali commented 7 years ago

http://example.com/foo#data cannot possibly be an information resource in RES terms, because it's not a resource URL. a request is never made for that URL.

nickshanks commented 7 years ago

I agree with one thing @nevali said; the URI <http://example.com/foo#data> cannot represent a retrievable information resource (IR). My proposal is therefore that any triple which indicates it is an IR (as subject or object) should be discarded. If other statements contradict this, and indicate that it is not an IR, this would still cause the URL to be queued and crawled. This would require that the software know which predicates indicate that the subject or object is an IR, and not blindly queueing every URL it comes across. Doing this, however, would not stop the Anansi crawler from looping, it would only break the referential loop in the triple store for predicates we care about. Is that sufficient to fix this ticket?

It would also not resolve logic errors for other predicates which we don't understand, for example when:

<a> foo:contains <b> .
<b> foo:contains <a> .

is defined to be illogical by the foo namespace, yet:

<a> owl:sameAs <b> .
<b> owl:sameAs <a> .

is fine.

Having not been aware of this issue until yesterday, and not yet seen any code, I expect I have misunderstood at least two or three important points somewhere along the line so please point them out to me.

townxelliot commented 7 years ago

Just to be devil's advocate: we may come to a point in the future where non-information resources (dogs, people) can be transferred over the internet as information. Even within my lifetime, books (in the sense of "a thing your eyes can physically read") have gone from being non-information resources to information resources. What happens to the NIR hash URIs when the objects they denote can be transmitted as information? Surely the NIR/IR distinction is purely a technological one?

nickshanks commented 7 years ago

I don't agree that books have gone from non-information resources to information resources. The information resource you're referring to is a digital encoding of the text and images within a particular book (perhaps even a specific edition or publication of a book). Editions and publications are intangible non-information resources. Tangible non-information resources would include "this copy of such and such, which I've owned since I was eight and once dropped in a puddle". All four of those would have different URIs, and only one is retrievable.

nevali commented 7 years ago

@townxelliot The point remains that a HTTP/HTTPS URI with a fragment cannot ever in LOD terms be an IR because the protocol doesn't work that way.

@nickshanks FWIW The crawler itself isn't relevant here, this is purely about the data processing.