Closed cgueret closed 8 years ago
OK, the thing to look at here is what the triggers actually are, but the culprit will be the fact that this https://github.com/bbcarchdev/spindle/blob/develop/twine/generate/triggers.c#L90 is indiscriminate. It probably needs to be made a bit smarter, so that we track which triggers are freshly-added and apply only those, rather than all of them (noting that when the flags are -1, indicating that the proxy's been completely re-built, all of the triggers are, and should continue to be, considered 'freshly-added').
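The suggestion above could be sketched roughly as follows. This is an illustration in Python rather than Spindle's own C, and it is not the code in triggers.c: the `Trigger` type, its `fresh` field and the `triggers_to_apply` function are all hypothetical names. The only detail taken from the comment is that flags of -1 mean the proxy was completely re-built.

```python
from dataclasses import dataclass

# From the comment above: flags of -1 indicate a completely re-built proxy.
ALL_FLAGS = -1


@dataclass
class Trigger:
    target: str          # hypothetical: identifier of the proxy to refresh
    fresh: bool = False  # hypothetical: True if added during the current pass


def triggers_to_apply(triggers, flags):
    """Select which triggers should fire for an update carrying `flags`."""
    if flags == ALL_FLAGS:
        # Full rebuild: every trigger counts as freshly-added.
        return list(triggers)
    # Otherwise apply only the triggers added during this pass.
    return [t for t in triggers if t.fresh]
```

With this selection, a partial update that adds no new triggers applies none of the existing ones, while a full rebuild still applies all of them.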
Ok! We'll have a look into it.
When ingesting the Shakespeare dataset, it was the central #terms that was causing this issue. Every time a new image or video was processed, it would trigger an update on terms, which would in turn trigger an update on all the images and videos pointing to it. Acropolis was taking several steps back for every single step forward. After some testing we realised that we could prevent this from happening by being more cautious about what can trigger what in terms of updates. If we prevent, say, a MEDIA update from triggering a MEMBERSHIP update, the problem is gone.
The relevant commits are: https://github.com/bbcarchdev/spindle/commit/613f48f3e8257a24c24fdce2dc622ca071be7e67 https://github.com/bbcarchdev/spindle/commit/1206bd78f0430b78ca9d40b3a6d9fc1f9ffe3b5f
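As a rough illustration of "being more cautious about what can trigger what" (not the code from the commits above), the restriction can be thought of as a mask applied before an update is propagated. The sketch is in Python rather than Spindle's C. The flag values, the `BLOCKED` table and the `propagated_flags` function are assumptions made for the example. Only the MEDIA and MEMBERSHIP names come from the comment above.

```python
# Hypothetical bit values for two kinds of update; only the names MEDIA and
# MEMBERSHIP come from the discussion, the numbers are made up.
MEDIA = 0x01
MEMBERSHIP = 0x02

# Hypothetical rule table: an update of the key's kind must not cause
# updates of the value's kind on the proxies it triggers.
BLOCKED = {MEDIA: MEMBERSHIP}


def propagated_flags(source_flags, trigger_flags):
    """Return the subset of `trigger_flags` that `source_flags` may trigger."""
    result = trigger_flags
    for kind, blocked in BLOCKED.items():
        if source_flags & kind:
            # Clear the blocked bits so this kind of update is not propagated.
            result &= ~blocked
    return result
```

Here `propagated_flags(MEDIA, MEMBERSHIP)` evaluates to 0, so a media update no longer causes a membership refresh, whereas `propagated_flags(MEMBERSHIP, MEMBERSHIP)` leaves the flag intact. One design choice in this sketch is that a mixed update containing MEDIA is also blocked. Whether that is desirable depends on how the real flags are combined.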
Looking at the status of all the proxy entities, the number of "completed" proxies drops at regular intervals: Browsing the data through a PSQL interface shows that some triggers are set to refresh a lot of resources: In particular, the following sequence generates the chain-saw (sawtooth) pattern:
When everything except _void-terms.nq from the Shakespeare data is ingested, the problem is gone and all the proxies are steadily processed until completion. That is, removing all the triples having "http://data.vm-10-100-0-20.ch.bbcarchdev.net/terms#id" as a subject "solves" the issue at hand.