Currently working on adding, removing AND editing
@ColmMassey BTW, do you have a specific use case for removing data?
I can imagine using LimeSurvey for adding new data (one entry at a time) or editing old data.
Uploading new data to a graph (from us or whoever else) should only mean editing old entries and uploading new ones, OR synchronizing their whole dataset with the data in the graph (effectively removing the old graph and uploading a new one with the new data).
The synchronizing can be done either by adding/editing/deleting, or by uploading a new graph to replace the old one. I guess my question is: what specifically is the point of doing this? What are we intending to achieve by implementing atomic updates rather than full synchronizations (as we are doing currently), except making our code work with updates from LimeSurvey?
We discussed this in a chat. Perhaps list here the use cases you have now designed solutions for.
Right, so after the chat, it seems that this issue might be a bit of a priority. The tasks for now:
- Add new data
- Edit data
- Remove data

(rough SPARQL shapes for these three are sketched below)

General tasks:

- Generate and secure this API with tokens
- Flag/log that the cache is outdated for all maps referencing this graph (and a system for the maps to update)
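For reference, the three data operations map fairly directly onto SPARQL 1.1 UPDATE forms (which also ties into the server upgrade discussed further down). A minimal sketch; the graph URI, subject, and property here are illustrative placeholders, not our real schema:

```ruby
# Rough SPARQL 1.1 UPDATE shapes for the three operations above.
# Graph URI, subjects and properties are illustrative placeholders.
GRAPH = '<http://example.org/graph/test>'.freeze
LABEL = '<http://www.w3.org/2000/01/rdf-schema#label>'.freeze

# Add new data: insert triples into the named graph.
ADD_DATA = <<~SPARQL
  INSERT DATA { GRAPH #{GRAPH} {
    <http://example.org/initiative/1> #{LABEL} "Old name" .
  } }
SPARQL

# Edit data: replace whatever label the entry currently has.
EDIT_DATA = <<~SPARQL
  WITH #{GRAPH}
  DELETE { <http://example.org/initiative/1> #{LABEL} ?old }
  INSERT { <http://example.org/initiative/1> #{LABEL} "New name" }
  WHERE  { <http://example.org/initiative/1> #{LABEL} ?old }
SPARQL

# Remove data: delete every triple about the entry.
REMOVE_DATA = <<~SPARQL
  WITH #{GRAPH}
  DELETE { <http://example.org/initiative/1> ?p ?o }
  WHERE  { <http://example.org/initiative/1> ?p ?o }
SPARQL
```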
To achieve this, we essentially need to make a layer on top of Virtuoso: i.e. a request goes to the API, and the API changes the data on the server and in the Virtuoso database.
There are a couple of ways to go about this task:
Since we have a bunch of tasks to handle for RDF generation, changes to the current files/database, and caching (which are all Ruby code currently), I suggest we employ a simple Ruby API that handles caching and requests (a rough sketch of this option follows). I suggest we use either Ruby on Rails (API only) or Sinatra, since our library is in Ruby; it will work well at scale. I am also thinking about another potential solution, which involves auto-generating the HTML/RDF/TTL files from the database entries, if this turns out not to be feasible.
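For what it's worth, here is a minimal sketch of what the Sinatra option could look like. The endpoint URL, route shape, and payload format are assumptions, and token checks, file regeneration, and cache invalidation are stubbed out:

```ruby
# Minimal Sinatra sketch of the API layer over Virtuoso.
# The endpoint URL, route shape and payload format are assumptions;
# auth tokens, file regeneration and cache flagging are stubbed out.
require 'sinatra'
require 'net/http'
require 'uri'

SPARQL_ENDPOINT = URI('http://localhost:8890/sparql')

# Send a SPARQL UPDATE to Virtuoso's HTTP endpoint.
def sparql_update(update)
  Net::HTTP.post_form(SPARQL_ENDPOINT, 'query' => update)
end

# Hypothetical route: add one entry's triples (sent as N-Triples) to a graph.
post '/graphs/:graph/entries' do
  # TODO: check an auth token, regenerate files, flag caches as stale.
  res = sparql_update(
    "INSERT DATA { GRAPH <#{params[:graph]}> { #{request.body.read} } }"
  )
  halt 502, 'Virtuoso update failed' unless res.is_a?(Net::HTTPSuccess)
  status 204
end
```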
The MVP of this task would be to make some PHP scripts that just update the database, without signalling caching, editing files, or handling security tokens. Handling caching and synchronizing the files with the database would require a bit more than PHP (i.e. we would save a lot of time by employing something else, especially if it is in Ruby). I started working on the MVP, but we do need to make a decision here. @ColmMassey
On further research, it might be worth compiling all the Ruby to a dynamic library and writing the API in .NET/Java, which are supported directly by Virtuoso. That way we add fewer redundancies to the whole project, and we can start using some of the additional Virtuoso functionality that is already there. (I know both .NET and Java, so we can use either.)
Update: the SPARQL endpoint can be secured via Virtuoso using OAuth, so tokens wouldn't be needed and we might skip some steps.
@ColmMassey @wu-lee We should update our Virtuoso server so it's at least able to use SPARQL 1.1. This should probably be on a separate ticket.
@ColmMassey @wu-lee
Sounds like we need a Zoom discussion for this one.
For logistical reasons, I'd very much prefer to install from the Ubuntu repository using apt rather than from a binary, whether from a third party or built ourselves. The deployment of our servers would need to be rethought otherwise, and installing Virtuoso was already the fiddliest part.
One way to do that might be to upgrade to a newer Ubuntu release (dev-0 and sea-0 are still on Xenial, which is quite an old long-term support release). Bionic Beaver is the latest LTS release, and a new LTS, Focal Fossa, is due right about now (but it doesn't seem to have appeared yet).
However, it looks like even Bionic only packages Virtuoso 6.1:
https://launchpad.net/ubuntu/+source/virtuoso-opensource
A second-best option might be a third-party PPA, so the same install process can be used, with a bit of extra config to add the PPA.
There is talk of alternative PPAs on some Virtuoso issues. However, this is old information from years ago, and the repo is not available any more; it seems to have been abandoned.
We might want to see what comes out with Focal Fossa.
I suggest @wu-lee & @dtmakm27 discuss to narrow down the options, and then pull me into a voice call.
https://launchpad.net/ubuntu/+source/virtuoso-opensource has been updated, but there is still only Virtuoso 6.1 on offer for Focal Fossa and Groovy Gorilla.
At the moment, LimeSurvey doesn't seem to make it easy to identify the changed parts of the response data without downloading and inspecting it. Plus, as far as I know (not knowing exactly what Dean has changed recently), the Ruby sausage machinery is designed to work on an entire graph at once.
I wonder if this task might require a non-trivial rewrite of the sausage machine. Is this warranted?
But if we were planning a rewrite, I was wondering if perhaps we might want to look at Solid, as that seems to include a NodeJS-based toolkit for RDF manipulation.
Confusingly, although it claims that "currently, our Solid servers support a subset of SPARQL 1.1" (1, 3), Solid doesn't seem to have implemented SPARQL yet. That said, there is some talk of the "Small Data pattern" obviating the need for SPARQL, which I don't yet understand, but it might be another angle we should know about.
> At the moment, LimeSurvey doesn't seem to make it easy to identify the changed parts of the response data without downloading and inspecting it. Plus, as far as I know (not knowing exactly what Dean has changed recently), the Ruby sausage machinery is designed to work on an entire graph at once.
LimeSurvey's limitations, and the fact that it won't be used to collect large datasets, mean it shouldn't drive the use cases for atomically updating a graph.
> Confusingly, although it claims that "currently, our Solid servers support a subset of SPARQL 1.1" (1, 3), Solid doesn't seem to have implemented SPARQL yet. That said, there is some talk of the "Small Data pattern" obviating the need for SPARQL, which I don't yet understand, but it might be another angle we should know about.
We need to talk to Happy Dev about this, I suggest. I expect we won't have time for this in May, so let's schedule a meeting in early June.
At present, we always rebuild graphs from scratch. There are many reasons we will need to incrementally add/remove data from a graph.
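One possible shape for that, assuming we can get the old and new versions of a dataset as collections of N-Triples lines (all names here are hypothetical): compute a triple-level diff and apply only the difference, instead of dropping and re-uploading the whole graph.

```ruby
# Sketch: incremental graph update instead of a full rebuild.
# old_triples / new_triples are assumed to be collections of N-Triples
# lines; where they come from (files, a CONSTRUCT query) is left open.
require 'set'

# Work out which triples to delete and which to insert.
def graph_diff(old_triples, new_triples)
  old_set = old_triples.to_set
  new_set = new_triples.to_set
  { to_delete: old_set - new_set, to_insert: new_set - old_set }
end

# Turn the diff into one small SPARQL 1.1 update request.
def diff_to_update(graph_uri, diff)
  <<~SPARQL
    DELETE DATA { GRAPH <#{graph_uri}> { #{diff[:to_delete].to_a.join("\n")} } };
    INSERT DATA { GRAPH <#{graph_uri}> { #{diff[:to_insert].to_a.join("\n")} } }
  SPARQL
end
```

For small edits this touches a handful of triples rather than the whole graph, which is the point of atomic updates over full synchronization.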