NRGI / resource-projects-etl

ETL processes for rp.org
GNU General Public License v2.0

Virtuoso SYS_SPARQL_HOST does not behave correctly #48

Open Bjwebb opened 8 years ago

Bjwebb commented 8 years ago

To set up virtuoso to have live/staging endpoints, you must run something like:

DB.DBA.RDF_GRAPH_GROUP_CREATE('http://staging.resourceprojects.org/data/', 1);
DB.DBA.RDF_GRAPH_GROUP_CREATE('http://resourceprojects.org/data/', 1);
insert into DB.DBA.SYS_SPARQL_HOST (SH_HOST, SH_GRAPH_URI) values ('%live%', 'http://resourceprojects.org/data/');
insert into DB.DBA.SYS_SPARQL_HOST (SH_HOST, SH_GRAPH_URI) values ('%staging%', 'http://staging.resourceprojects.org/data/');

This only needs to be run once for that virtuoso instance.

However, http://staging.resourceprojects.org/data/ seems to show data belonging to both graph groups.

Bjwebb commented 8 years ago

This is now automated for the tests.

We should also automate/document the dev/live setup.

The staging URL is still not behaving correctly. If it were, this if statement in the tests could be removed: https://github.com/NRGI/resource-projects-etl/blob/5975453710205ba3a71cbfdbaed2308f4b9d4113/fts/test_dataload.py#L46

Bjwebb commented 8 years ago

It looks like the problem here is that only the entry that sorts first alphabetically in DB.DBA.SYS_SPARQL_HOST is honoured. I don't think this is the expected behaviour.

I'm going to try updating virtuoso to see if this is fixed, otherwise I'll try reporting upstream.

Bjwebb commented 8 years ago

Neither the latest stable 7 branch nor the develop branch fixes this problem. I might try the stable 6 branch to see if that behaves any differently, and then file an issue upstream.

@timgdavies Do you know anyone who's got this working, and what version of Virtuoso they were using? I came across https://github.com/neontribe/Linked_Development/blob/master/Linked_Development/overlay/opt/tools/multi-end-points.isql, but I can't even replicate that exact setup.

timgdavies commented 8 years ago

See http://sourceforge.net/p/virtuoso/mailman/message/26203357/ which mentions including port numbers. Worth trying?

I believe we had this working in the Neontribe case you link to... but I know there were some challenges around this.

Just to check:

If the latter, is there a workaround where we write the queries to select from a given graph?

Bjwebb commented 8 years ago

It is the latter, so writing queries to select from the appropriate graph group is a possible workaround.
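A minimal sketch of that workaround, assuming the queries are built from templates in Python. The names `GRAPH_GROUPS`, `QUERY_TEMPLATE` and `query_for` are hypothetical, and the query itself is only a placeholder; only the two graph group IRIs come from the setup above. It also assumes Virtuoso accepts a graph group IRI in a `FROM` clause, which has not been checked here.

```python
# Hypothetical helper: only the two IRIs come from the setup in this issue.
GRAPH_GROUPS = {
    "live": "http://resourceprojects.org/data/",
    "staging": "http://staging.resourceprojects.org/data/",
}

# Placeholder query; doubled braces are literal braces for str.format().
QUERY_TEMPLATE = """SELECT ?s ?p ?o
FROM <{graph}>
WHERE {{ ?s ?p ?o }}
LIMIT 10"""


def query_for(environment):
    """Return the placeholder query restricted to one environment's graph group."""
    return QUERY_TEMPLATE.format(graph=GRAPH_GROUPS[environment])


if __name__ == "__main__":
    print(query_for("staging"))
```

The cost of this approach is that every query has to carry the `FROM` clause.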

Running two copies of virtuoso would be another approach (without the need to modify every query, but with extra resource requirements).

timgdavies commented 8 years ago

I would certainly think keeping to one virtuoso instance makes more sense for simplicity of deployment - particularly as we might be scaling up.

We can define a default graph (the equivalent of adding a 'from' clause) using a Virtuoso pragma, so I would suggest we update our queries to include that.

I.e. at the top of each query, depending on whether it is staging or live being accessed, we add:

define input:default-graph-uri <http://resourceprojects.org/data/>

or

define input:default-graph-uri <http://staging.resourceprojects.org/data/>

I think we can set the default graph that Virtuoso will use in queries at /sparql/ somehow - so we could set that to live, but document how people can change that to staging if they need to.
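As a rough sketch of what adding the pragma could look like if done programmatically rather than by hand: the helper name `with_default_graph` and the `DEFAULT_GRAPHS` mapping are hypothetical, and only the pragma text and the two IRIs are taken from above.

```python
# Hypothetical mapping from environment name to default graph IRI.
DEFAULT_GRAPHS = {
    "live": "http://resourceprojects.org/data/",
    "staging": "http://staging.resourceprojects.org/data/",
}


def with_default_graph(query, environment):
    """Prepend the Virtuoso default-graph pragma for the given environment."""
    pragma = "define input:default-graph-uri <{}>".format(DEFAULT_GRAPHS[environment])
    return pragma + "\n" + query


if __name__ == "__main__":
    print(with_default_graph("SELECT * WHERE { ?s ?p ?o } LIMIT 10", "staging"))
```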

Bjwebb commented 8 years ago

Actually, after some brief digging in the lodspeakr source code, it looks like we should be able to do this directly in the settings file, which avoids having to modify every single query file:

$conf['endpointParams']['config'] = array(
    'default-graph-uri' => 'http://staging.resourceprojects.org/data/'
);

I'll look into setting this up.

I'll also set the /sparql endpoint to default to live, as you suggest.
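For reference, `default-graph-uri` is also a standard query parameter in the SPARQL protocol, so the effect of the setting can be illustrated outside lodspeakr. The sketch below assumes lodspeakr simply forwards these parameters on each request, which I have not verified; the function name `sparql_url` and the endpoint address are hypothetical.

```python
from urllib.parse import urlencode


def sparql_url(endpoint, query, default_graph_uri):
    """Build a SPARQL protocol GET URL that pins the default graph."""
    params = {"default-graph-uri": default_graph_uri, "query": query}
    return endpoint + "?" + urlencode(params)


if __name__ == "__main__":
    # The endpoint address is an assumption for illustration only.
    print(sparql_url(
        "http://localhost:8890/sparql",
        "SELECT * WHERE { ?s ?p ?o } LIMIT 10",
        "http://staging.resourceprojects.org/data/",
    ))
```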

Bjwebb commented 8 years ago

We have a working setup now by configuring lodspeakr in this way. I'm planning on leaving this issue open so that we can look into what's going wrong with the virtuoso setup in future.

Extra steps before I bump this back to the Ready column: