Query the Islandora triplestore without having direct access to the triplestore

mjordan commented 6 years ago

Title (Goal)	Query the Islandora triplestore without having direct access to the triplestore
Primary Actor	Developer
Scope	Specific service
Level	High
Story	As a developer, I want to be able to perform SELECT queries against Islandora's triplestore, but the sysadmin has locked out all access to the triplestore other than from the Drupal server (which is probably a very good idea). This could be accomplished by having Drupal act as a proxy to the secured triplestore. Queries would be issued to Drupal via HTTP; Drupal would then pass them on to the triplestore's SPARQL endpoint, which in turn would return results to Drupal and eventually to the requesting client. Of course, Drupal would control access to this proxy.

mjordan commented 6 years ago

We could have a similar use case for querying Solr as well.

whikloj commented 6 years ago

@mjordan this seems like a bad idea. It's essentially making an end run around the desires of your sysadmin, no?

Would this not be better handled by having your sysadmin allow certain IPs or ranges access to the SPARQL endpoint?

If there is a Drupal-based query front-end then I could see using that for end-users, but what I think you want could be achieved with an Apache ProxyPass statement.

mjordan commented 6 years ago

@whikloj sorry, I didn't mention that my sysadmin suggested the proxy approach since it allows us to control access using Drupal permissions rather than configuring firewalls, etc.

I'm not familiar with how you'd use ProxyPass in this case, but that approach is definitely worth looking into.

whikloj commented 6 years ago

@mjordan ok, I think I misunderstood. So this would be a Drupal REST endpoint that, if you had the correct permissions, would forward the body of your request on to a configured endpoint?

Because we don't actually have the URL of the triplestore in Drupal yet, this could easily be a standalone module.

How fine-tuned would you want the permissions? For instance, would it need to check for INSERT and/or DELETE statements? Would an interface in Drupal be okay, or does it need to be available for curl requests from outside with appropriate credentials?

mjordan commented 6 years ago

So this would be a Drupal REST endpoint that, if you had the correct permissions, would forward the body of your request on to a configured endpoint?

That's right, sorry if I didn't explain that clearly. It would need to be available to external clients with appropriate credentials.
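To make the forwarding step concrete, here is a minimal sketch in Python (not Drupal code) of what the proxy would do once a client has authenticated: check a permission, then re-issue the query as a SPARQL Protocol POST to the configured endpoint. The endpoint URL, the permission name, and the function name are all hypothetical; nothing in this thread has settled on them.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical values: neither the URL nor the permission name comes from this thread.
TRIPLESTORE_ENDPOINT = "http://localhost:8080/bigdata/namespace/islandora/sparql"
REQUIRED_PERMISSION = "query triplestore"


def build_forward_request(query, user_permissions, endpoint=TRIPLESTORE_ENDPOINT):
    """Build (but do not send) the request the proxy would forward.

    Raises PermissionError when the caller lacks the assumed permission.
    """
    if REQUIRED_PERMISSION not in user_permissions:
        raise PermissionError("user may not query the triplestore")
    # SPARQL Protocol: a query may be sent as a URL-encoded form field named "query".
    body = urlencode({"query": query}).encode("utf-8")
    return Request(
        endpoint,
        data=body,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/sparql-results+json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; the sketch stops short of that so the access check and the request construction can be looked at in isolation.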

WRT permissions, I don't really know. I can see the usefulness of SELECTing but would we ever want to update the triplestore without going through the node CRUD REST interface, since updates would be associated with nodes? Maybe we need sub-use cases for write operations to the triplestore.
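If the proxy did need to refuse write operations, one rough way to do it is to look at the query form after skipping the prologue. The sketch below is an assumption on my part, not an agreed design: it uses regular expressions rather than a real SPARQL parser, and the function name is made up. It accepts a request only when the first keyword after any comments and BASE/PREFIX declarations is one of the read-only query forms.

```python
import re

# Things that may legally precede the query form: whitespace, comments,
# and BASE / PREFIX declarations.
_SKIP = re.compile(
    r"\s+|#[^\n]*|BASE\s*<[^>]*>|PREFIX\s*[^\s:]*:\s*<[^>]*>",
    re.IGNORECASE,
)
_FORM = re.compile(r"(SELECT|ASK|CONSTRUCT|DESCRIBE)\b", re.IGNORECASE)


def query_form(query):
    """Return the read-only query form in upper case, or None for anything else."""
    pos = 0
    while True:
        match = _SKIP.match(query, pos)
        if match is None:
            break
        pos = match.end()
    form = _FORM.match(query, pos)
    return form.group(1).upper() if form else None


def is_select(query):
    """True only for SELECT queries; updates such as INSERT or DELETE yield False."""
    return query_form(query) == "SELECT"
```

Because the SPARQL Protocol sends updates in a separate `update` parameter, forwarding only the `query` parameter already keeps update requests away from the triplestore; a check like this would be a second line of defence rather than the only one.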

ajs6f commented 6 years ago

This reminds me a great deal of Fedora 2/3's Resource Index endpoint. The idea behind that was to provide a layer of indirection around triplestores so as to let clients be agnostic to the specific triplestore in use. This was pre-SPARQL, so there wasn't an obvious choice as to triplestore API/query language; now there is, of course.

So one question I have about this idea is: is the use case to be able to fire arbitrary queries at the store? Or specific queries that are already known at the time of deployment (e.g. specific queries to support specific applications)? In the former case, would something like Linked Data Fragments fit the bill? (It's much more tractable than a full-range SPARQL Query endpoint, and often easier to secure.) In the latter case, is what's needed here a mapping from a set of given predefined queries to a set of endpoints where the results of those queries can be found? There are toolkits for that sort of work.

whikloj commented 6 years ago

@ajs6f I am not well versed (or really versed at all) in Linked Data Fragments, could you give it a 10,000ft view and/or provide a good starting point? I'm guessing that @mjordan would prefer to be able to send any query to the triplestore to support his development work.

ajs6f commented 6 years ago

http://linkeddatafragments.org/

It's a limited form of query language. It doesn't offer all the bells and whistles of full SPARQL Query, and because of that, it is much easier to impl (Trellis does it out of the box) and it's much harder to write queries in LDF that will blow up and destroy the server.

DiegoPino commented 6 years ago

Although I'm a big advocate of LDF (it would not be useful for the update operations @mjordan refers to, but it could be good for speeding up queries), I'm not aware of any PHP clients in the wild; the most popular ones are JS. What would be the benefit of using Drupal to expose the server in that case? Or are you thinking of writing something in Drupal that can also make use of LDF and display results inline or expose them via an API? http://linkeddatafragments.org/software/#client

ajs6f commented 6 years ago

@DiegoPino I wasn't suggesting that CLAW act as a client against an LDF store-- I was suggesting that, rather than re-expose full SPARQL from the triplestore, CLAW would impl LDF (since it's very easy to translate LDF to SPARQL, which is far more expressive). It's not a complete solution to @mjordan's use case, it just ameliorates some of the dangers and problems of opening endpoints for arbitrary queries (e.g. I can easily write a query that will crush an arbitrarily large SPARQL engine).
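As a rough illustration of that translation, a triple-pattern request (optional subject, predicate, object, plus a page number) maps onto a single-pattern SPARQL query with LIMIT and OFFSET. This is a sketch under assumptions: the function names and the page-size default are made up, only IRIs are handled (literal objects are left out to keep it short), and without an ORDER BY the paging is not guaranteed to be stable.

```python
_FORBIDDEN = set('<>"{}|^`\\ \n\t')


def _term(value, variable):
    """Return `value` as an IRI term, or `variable` when the slot is unbound."""
    if value is None:
        return variable
    if any(ch in _FORBIDDEN for ch in value):
        raise ValueError(f"not a usable IRI: {value!r}")
    return f"<{value}>"


def pattern_to_sparql(subject=None, predicate=None, obj=None, page=1, page_size=100):
    """Translate one triple pattern and a 1-based page number into SPARQL."""
    page, page_size = int(page), int(page_size)
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    s = _term(subject, "?s")
    p = _term(predicate, "?p")
    o = _term(obj, "?o")
    offset = (page - 1) * page_size
    return (
        f"CONSTRUCT {{ {s} {p} {o} }} WHERE {{ {s} {p} {o} }} "
        f"LIMIT {page_size} OFFSET {offset}"
    )
```

The point of the restriction is that every request the store sees has this one shape, so the cost of a request is bounded by the page size rather than by whatever the client chose to write.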

whikloj commented 6 years ago

So, if I am understanding this correctly, this could be a way to give a simple interface to a triplestore (if you use the Triple Pattern Fragments interface).

So a response has to return some of the triples, an estimated count of the total number of triples, and some metadata (controls) for altering your query.
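To check my understanding of that response shape, here is a toy version over an in-memory list of triples. It is only a sketch: the function and key names are made up, the count here is exact whereas a real server may return an estimate, and a real response would carry hypermedia controls rather than a bare page number.

```python
def fragment(triples, subject=None, predicate=None, obj=None, page=1, page_size=2):
    """Return one page of matching triples plus a count and a next-page hint."""
    wanted = (subject, predicate, obj)
    matches = [
        t for t in triples
        if all(want is None or want == got for want, got in zip(wanted, t))
    ]
    start = (page - 1) * page_size
    has_next = start + page_size < len(matches)
    return {
        "triples": matches[start:start + page_size],  # the data for this page
        "total": len(matches),                         # count of all matches
        "next_page": page + 1 if has_next else None,   # stand-in for a control
    }
```

A client walks the pages by following `next_page` until it is `None`, which is roughly how a fragments client pulls a large result set in small pieces.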

I'd like to hear what @mjordan thinks about this, but it seems like a less dangerous way to provide direct access to the triplestore.

I see there is a server implementation in PHP (https://github.com/tdt/triples). It seems to store the triples in MySQL which could be problematic but could be a starting point.

mjordan commented 6 years ago

Sorry to be out of the loop, was in a dentist's chair most of the morning, been recovering since.

I need to learn more about LDF, but my intent was full SELECT capabilities against the triplestore. Some specific tasks that I might want to accomplish include:

pointing http://hdlab.stanford.edu/palladio/ at my endpoint
query the triplestore to get all nodes in a collection and its subcollections
query the triplestore to get all binary files bigger than 1 GB (assumes size property is stored)
query the triplestore to get a list of new nodes added yesterday
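To give a flavour of two of these, here is a sketch of the queries built as strings in Python. The predicates are assumptions on my part (`pcdm:memberOf` for collection membership, `schema:dateCreated` for creation time), as are the function names; whether the triplestore actually holds these properties, and with which datatypes, would need checking. I've left the file-size query out since it depends on the size property being stored at all.

```python
from datetime import date, timedelta

# Assumed vocabularies; adjust to whatever is actually indexed in the triplestore.
PREFIXES = """PREFIX pcdm: <http://pcdm.org/models#>
PREFIX schema: <http://schema.org/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
"""


def members_query(collection_iri):
    """All nodes in a collection and its subcollections, via a property path."""
    return PREFIXES + f"SELECT ?node WHERE {{ ?node pcdm:memberOf+ <{collection_iri}> }}"


def created_yesterday_query(today):
    """Nodes whose (assumed) creation timestamp falls on the day before `today`."""
    start = today - timedelta(days=1)
    return PREFIXES + (
        "SELECT ?node WHERE { ?node schema:dateCreated ?created . "
        f'FILTER (?created >= "{start.isoformat()}T00:00:00Z"^^xsd:dateTime && '
        f'?created < "{today.isoformat()}T00:00:00Z"^^xsd:dateTime) }}'
    )
```

The `+` in `pcdm:memberOf+` is a SPARQL 1.1 property path, which is what makes the subcollection case a one-liner and is the sort of thing a triple-pattern-only interface would leave to the client.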

Some of these things might be possible using Views, but with 7.x I've had good luck using its REST interface, with the accompanying Solr endpoint, to derive repository-management lists using external (non-Drush) scripts.

To respond to @ajs6f's warning about being able to issue a query that brings the triplestore to its knees: granted, that would be possible, but I don't think that possibility warrants only providing a subset of queriability. Judicious use of access controls provided by Drupal would mitigate that risk, I'd hope. Also, I would see this proxy (or whatever) module as optional, so it's not like it would be accessible in every instance of Islandora.

Maybe there's room for both an LDF interface and an open SPARQL endpoint, assuming that both could be implemented as contrib modules.

Islandora / documentation

Query the Islandora triplestore without having direct access to the triplestore #807