blazegraph / database

Blazegraph High Performance Graph Database
GNU General Public License v2.0

Accessing blank nodes through the REST API #129

Open pchampin opened 5 years ago

pchampin commented 5 years ago

I'm trying to build a Python wrapper around the REST API of Blazegraph. My problem is that blank node identifiers returned by GETSTMTS (https://wiki.blazegraph.com/wiki/index.php/REST_API#GETSTMTS) cannot be injected back into GETSTMTS to get more info on that node. GETSTMTS accepts only URIs and literals.

That's a problem because it makes it impossible to navigate freely through a graph that contains blank nodes. Would it be possible to allow blank node labels in the parameters of GETSTMTS?

I imagine that it would be tricky when inference is involved, which may create "transient" blank nodes. Alternatively, would it be possible to have an option to skolemize (https://www.w3.org/TR/rdf11-concepts/#section-skolemization) the blank nodes that are statically stored in the database, so that they can be queried like any other node?

thompsonbry commented 5 years ago

This might be relevant:

The option is defined by AbstractTripleStore:

/**

pchampin commented 5 years ago

Thanks @thompsonbry, it does look like it indeed.

So I created a namespace with this option set to true. When I insert bnodes, they have different-looking IDs (genid-d729b1f2198e4d4fa7b63477c65def4b-a instead of t30), so it changes something.

But unfortunately I still can't query them with the REST API... I still get an error 400 Bad Request when passing a bnode ID to GETSTMTS...

thompsonbry commented 5 years ago

This might be how you are forming the parameters. The REST API doc should specify the appropriate means to quote things.

Do you have an error message from the server log or the HTTP request?

thompsonbry commented 5 years ago

Also, please provide a curl command which replicates the problem. There are examples on that same API page.

pchampin commented 5 years ago

Here is the command line I use

curl 'http://localhost:9999/blazegraph/namespace/test-told-bnodes/sparql?GETSTMTS&s=_:genid-2dfec253ba26d54b1a9d13d87c46d8a9a12d-x' -i

and the result I get

HTTP/1.1 400 Bad Request
Content-Type: text/plain; charset=ISO-8859-1
Transfer-Encoding: chunked
Server: Jetty(9.2.z-SNAPSHOT)

_:genid-2dfec253ba26d54b1a9d13d87c46d8a9a12d-x

No error message on the server log.

> This might be how you are forming the parameters.

I thought about that and tried different variants, to no avail. In particular, I tried wrapping the bnode ID in angle brackets, à la URI. Then I get a 200 OK response, because this is syntactically correct, but no triple is returned, although this bnode is involved in some triples.
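To rule out a quoting problem on the client side, here is a minimal Python sketch of how the GETSTMTS URL might be assembled with every term percent-encoded. The helper name `getstmts_url` and the example IRI `http://example.org/alice` are hypothetical; the endpoint and the bnode label are the ones from the curl command above. It only builds the URL and does not change what the server accepts.

```python
from urllib.parse import quote

def getstmts_url(endpoint, s=None, p=None, o=None):
    """Build a GETSTMTS URL; bound terms are percent-encoded as given."""
    parts = ["GETSTMTS"]
    for name, term in (("s", s), ("p", p), ("o", o)):
        if term is not None:
            # safe='' also encodes '/', ':', '<' and '>' inside the term
            parts.append(f"{name}={quote(term, safe='')}")
    return f"{endpoint}?{'&'.join(parts)}"

ENDPOINT = "http://localhost:9999/blazegraph/namespace/test-told-bnodes/sparql"

# A URI term, wrapped in angle brackets (hypothetical IRI).
uri_form = getstmts_url(ENDPOINT, s="<http://example.org/alice>")

# The bnode label from this thread; the server answered 400 to this form.
bnode_form = getstmts_url(
    ENDPOINT, s="_:genid-2dfec253ba26d54b1a9d13d87c46d8a9a12d-x"
)
```

If the encoded bnode form is still rejected while the URI form works, the limitation is in the server-side parameter parsing rather than in the quoting.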

thompsonbry commented 5 years ago

(Olaf: see question below about LDF and blank nodes.)

Ok. Can you file a JIRA ticket for this? I will pull up the code for the parameter parsing and take a look. I did look again at the documentation and (counter to my memory) it indicates only URI or Literal for the access path based APIs.

One last thought. Have you considered implementing your wrapper using DESCRIBE? Describe supports a number of semantics, including SCBD (symmetric concise bounded description). In fact, that might be why the told bnode semantics are not supported here.

Using Concise Bounded Description any blank nodes are automatically traversed. This is basically an iterative process internally. The result will always consist solely of URIs and Literals. Symmetric CBD does the same thing, but the traversal considers not only the property set and link set of the subject, but also the incoming edge list. Hence symmetric.

The whole purpose of CBD is to find a description of a given subject in which all blank nodes have been resolved to ground data. The result does contain internal blank nodes if there were blank nodes in the data, but the data for those blank nodes has already been preresolved.
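To make the iterative traversal concrete, here is a simplified in-memory sketch in Python. It is not Blazegraph's implementation: terms are plain strings, blank nodes are assumed to be the strings starting with `_:`, reification is ignored, and the function name `describe` and the `ex:` data are made up for illustration.

```python
def is_bnode(term):
    """Assumption of this sketch: blank nodes are strings starting with '_:'."""
    return term.startswith("_:")

def describe(triples, start, symmetric=False):
    """Collect the statements about `start`, expanding blank nodes iteratively."""
    result = set()
    visited = set()
    frontier = [start]
    while frontier:
        node = frontier.pop()
        if node in visited:
            continue
        visited.add(node)
        for s, p, o in triples:
            if s == node:
                # outgoing edge; keep expanding through blank node objects
                result.add((s, p, o))
                if is_bnode(o):
                    frontier.append(o)
            if symmetric and o == node:
                # incoming edge; keep expanding through blank node subjects
                result.add((s, p, o))
                if is_bnode(s):
                    frontier.append(s)
    return result

triples = [
    ("ex:alice", "ex:address", "_:b1"),
    ("_:b1", "ex:city", '"Lyon"'),
    ("ex:bob", "ex:knows", "ex:alice"),
    ("ex:bob", "ex:name", '"Bob"'),
]
cbd = describe(triples, "ex:alice")
scbd = describe(triples, "ex:alice", symmetric=True)
```

Here `cbd` contains the address statement plus the city statement reached through `_:b1`, and `scbd` additionally contains the incoming `ex:knows` statement from `ex:bob`.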

You can set up BG (Blazegraph) to use the desired DESCRIBE semantics by default or simply provide a query hint with the request indicating which kind of DESCRIBE you want.

I am curious how linked data fragments deals with blank nodes. Adding Olaf to the Cc for that.

Bryan

pchampin commented 5 years ago

I'm assuming that by LDF, you are referring to Triple Pattern Fragments. They do not support bnodes either (http://www.hydra-cg.com/spec/latest/triple-pattern-fragments/#definition). I think their main argument is "this is linked data, you cannot link to bnodes, so don't use bnodes".

Before I file a ticket on JIRA, I'll post another more general issue about the "told bnodes" mode. I need to better understand it before going further.

thompsonbry commented 5 years ago

Ok. But do take a look at the (S)CBD DESCRIBE mode. This might solve your problem without breaking the semantics of blank nodes.

See https://wiki.blazegraph.com/wiki/index.php/LinkedData for Blazegraph linked data options, including (S)CBD, how to configure this for a given namespace, and how to use the relevant query hints.

Also see https://wiki.blazegraph.com/wiki/index.php/QueryHints for the query hints related to describe:

describeMode (scope: Query): Specify the algorithm for a DESCRIBE query. Values: SymmetricOneHop | CBD | SCBD. Default: SymmetricOneHop.

describeIterationLimit (scope: Query): Specify the maximum number of iterations for an iterative DESCRIBE algorithm (CBD, SCBD), or ZERO (0) for no limit. Note that BOTH the iterations and statements limits must be reached before a DESCRIBE query will be terminated. Type: xsd:int. Default: 5.

describeStatementLimit (scope: Query): Specify the maximum number of statements in a DESCRIBE query result for an iterative DESCRIBE algorithm (CBD, SCBD), or ZERO (0) for no limit. Note that BOTH the iterations and statements limits must be reached before a DESCRIBE query will be terminated. Type: xsd:int. Default: 5000.
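As a rough illustration for a Python client, the sketch below builds a DESCRIBE query carrying the describeMode hint. The `hint:` namespace and the placement of the hint inside the WHERE clause are assumptions based on how Blazegraph query hints are usually written, so check them against the QueryHints page; `describe_query` and the example IRI are hypothetical.

```python
# Assumed namespace for Blazegraph query hints; verify against the wiki.
HINT_NS = "http://www.bigdata.com/queryHints#"

DESCRIBE_MODES = {"SymmetricOneHop", "CBD", "SCBD"}

def describe_query(iri, mode="CBD"):
    """Return a DESCRIBE query string that requests the given describe mode."""
    if mode not in DESCRIBE_MODES:
        raise ValueError(f"unknown describe mode: {mode}")
    return (
        f"PREFIX hint: <{HINT_NS}>\n"
        f"DESCRIBE <{iri}> WHERE {{\n"
        f'  hint:Query hint:describeMode "{mode}" .\n'
        f"}}"
    )

query = describe_query("http://example.org/alice", mode="SCBD")
```

The resulting string would be sent as the `query` parameter of an ordinary SPARQL request to the namespace endpoint.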

Bryan

pchampin commented 5 years ago

Thanks, I'll have a look, but I don't think it'll solve my problem. I don't only need to get the triples around the bnodes, but also to be able to alter them.

My use case is to use Blazegraph as a backend of a Python RDF model, so somehow I need to break the blank node semantics, because I need to be able to handle bnodes just like any other node.

The alternative would be to handle this in my Python code, doing the (de)skolemization there and exchanging only URIs and literals with Blazegraph. But that has other drawbacks...
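A minimal sketch of what that client-side (de)skolemization could look like, following the `/.well-known/genid/` convention from RDF 1.1 Concepts. The authority `http://example.org` and the function names are placeholders, not anything Blazegraph provides.

```python
GENID_PATH = "/.well-known/genid/"

def skolemize(label, authority="http://example.org"):
    """Turn a blank node label such as '_:b1' into a Skolem IRI."""
    if not label.startswith("_:"):
        raise ValueError(f"not a blank node label: {label}")
    return f"{authority}{GENID_PATH}{label[2:]}"

def deskolemize(iri, authority="http://example.org"):
    """Turn a Skolem IRI minted by skolemize() back into a blank node label.

    Any other IRI is returned unchanged.
    """
    prefix = authority + GENID_PATH
    if iri.startswith(prefix):
        return "_:" + iri[len(prefix):]
    return iri

skolem = skolemize("_:b1")
label = deskolemize(skolem)
```

With this approach the store only ever sees IRIs, so GETSTMTS can address every node, at the cost of the client having to own the mapping.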

hartig commented 5 years ago

Yes, I can confirm that in the TPF work we have explicitly ignored blank nodes and, thus, the TPF interface does not support them. The reason has nothing to do with "linking to blank nodes" but simply that blank node labels are something specific to a serialization. The underlying data source / storage component of a TPF server may not even maintain specific labels for blank nodes. As a consequence, we cannot assume that some blank node label in a serialization of a TPF response may later be used to find the corresponding blank node in the store.

pchampin commented 5 years ago

@hartig

> The underlying data source / storage component of a TPF server may not even maintain specific labels for blank nodes

Oh, right! I didn't think about that. So indeed, TPF (as an abstract API) cannot rely on bnode identity.

OTOH, I'm assuming that Blazegraph has stable identifiers for the bnodes it stores. So it would be nice if its own specific API allowed handling them explicitly -- even if that involved skolemization, so that formally, only IRIs would be exchanged.

thompsonbry commented 5 years ago

I think there is a broader problem here with how to update SPARQL endpoints / RDF databases. SPARQL of course has the DELETE/INSERT WHERE, which is patterned after SQL. This is the only mechanism which allows an application to consider what is in the database at the same time as it applies an update. This is quite important for transactional workloads. Without that, you need to read the database state, figure out the delta in the application, and then ship the delta to the database. To make that atomic, you need to have a single writer pattern.
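The read / compute delta / ship delta pattern can be sketched in Python as follows. Triples are assumed to be tuples of terms already written in N-Triples syntax, and the function names are hypothetical. Note that SPARQL 1.1 Update does not permit blank nodes in DELETE DATA, which is one concrete place where this pattern runs into the blank node problem.

```python
def compute_delta(current, desired):
    """Return (to_delete, to_insert) as sets of triples."""
    current, desired = set(current), set(desired)
    return current - desired, desired - current

def delta_to_update(to_delete, to_insert):
    """Render a delta as a single SPARQL UPDATE request string."""
    ops = []
    if to_delete:
        body = " ".join(f"{s} {p} {o} ." for s, p, o in sorted(to_delete))
        ops.append(f"DELETE DATA {{ {body} }}")
    if to_insert:
        body = " ".join(f"{s} {p} {o} ." for s, p, o in sorted(to_insert))
        ops.append(f"INSERT DATA {{ {body} }}")
    return " ;\n".join(ops)

A = "<http://example.org/a>"
P = "<http://example.org/p>"
current = {(A, P, '"1"')}   # state read from the database
desired = {(A, P, '"2"')}   # state the application wants
to_delete, to_insert = compute_delta(current, desired)
update = delta_to_update(to_delete, to_insert)
```

Nothing here protects against another writer changing the data between the read and the update, which is why the single writer pattern is needed.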

However, SPARQL UPDATE is not well suited to dealing with graphs, only with triples and quads. And there is no means readily available for dealing with blank nodes. In particular, there is no means available to compute the CBD from within SPARQL UPDATE, which makes it difficult to operate in terms of RDF “molecules” - things which more or less correspond to descriptions of objects and which do not require clients to traverse blank nodes on a remote server - something which the various specs straight out do not support given blank node semantics. However, if we were able to process the data within a single transaction on the server the blank node labels would be stable within that context.

There are various proposals for managing a single transaction which involves several requests, but the exceedingly high latency of the round trips between the client and the server will decrease the available concurrency due to the increasing likelihood of write-write conflicts over such slow transactions.

A lot of applications try to deal with this using named graphs as the container for the subgraph which they then drop/add within a transaction. That’s fine, but the granularity of named graphs is then restricted to a single purpose leaving out other equally useful applications for containers.
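The named graph approach might look like the following Python sketch, which builds one SPARQL UPDATE request that drops the container graph and re-inserts its contents. The function name and the example IRIs are hypothetical, and whether the two operations are applied atomically depends on the store, so that should be verified for the deployment at hand.

```python
def replace_graph_update(graph_iri, triples):
    """Build an update that replaces the whole content of one named graph.

    Terms in `triples` are expected to be in N-Triples syntax already.
    """
    body = "\n".join(f"    {s} {p} {o} ." for s, p, o in triples)
    return (
        f"DROP SILENT GRAPH <{graph_iri}> ;\n"
        f"INSERT DATA {{\n"
        f"  GRAPH <{graph_iri}> {{\n"
        f"{body}\n"
        f"  }}\n"
        f"}}"
    )

graph_update = replace_graph_update(
    "http://example.org/g",
    [("<http://example.org/a>", "<http://example.org/p>", '"1"')],
)
```

Blank nodes are allowed inside INSERT DATA, so the subgraph can be re-created with its bnodes each time without the client ever needing their stored labels.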

All in all, I think we need some better interfaces for managing RDF data.

Bryan
