CredentialEngine / CredentialRegistry

Repository for development of the Credential Registry
Apache License 2.0

Neptune needs to retain the URIs assigned to blank nodes #297

Closed siuc-nate closed 4 years ago

siuc-nate commented 4 years ago

Currently, for reasons we haven't yet determined, when a record like the one below is added to Neptune, the database replaces the _:someguid URI with an automatically generated one like b12345. However, it appears to do this only to the @id field of the bnode. This causes problems when querying blank nodes, because the main record still references the original _:someguid URI, which no longer matches any node.

{
  "@graph": [
    { "@type": "ceterms:Credential", "ceterms:accreditedBy": [ "_:someguid", "https://.../ce-123" ] },
    { "@type": "ceterms:QACredentialOrganization", "@id": "_:someguid" }
  ]
}

If this cannot be solved directly (the registry team is working with AWS support), then we will need to explore some kind of workaround that won't break the data above (i.e. a query must work regardless of whether the URI is a bnode URI or a true URI).

stuartasutton commented 4 years ago

That's interesting, but I'd think it's really not inappropriate. If you move an RDF description from one context to another, switching serializations in tools like the easyrdf converter or the RDF translator, the systems change the bnode identifiers every time. They aren't maintained but change from translation to translation. I would not be surprised to hear that if you want "_:c7084470-7c0c-4397-bb18-5a4223fd9c64" to have any persistence, then you shouldn't make it a blank node.

siuc-nate commented 4 years ago

The problem is that the database should not be arbitrarily modifying the data that you put into it, at least not without giving you some kind of control to choose how it does so. The data effectively starts out as:

{
  "@graph": [
    { "@type": "ceterms:Credential", "ceterms:accreditedBy": [ "_:someguid", "https://.../ce-123" ] },
    { "@type": "ceterms:QACredentialOrganization", "@id": "_:someguid" }
  ]
}

and gets transformed into:

{
  "@graph": [
    { "@type": "ceterms:Credential", "ceterms:accreditedBy": [ "_:someguid", "https://.../ce-123" ] },
    { "@type": "ceterms:QACredentialOrganization", "@id": "b12345" }
  ]
}

which means the credential can no longer reference the blank node, since _:someguid ceases to point to anything. If that transformation flowed back into the credential record such that it became:

{
  "@graph": [
    { "@type": "ceterms:Credential", "ceterms:accreditedBy": [ "b12345", "https://.../ce-123" ] },
    { "@type": "ceterms:QACredentialOrganization", "@id": "b12345" }
  ]
}

then at least the connection would still work, even if it's still bad for a database to change what you put into it without you being able to override it.
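The broken link described above can be detected mechanically. The following is a minimal sketch, not part of the registry code: the function name `find_dangling_bnode_refs` is hypothetical, and it assumes that top-level `@id` values in `@graph` define nodes while any other string starting with `_:` is a reference.

```python
def find_dangling_bnode_refs(doc):
    """Return bnode identifiers that are referenced but never defined.

    Assumption: a node is "defined" by the @id of a top-level @graph entry;
    every other string starting with "_:" counts as a reference.
    """
    graph = doc.get("@graph", [])
    defined = {node["@id"] for node in graph if "@id" in node}
    referenced = set()

    def collect(value):
        # Walk strings, lists, and nested objects looking for "_:" identifiers.
        if isinstance(value, str):
            if value.startswith("_:"):
                referenced.add(value)
        elif isinstance(value, list):
            for item in value:
                collect(item)
        elif isinstance(value, dict):
            for item in value.values():
                collect(item)

    for node in graph:
        for key, value in node.items():
            if key != "@id":  # the top-level @id is the definition itself
                collect(value)

    return sorted(referenced - defined)


original = {
    "@graph": [
        {"@type": "ceterms:Credential",
         "ceterms:accreditedBy": ["_:someguid", "https://.../ce-123"]},
        {"@type": "ceterms:QACredentialOrganization", "@id": "_:someguid"},
    ]
}
transformed = {
    "@graph": [
        {"@type": "ceterms:Credential",
         "ceterms:accreditedBy": ["_:someguid", "https://.../ce-123"]},
        {"@type": "ceterms:QACredentialOrganization", "@id": "b12345"},
    ]
}
```

Under these assumptions, `find_dangling_bnode_refs(original)` returns an empty list, while `find_dangling_bnode_refs(transformed)` reports `_:someguid` as a reference with no matching node.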

stuartasutton commented 4 years ago

That's a different matter. If "_:someguid" gets changed to "b12345", it should be changed to "b12345" throughout the graph...I would think.

siuc-nate commented 4 years ago

Correct, but it would be preferable for it not to be changed at all.

excelsior commented 4 years ago

This is the info I got from the AWS developer forum.

Neptune doesn't seem to allow retaining original values of blank nodes by design as it follows the RDF spec in this regard:

Blank node identifiers are local identifiers that are used in some concrete RDF syntaxes or RDF store implementations. They are always locally scoped to the file or RDF store, and are not persistent or portable identifiers for blank nodes.

https://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/#section-blank-nodes

In much the same way as with named subqueries, Blazegraph—Neptune's predecessor—used to offer such settings:

https://github.com/blazegraph/database/issues/24#issuecomment-258022007

So, if a resource references another resource by a blank node within the same graph, it should be possible to keep that reference even despite the blank node value being overwritten. However, if such references exist across separate graphs, the connections will be 100% broken. In that case, we'll need to make sure blank nodes are replaced with dummy IRIs universally during indexing.

stuartasutton commented 4 years ago

@excelsior, above in https://github.com/CredentialEngine/CredentialRegistry/issues/297#issuecomment-583049089, @siuc-nate says that the bnode IDs are being overwritten inconsistently within the same graph which runs counter to the text from https://github.com/blazegraph/database/issues/24#issuecomment-258022007 above.

@siuc-nate's example:

{
  "@graph": [
    { "@type": "ceterms:Credential", "ceterms:accreditedBy": [ "_:someguid", "https://.../ce-123" ] },
    { "@type": "ceterms:QACredentialOrganization", "@id": "_:someguid" }
  ]
}

and gets transformed into:

{
  "@graph": [
    { "@type": "ceterms:Credential", "ceterms:accreditedBy": [ "_:someguid", "https://.../ce-123" ] },
    { "@type": "ceterms:QACredentialOrganization", "@id": "b12345" }
  ]
}

siuc-nate commented 4 years ago

Thanks for the reference. I think the difficulty comes from us using Neptune as an index rather than a true main data store, so we would need that post's "told bnodes" functionality to make things work properly.

As a workaround, I agree that it seems to be necessary to do something else with the identifiers. It sounds like the only way to do it is, as you suggest, to replace the bnode identifiers across the entire JSON-LD @graph with something that has a (fake) namespace so SPARQL sees it as just another URI. As long as we don't change it in the credreg:__payload strings (for either the parent object or the bnodes themselves), that might be doable - we'd just have to use the namespaced URIs any time a query needs to look up bnodes, and that should be doable with simple string substitution.
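A minimal sketch of that substitution follows. It is an illustration only: the function name `namespace_bnodes` and the placeholder namespace `https://example.org/bnodes/` are assumptions, and the sketch treats every string value that starts with `_:` as a bnode identifier, which could also catch an ordinary literal that happens to start that way.

```python
BNODE_NAMESPACE = "https://example.org/bnodes/"  # hypothetical placeholder namespace
PAYLOAD_KEY = "credreg:__payload"                # left untouched, as described above


def namespace_bnodes(value, namespace=BNODE_NAMESPACE):
    """Return a copy of a JSON-LD value with "_:x" identifiers turned into URIs.

    Values stored under PAYLOAD_KEY are copied through unchanged.
    """
    if isinstance(value, str):
        # "_:someguid" -> "<namespace>someguid"
        return namespace + value[2:] if value.startswith("_:") else value
    if isinstance(value, list):
        return [namespace_bnodes(item, namespace) for item in value]
    if isinstance(value, dict):
        return {
            key: item if key == PAYLOAD_KEY else namespace_bnodes(item, namespace)
            for key, item in value.items()
        }
    return value  # numbers, booleans, None


record = {
    "@graph": [
        {"@type": "ceterms:Credential",
         "ceterms:accreditedBy": ["_:someguid", "https://.../ce-123"],
         "credreg:__payload": "_:someguid appears in this raw payload string"},
        {"@type": "ceterms:QACredentialOrganization", "@id": "_:someguid"},
    ]
}
indexed = namespace_bnodes(record)
```

Because the reference and the `@id` are rewritten by the same rule, they stay connected after indexing, and the input record itself is not modified.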

Is there any way to enable such a "told bnodes" setting in AWS Neptune?

I found a related post (also about Blazegraph) that shows the setting's name: https://github.com/blazegraph/database/issues/129#issuecomment-477632311

siuc-nate commented 4 years ago

Has there been any progress on this? I'm still seeing integer bnodes in my result data. For example:

PREFIX ceterms: <https://purl.org/ctdl/terms/>
SELECT * WHERE {
  ?s ceterms:ctid 'ce-10372125-59ee-4c8c-a31d-6df55f5fa9ae' .
  ?s ceterms:approvedBy ?bnode .
}

Shows a bnode with an id of b78283279.

In addition, trying to look at the properties of that bnode either indirectly or directly shows no data:

PREFIX ceterms: <https://purl.org/ctdl/terms/>
SELECT * WHERE {
  ?s ceterms:ctid 'ce-10372125-59ee-4c8c-a31d-6df55f5fa9ae' .
  ?s ceterms:approvedBy ?bnode .
  ?bnode ?p ?o .
}

PREFIX ceterms: <https://purl.org/ctdl/terms/>
SELECT * WHERE { <b78283279> ?p ?o }

However, that bnode should have several fields: https://credentialengineregistry.org/graph/ce-10372125-59ee-4c8c-a31d-6df55f5fa9ae

excelsior commented 4 years ago

A bnode ID (_:<UUID>) gets converted into a special URI (https://credreg.net/bnodes/<UUID>) during the indexing process. Regular blank nodes are left intact.
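The rule above can be sketched as follows, for example to rewrite identifiers on the query side. This is an illustration under assumptions, not the registry's indexing code: the names `bnode_to_uri` and `rewrite_query` are hypothetical, the UUID is assumed to be carried over verbatim, and "regular" blank nodes are assumed to be any `_:` labels that are not UUID-shaped.

```python
import re

BNODE_URI_PREFIX = "https://credreg.net/bnodes/"

# Matches "_:" followed by a UUID; the lookahead rejects longer labels
# that merely begin with a UUID-shaped prefix.
UUID_BNODE = re.compile(
    r"_:([0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})(?![0-9A-Za-z_-])"
)


def bnode_to_uri(identifier):
    """Map "_:<UUID>" to the bnode URI; return any other identifier unchanged."""
    match = UUID_BNODE.fullmatch(identifier)
    return BNODE_URI_PREFIX + match.group(1) if match else identifier


def rewrite_query(query):
    """Replace every "_:<UUID>" label in a SPARQL string with an <IRI> term.

    This is plain string substitution, so it would also touch a matching
    label inside a string literal.
    """
    return UUID_BNODE.sub(
        lambda m: "<" + BNODE_URI_PREFIX + m.group(1) + ">", query
    )


uri = bnode_to_uri("_:c7084470-7c0c-4397-bb18-5a4223fd9c64")
untouched = bnode_to_uri("_:b0")
query = rewrite_query(
    "SELECT * WHERE { _:c7084470-7c0c-4397-bb18-5a4223fd9c64 ?p ?o }"
)
```

With this sketch, a UUID-style label becomes an IRI that SPARQL can match directly, while a label such as `_:b0` is passed through as an ordinary blank node.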

The reason the first query returned incorrect results was an accidentally restored old dump, as I described in the email.