UUDigitalHumanitieslab / EDPOP

A virtual research environment (VRE) that lets you collect, align and annotate bibliographical and biographical records from several online catalogs.
BSD 3-Clause "New" or "Revised" License
1 stars 0 forks source link

replace_blank_node must go #172

Closed jgonggrijp closed 2 months ago

jgonggrijp commented 2 months ago

I just realized there is a critical problem with the piece of code below.

https://github.com/UUDigitalHumanitieslab/EDPOP/blob/f7de000426e60a4ba7d9753d62777d8a7620a526/backend/triplestore/utils.py#L74-L80

When I commented on it in #163, @tijmenbaarda stated that this function was necessary because Blazegraph did not support blank nodes, so we left it in.

At first sight, this function just replaces real blank nodes by an emulation of blank nodes. However, it actually introduces a risk of collision between unrelated nodes. Real blank nodes are intrinsically distinct; they can only be identified if they occur within the same representation. The emulation replacement, on the other hand, produces URIs consisting of a small random number embedded in a fixed string. It is only a matter of time before the same random number is reused. The duplicates will be generated on different dates, possibly by different WSGI workers, and most likely in the process of serializing different queries. This will compromise data integrity.

As for the argument that Blazegraph does not support blank nodes: this should not be true. Blank nodes are too essential to leave them out of any serious RDF implementation. In fact, the Blazegraph wiki mentions support for blank nodes and even an alternative mode of supporting them.

@tijmenbaarda, could you retrace the origin of your belief that Blazegraph does not support blank nodes? I'm sure there is a real problem that needs to be addressed, but we must find an approach that does not involve replace_blank_node or anything similar.

lukavdplas commented 2 months ago

Minimum non-working example:

from django.conf import settings
from rdflib import Graph, BNode

def test_save_blank_node():
    store = settings.RDFLIB_STORE
    graph = Graph(store)
    triple = BNode(), BNode(), BNode()
    graph.add(triple)

Add this as a unit test, run using pytest to see the error message.

jgonggrijp commented 2 months ago

Good thinking, @lukavdplas! Ironically, that error message links to a section of the SPARQL specification that explains how blank nodes should be implemented. It does not say anywhere that they should be omitted or unsupported. The grammar section is even quite explicit about blank nodes being allowed; there are a few corner cases where blank node labels are not allowed in SPARQL updates, but none of those cases is occurring in this test and even in those corner cases, blank nodes can still be referenced by variables. Storing and retrieving blank nodes is by all means allowed, according to the standard.

Your test inspired me to try some things in the Blazegraph UI. This SPARQL update is not allowed:

insert data {
  [] [] [].
}

but this one is:

insert data {
  [] a [].
}

and I am also able to retrieve that triple afterwards. The reason it was not allowed with a blank node in the middle, is that the predicate of a triple must always be an IRI.

Of course, that prompted me to retry your test with RDF.type as the predicate. It gives the same error, which seems to confirm that rdflib is making a misguided decision to completely disallow blank nodes in SPARQLStore. Coincidentally, @tijmenbaarda told me last Thursday that this is also what @JeltevanBoheemen told him.

So, two possible solutions:

  1. circumvent rdflib (good for performance and correctness but greater initial cost);
  2. use proper IRIs instead of blank nodes everywhere (easier at first, but might turn out to be unsustainable).

I lean towards option 1 because I have wanted to get rid of rdflib as a middle agent for a while in READ-IT as well. Not necessarily get rid of it wholesale, just not using it as an intermediary between us and Blazegraph. Besides performance (due to superfluous parsing and serializing) and artificial restrictions like this one, rdflib also has some other quirks, such as faulty serializations and an obnoxious logical inconsistency that can cause it to consider triple1 == triple2 but triple2 != triple1 in some cases. I don't have a link to online discussion about the latter, but it is causing every second migration in READ-IT to drop the colors from the annotation ontology.

JeltevanBoheemen commented 2 months ago

Following this discussion with interest.

Of course, that prompted me to retry your test with RDF.type as the predicate. It gives the same error, which seems to confirm that rdflib is making a misguided decision to completely disallow blank nodes in SPARQLStore. Coincidentally, @tijmenbaarda told me last Thursday that this is also what @JeltevanBoheemen told him.

See rdflib docs for a -very- brief justification for not supporting blank nodes in SPARQLStore and their proposed solution, that we implemented.

  1. circumvent rdflib (good for performance and correctness but greater initial cost);

Performance-wise, absolutely. However, I still cannot say with certainty that the problems occurring in READ-IT involving rdflib are caused by bugs in said package. They might spring from a misunderstanding or misuse on my/our part. I'm in favour of picking up https://github.com/UUDigitalHumanitieslab/readit-interface/pull/500 (where the triple1 == triple2 but triple2 != triple1 problem occurs) to establish that.

  1. use proper IRIs instead of blank nodes everywhere (easier at first, but might turn out to be unsustainable).

Why would this be unsustainable?

tijmenbaarda commented 2 months ago

See rdflib docs for a -very- brief justification for not supporting blank nodes in SPARQLStore and their proposed solution, that we implemented.

Hmm, but that means that read-it basically has the same solution as what we have with replace_blank_node, i.e. converting it to a named node based on rdflib's unique identifier.

But is that really a problem? rdflib assigns unique identifiers to blank nodes using UUID4. As far as I know that means that there is (virtually) no risk of collision. The identifiers are not stable if new data is added, of course, but that should not be a problem in our case. Or am I missing something?

lukavdplas commented 2 months ago
  1. circumvent rdflib (good for performance and correctness but greater initial cost);

I can't really judge whether this is a viable solution in the long term, but within the current project, "greater initial cost" is a huge red flag. In terms of both time and budget, the project is at a stage where we should be wrapping up and focus on the shortest route to a satisfactory end product. At best, this suggestion is something that could be considered in a future project.

However, I would also argue that this application is generally plagued by solutions that favour possible long-term gains over short-term ones, so I'm wary of this kind of suggestion.

As for this particular issue, I agree with Tijmen: with sufficiently large random IDs, the risk of collision is negligible. I would argue that the two solutions you propose ultimately carry a higher risk of data corruption (due to programming errors), especially if they have to be developed short-term. So if the purpose here is to protect data integrity, it may be best to leave this as-is.

jgonggrijp commented 2 months ago

@JeltevanBoheemen Thanks for chiming in!

See rdflib docs for a -very- brief justification for not supporting blank nodes in SPARQLStore and their proposed solution,

Thanks. I think that very brief justification is very misguided. I will quote it here for clarity:

By default the SPARQL Store does not support blank-nodes!

As blank-nodes act as variables in SPARQL queries, there is no way to query for a particular blank node without using non-standard SPARQL extensions.

See http://www.w3.org/TR/sparql11-query/#BGPsparqlBNodes

It may be true that blank nodes "act as variables" in SPARQL queries, but if it is true, it is still no reason to drop support for them. Suppose that I use the following INSERT update in Blazegraph, which is both allowed and well-defined:

insert data {
    rdfs:Class a _:1.
    _:1 rdfs:subClassOf rdfs:Resource.
}

The above minigraph contains a single blank node, here labeled as _:1. Now, it is true that I cannot issue a query like the following and expect it to return only the first triple above:

construct where {
    ?s ?p _:1.
}

Blank nodes are not supposed to be uniquely identifyable like that, anyway; IRIs already address that use case. However, I can still retrieve the above minigraph faithfully, i.e., with the knowledge that the blank nodes in both triples refer to the same blank node, for example with the following query:

construct where {
    rdfs:Class a ?type.
    ?type rdfs:subClassOf rdfs:Resource.
}

which might output something like

rdfs:Class a [ rdfs:subClassOf rdfs:Resource ].

If blank nodes "act as variables", that means I can also write the query above as follows, and still get the same result:

construct where {
    rdfs:Class a _:2.
    _:2 rdfs:subClassOf rdfs:Resource.
}

If that works, then I would argue that the fact that blank nodes "act as variables" is a feature, not a reason to omit support for them from rdflib.SPARQLStore.

that we implemented.

Ah, we did that _node_to_sparql thing, right? I forgot about that.

  1. circumvent rdflib (good for performance and correctness but greater initial cost);

Performance-wise, absolutely. However, I still cannot say with certainty that the problems occurring in READ-IT involving rdflib are caused by bugs in said package. They might spring from a misunderstanding or misuse on my/our part.

No, we debugged it together. We entered the interactive Python console and were able to make rdflib produce a logical inconsistency without any of our own code being involved. This is really a bug in rdflib.

I'm in favour of picking up UUDigitalHumanitieslab/readit-interface#500 (where the triple1 == triple2 but triple2 != triple1 problem occurs) to establish that.

While I am already sure that this problem is caused by a bug in rdflib, I am certainly also open to picking up UUDigitalHumanitieslab/readit-interface#500 again.

  1. use proper IRIs instead of blank nodes everywhere (easier at first, but might turn out to be unsustainable).

Why would this be unsustainable?

Because we may eventually want to copy RDF data from external sources to our own triplestore. You cannot simply replace blank nodes by IRIs in that case, because the triples will no longer be equivalent.

@tijmenbaarda

But is that really a problem? rdflib assigns unique identifiers to blank nodes using UUID4. As far as I know that means that there is (virtually) no risk of collision. The identifiers are not stable if new data is added, of course, but that should not be a problem in our case. Or am I missing something?

No, you are right. If the labels are generated with UUID, then collision should not be likely. I based my worry on the usual convention of labeling blank nodes with small integers like _:1. Given that this is not the case here, this issue can be closed.

@lukavdplas

  1. circumvent rdflib (good for performance and correctness but greater initial cost);

I can't really judge whether this is a viable solution in the long term, but within the current project, "greater initial cost" is a huge red flag. (...) At best, this suggestion is something that could be considered in a future project.

Fair point.

However, I would also argue that this application is generally plagued by solutions that favour possible long-term gains over short-term ones, so I'm wary of this kind of suggestion.

I guess you have a point here as well.

As for this particular issue, I agree with Tijmen: with sufficiently large random IDs, the risk of collision is negligible. I would argue that the two solutions you propose ultimately carry a higher risk of data corruption (due to programming errors), especially if they have to be developed short-term. So if the purpose here is to protect data integrity, it may be best to leave this as-is.

Yes, I agree with this as well. I was not aware of rdflib using UUIDs for blank nodes until Tijmen pointed it out. Let's close this ticket and not change anything about the way blank nodes are represented.

(As for rdflib choosing to label blank nodes with UUID: I still feel that is probably based on a false assumption that blank nodes need to be uniquely identifyable. Then again, that is a moot point now.)