
CLEANLY ADD & DELETE RDF ONTOLOGIES ? #24

Open redskate opened 8 years ago

redskate commented 8 years ago

Dear blazegraph community / SYSTAP

Semweb has successfully realized several applications using Blazegraph - thank you for that. I am using the latest version of your blazegraph.jar.

What I am rather sceptical about is cleanly ADDING / DELETING RDF ontologies. An ontology - like for instance http://spec.edmcouncil.org/fibo/FND/Parties/Roles/ - contains structures which, once read into the repository, end up as blank nodes. It is important for me to add an ontology to a (now Blazegraph) repository with truth maintenance and then, on demand, delete that ontology.

Sesame 2.8 (used and customized by SYSTAP for Blazegraph) allows file input via connection.add() on a remote repository, but offers no way to delete the same RDF file through the same connection (an asymmetry of Sesame 2.8); using connection.remove(statement) does not work with a remote repository (it throws an exception). Blazegraph apparently wants to stay on Sesame 2.8 because it is what most of its customers use, and it is not even known whether SYSTAP will ever produce a Sesame 4 based Blazegraph (at least according to "official" announcements so far).

So I searched for another way to add/delete RDF ontology material to/from a Blazegraph journal: via SPARQL UPDATE. The ontology is first loaded into a Sesame model (included with Blazegraph's Sesame), then each statement is collected and written out as an N-Triples string, then an INSERT statement is built from those N-Triples, and those triples are (successfully) inserted via SPARQL UPDATE. So far so good.
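
For illustration, a minimal sketch of the kind of update my code generates (the triples here are invented; a real ontology produces many more, including blank nodes like _:b0 from serialized OWL restrictions):

INSERT DATA {
  <http://example.org/onto#Role> a <http://www.w3.org/2002/07/owl#Class> .
  <http://example.org/onto#Role> <http://www.w3.org/2000/01/rdf-schema#subClassOf> _:b0 .
  _:b0 a <http://www.w3.org/2002/07/owl#Restriction> .
  _:b0 <http://www.w3.org/2002/07/owl#onProperty> <http://example.org/onto#isPlayedBy> .
}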

THE PROBLEM: the same added triples ARE NOT DELETABLE via the same SPARQL UPDATE (simply turning "INSERT" into "DELETE" in the same SPARQL UPDATE tab of the Blazegraph workbench - the same behavior as in my Java code), because those triples contain blank nodes (says Blazegraph in an exception).

So Blazegraph seems unable to cleanly DELETE RDF material ... and this, besides the scalability and greatest performance (...) stated by your sales department, is a big problem.

How is this possible, and is it wanted/needed/planned to be this way?

However, I hope that I have simply made some mistakes along the way and that someone can help solve this issue ... Attached are the statements which are INSERTABLE but not DELETABLE.

Thank you. Regards

Some attachments to help you find a solution (please open a ticket in case you need them): insertable_notdeletable_statements.txt deletingexception_blazegraph.txt

mschmidt00 commented 8 years ago

Thanks for your interest in using Blazegraph.

First of all, please note that this is not a Blazegraph-specific issue. Blazegraph implements the standard, and the behavior that you observe is simply the standard behavior: blank nodes are just not designed to be "referenceable". What you want to achieve would not work with any standards-compliant SPARQL endpoint.
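
To make this concrete, a sketch using a made-up restriction triple: SPARQL 1.1 Update disallows blank nodes in the quad data of DELETE DATA altogether, and in a pattern-based DELETE a blank node or variable in the WHERE clause merely matches - it cannot reference one specific stored node:

# Rejected by any standards-compliant endpoint: blank nodes
# may not appear in DELETE DATA.
DELETE DATA {
  <http://example.org/onto#Role> <http://www.w3.org/2000/01/rdf-schema#subClassOf> _:b0 .
}

# Legal, but ?r matches ANY blank node in object position here;
# there is no way to pin down the particular node that was stored.
DELETE { ?s <http://www.w3.org/2000/01/rdf-schema#subClassOf> ?r }
WHERE  {
  ?s <http://www.w3.org/2000/01/rdf-schema#subClassOf> ?r .
  FILTER (isBlank(?r))
}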

That said, there are a couple of workarounds / mechanisms by which people typically deal with this problem:

1.) The simplest way would be to eliminate blank nodes from the ontologies -- this could be done by some pre-processing of the ontologies prior to loading them. However, given that you want to reuse existing ontologies, I agree this is not really an elegant solution and therefore probably not the way you want to go.

2.) The second option is using named graphs for data management. Blazegraph supports a quads mode, and what you can do is load each ontology into a separate named graph (the Blazegraph and Sesame APIs fully support this use case). Concretely: take your ontology and load it into a named graph, say myNamespace:ontology1. You can then delete all the statements in that named graph through the API (i.e., without explicitly listing the statement subjects, predicates, and objects); see the sketch after this list. Note, however, that Blazegraph does not currently support inference in quads mode, so if you need truth maintenance this might not be an option for you.

3.) Blazegraph offers a "told bnodes" mode, see com.bigdata.rdf.store.AbstractTripleStore.Options.STORE_BLANK_NODES (a configuration sketch follows below). When turned on, Blazegraph will store blank nodes "exactly as given", allowing you to query (and delete) them again using the same identifiers. Note, however, that enabling told bnodes mode means deviating from the standard. Still, in this mode it should be possible to delete triples with blank nodes again. Note that API handling for told bnodes may not always be intuitive (there's an open ticket, see https://jira.blazegraph.com/browse/BLZG-1915).
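
For option 2.), a minimal sketch of the load/delete cycle in SPARQL UPDATE (the graph IRI is made up for illustration):

# quads mode: load one ontology into its own named graph
LOAD <http://spec.edmcouncil.org/fibo/FND/Parties/Roles/> INTO GRAPH <http://example.org/graphs/ontology1> ;

# later: remove that ontology wholesale, blank nodes included
DROP GRAPH <http://example.org/graphs/ontology1>

For option 3.), the option would be set in the properties used to create the journal or namespace. A sketch, assuming the usual Blazegraph convention that the property name mirrors the Options constant:

# enable "told bnodes" mode (deviates from the standard)
com.bigdata.rdf.store.AbstractTripleStore.storeBlankNodes=true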

Best, Michael

igor-kim commented 8 years ago

@redskate ,

There is another trick which could be useful. OWL does not require that all the classes, properties, restrictions, etc. defined in an ontology be linked to that ontology, but in many cases it is useful. There is a concept of annotation properties in OWL: https://www.w3.org/2007/OWL/wiki/Syntax#Annotation_Properties One of the suggested annotations is rdfs:isDefinedBy.

If you can ensure that the IRIs of classes and properties used as subjects in an ontology never occur as subjects in any other ontology (basically, if all your ontologies are segregated and each ontology defines its own properties and classes), there is a way to maintain ontologies in the triplestore:

1. Load the ontology into a temporary store (it can be configured to not run inferencing), collect all the subjects defined in the ontology, and link them to the ontology resource:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl:  <http://www.w3.org/2002/07/owl#>

INSERT {
  ?s rdfs:isDefinedBy ?ontology .
} WHERE {
  ?ontology a owl:Ontology .
  ?s ?p ?o .
  FILTER (!sameTerm(?s, ?ontology)) .
}

2. Then load the ontology into the main triplestore.

3. To update or remove the triples of a particular ontology, use ?s rdfs:isDefinedBy <http://targetOntology>, for example:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

DELETE {
  ?s ?p ?o .
} WHERE {
  ?s rdfs:isDefinedBy <http://targetOntology> .
  ?s ?p ?o .
}

4. To update an ontology in the triplestore you might want to compute the difference between the old and the new version, because inferred triples depend on the ontology: removing an ontology also removes all of its inferencing results, and adding the new version triggers a full inferencing cycle. So adding only the statements that are new, and removing only those statements that no longer appear in the new version, can significantly improve ontology update time.
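
A minimal sketch of such a diff, assuming both versions are staged in named graphs of a scratch quads store (the graph IRIs are made up); the same query with v1 and v2 swapped yields the statements to add:

# statements in the old version that are absent from the new one:
# these are the candidates to DELETE from the main store
SELECT ?s ?p ?o WHERE {
  GRAPH <http://example.org/staging/v1> { ?s ?p ?o }
  FILTER NOT EXISTS {
    GRAPH <http://example.org/staging/v2> { ?s ?p ?o }
  }
}

Keep in mind that triples involving blank nodes will not line up across two loads (their labels differ), which is the same blank node identity problem again.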

Best regards, Igor

donpellegrino commented 8 years ago

One approach I have used for a read-only or add-only collection of triples is to stage all the individuals in a bulk-load file (.n3). A namespace is then used to load the individuals with inference disabled, for load performance. Next, I load the .owl files into the namespace and enable inference. When the ontology changes or inferred triples need to be backed out, I just drop and recreate the namespace with the new data and .owl files. New bulk-load files can be added in this scenario as long as they are saved as .n3 in addition to being added to the namespace.

A limitation of this approach is that it does not account for persisting new individuals added via SPARQL over time. Any triples added to the namespace outside of bulk loads are lost when the namespace is dropped and recreated from the bulk-load and .owl files.

redskate commented 8 years ago

Dear Michael, Igor-kim and donpellegrino,

thank you so much for sending me your nice ideas on how to cope with this (simple) issue. I do need a repository with truth maintenance; quads alone could do the job, but - since (although not yet understandable to me) there is no truth maintenance with quads on Blazegraph - they are not a solution when using Blazegraph.

If the (current) standard does not cover processing blank nodes in this way, it somehow seems to be incomplete. As OWL has a counterpart in an RDF format and can be loaded from a file, it should also be possible to unload it from a file. From this lack of symmetry arises the issue of cleanly adding & retracting RDF statements in a repo, independently of truth maintenance. This does not mean that a company like SYSTAP could not implement it anyway (...), since W3C only releases recommendations...

So I understand that the current "workaround" is to process the blank nodes coming from a Sesame 2.8 model (instantiated on some ontology) by transforming them into named nodes, store all the gathered N-Triples, and use them afterwards to retract the same ontology. Thank you for the hint to the (undocumented?) option com.bigdata.rdf.store.AbstractTripleStore.Options.STORE_BLANK_NODES (my Google search did not find a documentation chapter on it).

The road still seems to need some further paving ...