OWL Provenance aware ontology parallel version

roxanadangerm commented 5 years ago

I love this Data Science Ontology initiative and the full mission of the project.

I understand the explanation in the FAQ "Why doesn't the DSO use the Semantic Web standards?" about the importance of an ontology language based on the lambda calculus. However, I think there may be ways to overcome this drawback (e.g. http://west.uni-koblenz.de/lambda-dl), and the advantages of making inferences and of visualizing and reviewing the correctness of the axioms would take the DS ontology to another level. So I came up with this ontology, which contains the most important upper-level concepts and addresses the problem of how to link DS theory, its implementations (the annotations in your work), and the provenance data generated from executions. I have added the main details of the k-Means algorithm so that we can discuss a realistic example.

I haven’t yet added a provenance graph example using the k-Means program examples from your paper, but let’s see whether all this makes sense to you all.

The main ideas are:

PROV-O is imported, and all the objects, morphisms and annotations (I prefer to call them implementations) should be placed under the corresponding concepts of the provenance ontology
Every object is a prov:Entity or a prov:Agent
Every morphism (function) is a DS entity (a subtype of prov:Entity)
Implementations are also entities, describing the exact data type implementation, the ordering of parameters in morphisms, etc. (although there are a few examples, such as the Python_3.6 class and the RD_Python_3.6 instance, where I still have doubts about the best modelling approach)
A DS activity occurs when a DS morphism implementation is executed

Having all these, a DS program can be described as a graph of instances of DS activities, agents and entities.

DataScienceOnto.owl.zip

Many thanks in advance, Roxana

epatters commented 5 years ago

Thanks very much for all the thought and effort you put into this, @roxanadangerm. I'm definitely in favor of better interoperation with Semantic Web standards, such as OWL and PROV-O.

Once I understand this better and we talk through my questions, I'll modify the code that exports the concepts, annotations, and wiring diagrams as RDF to conform to some version of the scheme you have suggested. Thanks again!

roxanadangerm commented 5 years ago

Fantastic. Thank you!

epatters commented 5 years ago

I've opened the PR above as a starting point for further discussion. Here is the ontology in its current state.

epatters commented 5 years ago

I merged the above PR, after fixing several omissions in the OWL export of type and function annotations. Here is the updated ontology.

AFAIK, the only remaining task is to revamp the export of wiring diagrams (provenance graphs). I'll handle that in a separate PR.

roxanadangerm commented 5 years ago

Many thanks for exporting the ontology to OWL and adding support for provenance. There is one part that I struggle to understand: how are two annotations that refer to the same function/type concept linked? From what I see in the current ontology, Python or R function annotations (that is, the instances of FunctionAnnotation) are all of type FunctionAnnotation. I was hoping that all annotations of a particular function would be linked to a specific FunctionConcept (or have a property that links them).

Suppose read-csv is implemented in both Python and R. According to the current ontology you have:

FunctionConcept(read-csv): read-csv is an instance of FunctionConcept
read-csv is_a prov:Activity
FunctionAnnotation(python:read-csv) -> python:read-csv is an instance of FunctionAnnotation
FunctionAnnotation(r:read-csv) -> r:read-csv is another instance of FunctionAnnotation

I propose to add:

1- the property is_referred_to (or any other property with the same semantics)

2- the two assertions:
python:read-csv is_referred_to read-csv -> the is_referred_to property gives the FunctionConcept behind this Python implementation
r:read-csv is_referred_to read-csv -> the is_referred_to property gives the FunctionConcept behind this R implementation

So now the Python and R annotations for read-csv are linked through the underlying concept of reading a CSV file.
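As a rough illustration of this proposal, the sketch below stores the statements as plain (subject, predicate, object) tuples and looks up every annotation that points at a given concept. It is a minimal sketch, not the project's export code: the tuple encoding and the helper `annotations_of` are assumptions made for this example, and only the identifiers `read-csv`, `python:read-csv`, `r:read-csv` and `is_referred_to` come from the discussion.

```python
# Hypothetical triples mirroring the read-csv example in this thread.
triples = {
    ("read-csv", "rdf:type", "FunctionConcept"),
    ("python:read-csv", "rdf:type", "FunctionAnnotation"),
    ("r:read-csv", "rdf:type", "FunctionAnnotation"),
    # Proposed property linking each annotation to the concept behind it.
    ("python:read-csv", "is_referred_to", "read-csv"),
    ("r:read-csv", "is_referred_to", "read-csv"),
}


def annotations_of(concept, triples):
    """Return, in sorted order, the annotations linked to `concept`."""
    return sorted(
        subject
        for (subject, predicate, obj) in triples
        if predicate == "is_referred_to" and obj == concept
    )


print(annotations_of("read-csv", triples))
# → ['python:read-csv', 'r:read-csv']
```

With such a property in place, finding all language-specific implementations of a concept becomes a single lookup instead of a traversal of annotation definitions.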

If we agree on this, I can continue working on examples of provenance that use the concepts and instances we are defining.

epatters commented 5 years ago

I'm not sure what you mean by your third sentence. There are Python and R annotations of type TypeAnnotation as well, not just FunctionAnnotation.

You're correct that some annotations are currently missing links to the corresponding concepts. That is the case only for function annotations whose definition is not a single concept, but a composition of multiple concepts (such as R's read.csv). They are missing because the export of compound definitions as wiring diagrams (or even expression trees) is not properly implemented. I'm working on that.

We could also add a property is_referred_to, as you suggest, which holds whenever an annotation uses a concept anywhere in its (possibly compound) definition. This would be less semantically precise than the expression tree or wiring diagram, but presumably easier to work with.
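One way such a property could be derived from a compound definition is to walk the expression tree and collect every concept it mentions. The sketch below assumes a hypothetical encoding in which a tree is either a concept name or a tuple `(operator, child, ...)`; the operator names, the concept names and the annotation identifier are placeholders, not the actual DSO definitions.

```python
def concepts_in(expr):
    """Collect every concept name used anywhere in an expression tree.

    Assumed encoding: a tree is either a concept name (str) or a tuple
    (operator, child, child, ...), e.g. ("compose", "a", "b").
    """
    if isinstance(expr, str):
        return {expr}
    _operator, *children = expr
    found = set()
    for child in children:
        found |= concepts_in(child)
    return found


# Placeholder compound definition of a hypothetical annotation.
definition = ("compose", "concept-a", ("product", "concept-b", "concept-c"))

# One is_referred_to triple per concept used in the definition.
links = [
    ("example:annotation", "is_referred_to", concept)
    for concept in sorted(concepts_in(definition))
]
for link in links:
    print(link)
```

The flat list of links discards how the concepts are composed, which is the loss of precision mentioned above, but it is straightforward to query.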

epatters commented 5 years ago

Update: The RDF/OWL export of wiring diagrams has been revamped and is working again.

It remains to supplement our custom vocabulary for wiring diagrams with vocabulary from the PROV Ontology, particularly prov:used, prov:wasGeneratedBy, prov:wasInformedBy, and prov:wasDerivedFrom. The wiring diagrams will then double as "provenance graphs" and will be consumable by third-party tools that understand such graphs.

Note that PROV-O is more general than our vocabulary and will be a lossy representation. Thus it makes sense to use both vocabularies simultaneously.

epatters commented 5 years ago

Update: I've added PROV-O support to the RDF/OWL export of wiring diagrams. This means that wiring diagrams can now double as provenance graphs.

Here's how the PROV-O encoding works. Boxes become instances of prov:Activity and their ports become instances of prov:Entity. Input ports are prov:used by their boxes, and output ports are prov:wasGeneratedBy their boxes. A box is prov:wasInformedBy another box if there are one or more wires from the latter box to the former box. Finally, an input of a box is prov:wasDerivedFrom an output of another box if there is a wire from the latter port to the former port. (Usually the prov:wasDerivedFrom property will represent an identity transformation, but in general it can represent an implicit conversion.) Thus the topology of the wiring diagram is naturally represented by the topology of a provenance graph.
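The encoding rules described here can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actual export code: the `Box` class, the wire format, the port identifier scheme (`box/in/port`, `box/out/port`) and the two example boxes are all hypothetical; only the PROV-O terms are taken from the description.

```python
from dataclasses import dataclass, field


@dataclass
class Box:
    """A box in a wiring diagram, with named input and output ports."""
    name: str
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)


def to_prov(boxes, wires):
    """Encode a wiring diagram as a set of PROV-O style triples.

    Each wire is ((source_box, output_port), (target_box, input_port)).
    """
    triples = set()
    for box in boxes:
        triples.add((box.name, "rdf:type", "prov:Activity"))
        for port in box.inputs:
            port_id = f"{box.name}/in/{port}"
            triples.add((port_id, "rdf:type", "prov:Entity"))
            triples.add((box.name, "prov:used", port_id))
        for port in box.outputs:
            port_id = f"{box.name}/out/{port}"
            triples.add((port_id, "rdf:type", "prov:Entity"))
            triples.add((port_id, "prov:wasGeneratedBy", box.name))
    for (src_box, src_port), (tgt_box, tgt_port) in wires:
        # The target input port is derived from the source output port.
        triples.add((f"{tgt_box}/in/{tgt_port}", "prov:wasDerivedFrom",
                     f"{src_box}/out/{src_port}"))
        # Using a set keeps one wasInformedBy triple per pair of boxes,
        # however many wires connect them.
        triples.add((tgt_box, "prov:wasInformedBy", src_box))
    return triples


# Hypothetical two-box diagram: a reader feeding a model-fitting step.
boxes = [
    Box("read", inputs=["path"], outputs=["table"]),
    Box("fit", inputs=["data"], outputs=["model"]),
]
wires = [(("read", "table"), ("fit", "data"))]
prov_triples = to_prov(boxes, wires)
for triple in sorted(prov_triples):
    print(triple)
```

Because every wire yields both a port-level and a box-level triple, the provenance graph can be read at either granularity.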

@roxanadangerm, any feedback would be appreciated. The prototype OWL ontology is attached.

roxanadangerm commented 5 years ago

Hi Evan, I haven't reviewed the OWL ontology yet (I will do so next week), but the explanation of the PROV-O encoding looks good. We have been working in parallel. I have been looking at: 1) how to extend the ontology to make it "friendly/browsable/self-explanatory", so I added some high-level concepts that clearly separate the steps of a DS system from the resources used during its execution; 2) how to keep track of library versions; and 3) how traces can be constructed from the code.

Please find attached my ontology (it will need to be refactored to follow the nomenclature of the DS ontology) and a few images for us to discuss:

Fig. 1, Fig. 2 - Extension of the hierarchies: 1) morphisms (a specialisation of prov:Activity) are separated into a hierarchy that covers data preparation, learning algorithm, model generation and model validation; 2) prov:SoftwareAgent contains servers, databases, programming languages and DS libraries; 3) DS entities have data types, parameters and implementations. For example, the class n_clusters_sklearn_0_20 is an implementation of the well-known k parameter for the current version of scikit-learn. It has the property implementationNamedAs, whose only possible value is n_clusters, because that is the parameter name for the k-means algorithm. Given this, the Implementation class should be extended to have one class for each DS method or parameter in each package; these classes will serve as a bridge between a morphism (a generic function) and its current implementation (or annotation) in a particular library. Provenance traces can contain nodes for general morphisms or for implementations, and every implementation node can be generalised to the morphism or entity that it implements. So inferences can be made at both levels: morphisms or their specific implementations (or annotations).

Fig. 3, .graph file - Example of a possible provenance trace that considers all the above elements: it tells the story of a server in AWS that installed pandas, sklearn and the iris dataset, executed a read_csv, defined a clustering model and finally generated the clustering for the iris data by executing the fitting method. The figure shows the relevant parts of the TBox and the ABox, but it might be easier to open the graph file with the OntoGraf plugin in Protégé (remove the .txt from the extension).
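To make the two levels of inference concrete, here is a minimal sketch of generalising an implementation-level trace to the morphism level. It assumes a plain dictionary from implementation nodes to what they implement; apart from `n_clusters_sklearn_0_20`, every name (`fit_sklearn_0_20`, `k_parameter`, `model_fitting`, the `implements` mapping and the `generalise` helper) is hypothetical and not taken from the attached ontology.

```python
# Hypothetical bridge from implementation nodes to the generic
# entity or morphism that each one implements.
implements = {
    "n_clusters_sklearn_0_20": "k_parameter",
    "fit_sklearn_0_20": "model_fitting",
}


def generalise(trace, implements):
    """Lift a trace of (subject, predicate, object) triples to the generic level.

    Nodes without an entry in `implements` are left unchanged.
    """
    return [
        (implements.get(subject, subject), predicate, implements.get(obj, obj))
        for (subject, predicate, obj) in trace
    ]


# Implementation-level trace: the fitting method used the k parameter.
trace = [("fit_sklearn_0_20", "prov:used", "n_clusters_sklearn_0_20")]
print(generalise(trace, implements))
# → [('model_fitting', 'prov:used', 'k_parameter')]
```

The same trace can then be queried either with library-specific names or with the generic concepts, depending on the question being asked.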

[prov_trace_example.graph.txt](https://github.com/IBM/datascienceontology/files/2795451/prov_trace_example.graph.txt)

DataScienceOnto.owl.zip

epatters commented 5 years ago

Hi @roxanadangerm, thanks for all this effort. There's no doubt that package versions need to be tracked, so that's good.

When I try to open the graph in OntoGraf, I get an error message saying "the graph does not correspond to your active ontology." I guess you need to attach your version of the ontology as well? (Note that GitHub will let you upload a zip file, so you can preserve the file extensions.)

Anyway, hopefully we can discuss sometime next week.

roxanadangerm commented 5 years ago

Thanks. Sorry, I forgot to add the .owl file. I've just updated my previous comment to add the ontology. Let's organise a discussion for next week (Tuesday or Wednesday works best for me).

epatters commented 5 years ago

I'm closing this issue, as the main task of improving the RDF/OWL export is mostly finished. I've split the remaining related issues raised here into their own issues: #11, #12, and ibm/semanticflowgraph#10.

IBM/datascienceontology, issue #8