ExPaNDS-eu / ExPaNDS-experimental-techniques-ontology

EU Photon and Neutron Ontologies (task 3.2)

A question about subsumption hierarchy: are there really two hierarchies here? #72

Open paulmillar opened 1 year ago

paulmillar commented 1 year ago

Hi,

PaNET uses the RDFS subClassOf relationship between a term and any broader terms. Here is a randomly chosen example, showing PaNET01272:

<http://purl.org/pan-science/PaNET/PaNET01184> a owl:Class;
#       [...]
        rdfs:label       "x-ray scattering" .

<http://purl.org/pan-science/PaNET/PaNET01271> a owl:Class;
#       [...]
        rdfs:subClassOf  <http://purl.org/pan-science/PaNET/PaNET01184> ;
        rdfs:label       "microfocus x-ray scattering" .

<http://purl.org/pan-science/PaNET/PaNET01272> a owl:Class;
        rdfs:subClassOf  <http://purl.org/pan-science/PaNET/PaNET01004> ;
        rdfs:subClassOf  <http://purl.org/pan-science/PaNET/PaNET01271> ;
        rdfs:label       "nanofocus x-ray scattering" .

In the above example, the term "nanofocus x-ray scattering" is identified as a more specific term of "microfocus x-ray scattering" and "x-ray scattering". Put another way, searching for x-ray scattering should yield results that are tagged microfocus x-ray scattering and nanofocus x-ray scattering.
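To make the search expectation concrete, here is a minimal sketch that follows the three subClassOf assertions quoted above downwards from "x-ray scattering". The function name `narrower_terms` and the `DIRECT` list are illustrative only; they are not part of PaNET or its tooling.

```python
from collections import defaultdict

# Direct rdfs:subClassOf assertions copied from the Turtle above, as (child, parent).
DIRECT = [
    ("PaNET01271", "PaNET01184"),  # microfocus x-ray scattering -> x-ray scattering
    ("PaNET01272", "PaNET01004"),  # nanofocus x-ray scattering -> PaNET01004
    ("PaNET01272", "PaNET01271"),  # nanofocus x-ray scattering -> microfocus x-ray scattering
]


def narrower_terms(term, direct):
    """Return every term reachable from `term` by following subClassOf edges downwards."""
    children = defaultdict(set)
    for child, parent in direct:
        children[parent].add(child)

    found, stack = set(), [term]
    while stack:
        for child in children[stack.pop()]:
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


# A search for "x-ray scattering" should also match both narrower terms.
print(sorted(narrower_terms("PaNET01184", DIRECT)))  # prints ['PaNET01271', 'PaNET01272']
```

Note that PaNET01272 is only found because the search walks two edges; a lookup restricted to the directly asserted parents of each term would miss it.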

The problem, if it is one, comes because the same rdfs:subClassOf is also used for more structural relationships; for example,

<http://purl.org/pan-science/PaNET/PaNET00002> a owl:Class;
        rdfs:subClassOf  <http://purl.org/pan-science/PaNET/PaNET00001> ;
#       [...]
        rdfs:label       "defined by experimental probe" .

<http://purl.org/pan-science/PaNET/PaNET00106> a owl:Class;
        rdfs:subClassOf  <http://purl.org/pan-science/PaNET/PaNET00002> ;
        rdfs:label       "microfocussed probe" .

<http://purl.org/pan-science/PaNET/PaNET01271> a owl:Class;
        rdfs:subClassOf  <http://purl.org/pan-science/PaNET/PaNET00106> ;
#       [...]
        rdfs:label       "microfocus x-ray scattering" .

My concern is that I'm not sure that searching for http://purl.org/pan-science/PaNET/PaNET00002 ("defined by experimental probe") makes much sense. Almost all terms are a subclass of PaNET00002 (the few that are not could be due to a problem with the ontology).

I could imagine searching for subclasses of PaNET00002 would make sense to a researcher. For example, searching for data identified by PaNET00106 ("microfocussed probe") would select only data created using a "microfocussed probe".

This question can be rephrased (in more practical terms) as the question: should PaNET01272 include all parent classes; i.e.,

<http://purl.org/pan-science/PaNET/PaNET01272> a owl:Class;
        rdfs:subClassOf  <http://purl.org/pan-science/PaNET/PaNET00001> ;
        rdfs:subClassOf  <http://purl.org/pan-science/PaNET/PaNET00002> ;
        rdfs:subClassOf  <http://purl.org/pan-science/PaNET/PaNET00003> ;
        rdfs:subClassOf  <http://purl.org/pan-science/PaNET/PaNET00100> ;
        rdfs:subClassOf  <http://purl.org/pan-science/PaNET/PaNET00106> ;
        rdfs:subClassOf  <http://purl.org/pan-science/PaNET/PaNET00200> ;
        rdfs:subClassOf  <http://purl.org/pan-science/PaNET/PaNET01004> ;
        rdfs:subClassOf  <http://purl.org/pan-science/PaNET/PaNET01012> ;
        rdfs:subClassOf  <http://purl.org/pan-science/PaNET/PaNET01184> ;
        rdfs:subClassOf  <http://purl.org/pan-science/PaNET/PaNET01271> ;
        rdfs:label       "nanofocus x-ray scattering" .

or only a subset of parent classes, excluding the "photon and neutron technique", "defined by experimental probe" and "defined by experimental physical process" terms.

<http://purl.org/pan-science/PaNET/PaNET01272> a owl:Class;
        rdfs:subClassOf  <http://purl.org/pan-science/PaNET/PaNET00100> ;
        rdfs:subClassOf  <http://purl.org/pan-science/PaNET/PaNET00106> ;
        rdfs:subClassOf  <http://purl.org/pan-science/PaNET/PaNET00200> ;
        rdfs:subClassOf  <http://purl.org/pan-science/PaNET/PaNET01004> ;
        rdfs:subClassOf  <http://purl.org/pan-science/PaNET/PaNET01012> ;
        rdfs:subClassOf  <http://purl.org/pan-science/PaNET/PaNET01184> ;
        rdfs:subClassOf  <http://purl.org/pan-science/PaNET/PaNET01271> ;
        rdfs:label       "nanofocus x-ray scattering" .

For reference, the file source/PaNET.owl in our git repo contains this definition (after converting to Turtle):

<http://purl.org/pan-science/PaNET/PaNET01272> a owl:Class;
        rdfs:subClassOf  <http://purl.org/pan-science/PaNET/PaNET01004> ;
        rdfs:subClassOf  <http://purl.org/pan-science/PaNET/PaNET01271> ;
        rdfs:label       "nanofocus x-ray scattering" .

This is clearly missing the vast majority of the subClassOf assertions.

The question here is: why are these assertions missing?

spc93 commented 1 year ago

I agree that the classes ‘defined by experimental probe’ etc are structural and exist to help organize the ontology. They will not be very useful for the PaN search use-case. This was a deliberate design decision, to support the structure and possibly to help with the future use of properties to define classes (e.g. ‘definedBy some experimentalProbe’). I would rather keep these as they are, if possible.

The second comment relates to the apparently missing superclasses. I think these should be generated by the reasoner and we need to ensure that any query is aware of all the superclasses.

For example, the superclasses of ‘nanofocus x-ray scattering’ are:

Parents of nanofocus x-ray scattering in http://purl.org/pan-science/PaNET/PaNET.owl:

nanofocus x-ray scattering
nanofocussed probe
microfocus x-ray scattering
microfocussed probe
x-ray scattering
scattering technique
defined by experimental physical process
x-ray probe
photon probe
defined by experimental probe
photon and neutron technique

paulmillar commented 1 year ago

Thanks for the feedback @spc93,

If I repeat what you said (to make sure I understand it correctly), your opinion is that each term should contain (explicitly) a complete list of its parent classes.

Just to make it very clear, the term for "nanofocus x-ray scattering" is currently defined in http://purl.org/pan-science/PaNET/PaNET.owl with the following XML:

<owl:Class rdf:about="http://purl.org/pan-science/PaNET/PaNET01272">
    <rdfs:subClassOf rdf:resource="http://purl.org/pan-science/PaNET/PaNET01004"/>
    <rdfs:subClassOf rdf:resource="http://purl.org/pan-science/PaNET/PaNET01271"/>
    <rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">nanofocus x-ray scattering</rdfs:label>
</owl:Class>

It shows (explicitly) only two terms: PaNET01004 "nanofocussed probe" and PaNET01271 "microfocus x-ray scattering". These are the two rdfs:subClassOf elements in the above XML.

If I've understood your comment correctly, you are saying that this term should define explicitly all eleven parent classes, right? There should be eleven rdfs:subClassOf statements; i.e.,

<owl:Class rdf:about="http://purl.org/pan-science/PaNET/PaNET01272">
    <rdfs:subClassOf rdf:resource="http://purl.org/pan-science/PaNET/PaNET01272"/>
    <rdfs:subClassOf rdf:resource="http://purl.org/pan-science/PaNET/PaNET01004"/>
    <rdfs:subClassOf rdf:resource="http://purl.org/pan-science/PaNET/PaNET01271"/>
    <rdfs:subClassOf rdf:resource="http://purl.org/pan-science/PaNET/PaNET00106"/>
    <rdfs:subClassOf rdf:resource="http://purl.org/pan-science/PaNET/PaNET01184"/>
    <rdfs:subClassOf rdf:resource="http://purl.org/pan-science/PaNET/PaNET00200"/>
    <rdfs:subClassOf rdf:resource="http://purl.org/pan-science/PaNET/PaNET00003"/>
    <rdfs:subClassOf rdf:resource="http://purl.org/pan-science/PaNET/PaNET01012"/>
    <rdfs:subClassOf rdf:resource="http://purl.org/pan-science/PaNET/PaNET00100"/>
    <rdfs:subClassOf rdf:resource="http://purl.org/pan-science/PaNET/PaNET00002"/>
    <rdfs:subClassOf rdf:resource="http://purl.org/pan-science/PaNET/PaNET00001"/>
    <rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">nanofocus x-ray scattering</rdfs:label>
</owl:Class>

brianmatthews42 commented 1 year ago

Hi Paul,

My view is it depends on how clever you expect the search to be.

Semantically, the two forms are identical: putting the subClassOf relationships explicitly in the class definition adds no information that was not already implied by the existing subclass relationships (via transitivity). A reasoning engine can infer all the superclasses of a class, so a search for a particular technique could find everything below it in the hierarchy if the reasoning were built into the search engine. But if your search engine is dumber and looks only at the explicitly stated subclass relations, then you would need all the triples to be included.

However, if you were to write out all the relationships explicitly for each class, you would potentially give yourself a maintenance problem, as you might have to change every subclass each time you modified the hierarchy.

A compromise might be to run the reasoning engine once and add the subclasses automatically, generating a fully expanded version. You would need to rerun it each time you make a change.
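As a rough illustration of that compromise, the sketch below computes the transitive closure of the direct subClassOf assertions quoted earlier in this thread and writes out an expanded definition. It is plain Python rather than a real OWL reasoner, the helper names `expand` and `to_turtle` are made up for this example, and because the thread does not quote the parents of every class (PaNET01004, for instance), the result is only part of the full list of superclasses.

```python
PREFIX = "http://purl.org/pan-science/PaNET/"

# Direct (child, parent) assertions quoted in this thread; not the whole ontology.
DIRECT = [
    ("PaNET01272", "PaNET01004"),
    ("PaNET01272", "PaNET01271"),
    ("PaNET01271", "PaNET01184"),
    ("PaNET01271", "PaNET00106"),
    ("PaNET00106", "PaNET00002"),
    ("PaNET00002", "PaNET00001"),
]


def expand(direct):
    """Map each class to all of its superclasses, direct and implied."""
    parents = {}
    for child, parent in direct:
        parents.setdefault(child, set()).add(parent)

    def ancestors(term, seen):
        for parent in parents.get(term, ()):
            if parent not in seen:  # also guards against accidental cycles
                seen.add(parent)
                ancestors(parent, seen)
        return seen

    return {term: ancestors(term, set()) for term in parents}


def to_turtle(term, supers):
    """Render one expanded class definition (prefix declarations omitted)."""
    body = " ;\n".join(
        f"        rdfs:subClassOf  <{PREFIX}{s}>" for s in sorted(supers)
    )
    return f"<{PREFIX}{term}> a owl:Class ;\n{body} ."


expanded = expand(DIRECT)
print(to_turtle("PaNET01272", expanded["PaNET01272"]))
```

Only `DIRECT` would be maintained by hand; the expanded output is regenerated after every change, which is the rerun step mentioned above.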

As usual, no definitive answer, but options and trade offs!

Yes there are two hierarchies - but that's ok, they are for different purposes.
You don't need to expose them both in your search.

hope that makes sense!

Brian

spc93 commented 1 year ago

Hi Brian, If I understood correctly, this is precisely what we do: we run the reasoner and save the results back into the ontology. We do this to allow a dumb reasoner to find all the superclasses.

paulmillar commented 1 year ago

Just to further clarify, the PaNET build script takes a file (containing the list of terms, as a CSV table) and the (mostly) static ontology metadata and builds the resulting ontology. The script takes care to run the output through a reasoner, so this is fully automated.

The problem is that the terms in the ontology files currently saved in the GitHub repo (releases/latest-release/PaNET.owl and source/PaNET.owl) do not contain all the super classes (see above for an example).

To me, this looks like a mistake: each term should contain all super classes. However, I'd like other people's opinion on this.

@spc93 (if I've understood correctly) thinks that all super classes should be stated explicitly.

spc93 commented 1 year ago

Yes that's what I expected but I've not looked directly at the owl file. Using the owlready2 python module, the classes are represented by python classes. Then the .mro method finds all subclasses in a single step.

spc93 commented 1 year ago

I mean superclasses.
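For readers unfamiliar with the method resolution order, here is a small sketch using plain Python classes as stand-ins for the classes owlready2 would create. It does not import owlready2, the class names are invented for the illustration, and only subclass relationships quoted in this thread are modelled.

```python
# Plain Python stand-ins for a few PaNET terms (hypothetical names).
class XRayScattering: ...                 # PaNET01184
class MicrofocussedProbe: ...             # PaNET00106
class NanofocussedProbe: ...              # PaNET01004


class MicrofocusXRayScattering(XRayScattering, MicrofocussedProbe): ...  # PaNET01271


class NanofocusXRayScattering(NanofocussedProbe, MicrofocusXRayScattering): ...  # PaNET01272


# mro() returns the class itself, every direct and inherited base class, and `object`.
names = [cls.__name__ for cls in NanofocusXRayScattering.mro()]
print(names)
```

The list includes XRayScattering and MicrofocussedProbe even though NanofocusXRayScattering names neither as a direct base, which is the single-step lookup described above.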

BalazsBago commented 1 year ago

The OWL specification has a definition for the subClass axiom. Based on this, it looks a bit weird to list all parent classes of a class in the definition of that class, and standard applications for OWL files probably won't handle this situation correctly.

paulmillar commented 1 year ago

@BalazsBago Thanks for your interest and comments.

You're quite right: OWL has an owl:subClass axiom. In OWL v1 this was somewhat vaguely similar to rdfs:subClassOf. With OWL v2, this was made explicit: owl:subClassOf ⇒ rdfs:subClassOf. A similar "problem" exists with owl:Class and rdfs:Class: this was vaguely the same concept under OWL v1, but was made explicit under OWL v2.

In PaNET, we use owl:Class but rdfs:subClassOf. This is somewhat inconsistent and probably should be fixed, but I don't think this is your point. Your argument applies equally to RDFS and OWL.

In terms of this issue: you are right, both rdfs:subClassOf (which is actually being used) and owl:subClassOf are transitive. Formally, this is ∀ A,B,C ∈ Class (A subClassOf B ∧ B subClassOf C) ⇒ A subClassOf C. Therefore, there is no need to state the implied axioms: it is understood from RDFS and OWL semantics.

However, stating only the minimum axioms assumes that the agent consuming these statements has an RDF/OWL reasoner: some software that understands the semantics of RDFS/OWL and is able to generate the implied axioms automatically.

What we're doing here is generating the transitive closure for the subClassOf relationship; that is, we are making explicit the relationships that are implied by RDFS/OWL.

The result is that people can use PaNET without running it through a reasoner. Want to know the super classes of some class? Just look them up. What are the subclasses? Just find all classes with the appropriate subClassOf relationship. This makes using PaNET much easier.
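A minimal sketch of what "just look it up" means in practice, assuming the closure has already been written into the file. The `TRIPLES` set is a small hand-copied subset of the subClassOf statements listed earlier in this thread, and the two helper functions are illustrative names, not part of any PaNET tooling.

```python
# Already-materialised (child, parent) pairs; a subset of the statements in this thread.
TRIPLES = {
    ("PaNET01272", "PaNET01271"),
    ("PaNET01272", "PaNET01184"),
    ("PaNET01272", "PaNET01004"),
    ("PaNET01272", "PaNET00106"),
    ("PaNET01271", "PaNET01184"),
    ("PaNET01271", "PaNET00106"),
}


def superclasses(term):
    """All super classes of `term`: a flat filter, with no traversal or reasoning."""
    return {parent for child, parent in TRIPLES if child == term}


def subclasses(term):
    """All subclasses of `term`, found the same way in the other direction."""
    return {child for child, parent in TRIPLES if parent == term}


print(sorted(subclasses("PaNET01184")))
print(sorted(superclasses("PaNET01272")))
```

Both lookups are a single pass over the stated triples, so the consumer needs no knowledge of RDFS or OWL semantics.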

AFAIK, there is nothing wrong with making explicit the implied axioms. There's no requirement (in RDFS or OWL) to exclude implied relationships/axioms and stating the same thing twice doesn't break anything.

Do you have a specific example that demonstrates the problem, where making implicit axioms explicit causes software to fail?

gkoum commented 1 month ago

I think this can be closed since we now release a reasoned version of the ontology with all subclasses so that a simple SPARQL query without the use of a reasoner can provide all relationships. #134

paulmillar commented 1 month ago

The original intent of this issue was in the direction of #111; however, I suggest we close this issue and focus on #111 instead, as I think that issue is better described.

In terms of progressing, I've tagged this issue with the "OK to close?" label. The intention is to indicate the proposal (from @gkoum) to close this issue. We then give people (involved with this issue) the opportunity to disagree with closing the issue. If nobody disagrees within "a reasonable time", we should close this issue.