EnvironmentOntology / gaz

An open source gazetteer constructed on ontological principles

Decide how to modularize GAZ such that individual subsets can be managed in github #21

Open cmungall opened 5 years ago

cmungall commented 5 years ago

@rctauber how were you planning to split things into modules? I see you have a breakdown by country just now. Do you include everything that is located in a country, including geographic features such as lakes, rivers and the like? What about features that overlap two countries?

beckyjackson commented 5 years ago

The country modules are everything that is related by either located_in or subClassOf. I'm not sure how overlapping features are handled currently in GAZ, but the modules would reflect that.
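A minimal sketch of that membership rule, assuming "related by" is read transitively (a town located in a region located in a country lands in the country module). The `ex:` identifiers and the edge list are hypothetical toy data, not GAZ content, and this is not how the modules were actually generated.

```python
from collections import defaultdict, deque

def build_module(edges, country):
    """Collect every term that reaches `country` via located_in or subClassOf."""
    # Reverse index (parent -> children) over the two relations of interest.
    children = defaultdict(set)
    for child, relation, parent in edges:
        if relation in ("located_in", "subClassOf"):
            children[parent].add(child)

    # Breadth-first walk down from the country.
    module, queue = {country}, deque([country])
    while queue:
        for child in children[queue.popleft()]:
            if child not in module:
                module.add(child)
                queue.append(child)
    return module

# Hypothetical toy data: RiverX is asserted as located_in two countries.
edges = [
    ("ex:RegionA", "located_in", "ex:Country1"),
    ("ex:TownA", "located_in", "ex:RegionA"),
    ("ex:RiverX", "located_in", "ex:Country1"),
    ("ex:RiverX", "located_in", "ex:Country2"),
    ("ex:TownA", "overlaps", "ex:Country2"),  # ignored: not a module relation
]
module_1 = build_module(edges, "ex:Country1")
module_2 = build_module(edges, "ex:Country2")
```

With this rule, `ex:RiverX` ends up in both modules, which is exactly the overlap case in question.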

We originally discussed starting with countries, then expanding to other subsets like oceans and seas.

But if overlapping features appear in multiple modules (and I imagine there will be overlap between things like countries and oceans and seas), it will be hard to make sure things stay up-to-date if we are using the modules to develop...

pbuttigieg commented 5 years ago

@rctauber

> I'm not sure how overlapping features are handled currently in GAZ, but the modules would reflect that.

As long as they're of different types, I think there shouldn't be conflicts in the subClassOf hierarchies. The RO:overlaps relation and its subproperties can be (is?) used to assert this sort of mereotopology. Even if these are in different modules, this should hold as long as there are some checks in place to make sure classes/instances are present across modules.

On that note, @cmungall and I had several conversations over the years about the need to generalise spatial relations in ontologies like BSPO and RO to the planetary science case. I think GAZ will need these too. @cmungall time for an RO-geo subset? Branching off to #24

beckyjackson commented 5 years ago

> As long as they're of different types, I think there shouldn't be conflicts in the subClassOf hierarchies.

What about the 'located in' hierarchies, though? The modules include subclasses and located in. For example, say a river is located in two countries and we need to update the label of that river. Even if we check for 'overlaps', how do we know which copy is newer? I guess I could write a script that takes the changes from the most recently updated module, but it may get complicated.
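For illustration, a "newest module wins" script might look roughly like the sketch below. The module records, dates, and `ex:` terms are all hypothetical, and the approach assumes each module carries a single last-updated date.

```python
from datetime import date

def merge_labels(modules):
    """Merge per-module labels, keeping the label from the most recently updated module."""
    merged = {}  # term -> (module update date, label)
    for module in modules:
        for term, label in module["labels"].items():
            if term not in merged or module["updated"] > merged[term][0]:
                merged[term] = (module["updated"], label)
    return {term: label for term, (_, label) in merged.items()}

# Hypothetical modules that both contain ex:RiverX with different labels.
modules = [
    {"name": "country1", "updated": date(2019, 5, 1),
     "labels": {"ex:RiverX": "River X"}},
    {"name": "country2", "updated": date(2019, 5, 6),
     "labels": {"ex:RiverX": "X River", "ex:TownB": "Town B"}},
]
merged = merge_labels(modules)
```

This is where it gets complicated: a module-level date says nothing about which term was edited, so a newer module could still carry a stale copy of the river's label.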

How do we determine the initial conversion is not lossy?

Before we tackle the above problem, I think this is the more important issue.
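One way to check this, sketched under the assumption that both GAZ and the modules can be loaded as sets of (subject, predicate, object) triples: compare the original with the union of all modules. The triples below are hypothetical toy data.

```python
def conversion_report(original, modules):
    """Compare the original triples with the union of all module triples."""
    combined = set().union(*modules.values())
    return {
        "missing": original - combined,  # in the original but in no module
        "extra": combined - original,    # in some module but not in the original
    }

# Hypothetical triples standing in for GAZ and two country modules.
t1 = ("ex:TownA", "located_in", "ex:Country1")
t2 = ("ex:TownB", "located_in", "ex:Country2")
t3 = ("ex:SeaS", "subClassOf", "ex:Sea")
t4 = ("ex:TownB", "label", "Town B")

original = {t1, t2, t3}
modules = {"country1": {t1}, "country2": {t2, t4}}
report = conversion_report(original, modules)
```

The conversion is lossless only if `missing` is empty; anything in `extra` was introduced by the split and deserves a look as well.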

On another note, I can regenerate the modules from GAZ to keep them up-to-date, but I'm using a version of ROBOT that has a few unreleased features. The two main ones are improved templating and use of Jena's TDB feature to store a dataset on-disk (which makes querying much faster). I'm pushing to get the updated templating merged in, then I need to make a PR for the Jena stuff. I don't want to include a custom ROBOT JAR in this repo since there are already many large files.

As soon as these features are released, I can add the rules to the Makefile to generate modules so that anybody can do this. That said, it doesn't solve our problem of using modules to actually build GAZ, but at least the modules can be kept up-to-date.

cmungall commented 5 years ago

@rctauber

> What about the 'located in' hierarchies, though? The modules include subclasses and located in. For example, say a river is located in two countries and we need to update the label of that river

Not sure if I am totally following. This issue is about modularization rather than labels; it sounds like you may also be making unique labels? (see #26).

But in answer to the main question, it should not be possible for an entity to be in RO:located-in two locations where those locations do not overlap (by definition). Thus if we choose non-overlapping units as the modules and placement in the modules is determined by located-in, then nothing should be in more than one module. But note:

cmungall commented 5 years ago

Let me also state a few assumptions to check I'm on the same page as everyone:

beckyjackson commented 5 years ago

> This issue is about modularization rather than labels; it sounds like you may also be making unique labels?

Sorry, I wasn't super clear. I was just using that as an example of what happens if we want to update the label of an entity that exists in two modules. This wouldn't be a problem if we are able to define non-overlapping modules, as you suggest above.
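Whether a given set of modules really is non-overlapping can be checked mechanically. A rough sketch, assuming each module is available as a set of term IDs (the module names and `ex:` terms here are hypothetical):

```python
from collections import defaultdict

def shared_terms(modules):
    """Return terms that appear in more than one module, with the modules holding them."""
    owners = defaultdict(list)
    for name, terms in modules.items():
        for term in terms:
            owners[term].append(name)
    return {term: sorted(names) for term, names in owners.items() if len(names) > 1}

# Hypothetical modules: ex:RiverX was placed in both.
modules = {
    "country1": {"ex:TownA", "ex:RiverX"},
    "country2": {"ex:TownB", "ex:RiverX"},
}
conflicts = shared_terms(modules)
```

An empty result would mean every term has exactly one home module, so an edit only ever needs to happen in one place.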

I agree with your stated assumptions.

cmungall commented 5 years ago

@rctauber going back to your comment from May 6. What are your plans for robot templates here?

beckyjackson commented 5 years ago

I don't have templates for the modules right now, but I can always make them if need be. I'm starting to see that ROBOT is having some trouble with any entities that are both named individuals and classes. For example, GAZ:00005229:

<!-- http://purl.obolibrary.org/obo/GAZ_00005229 -->

<owl:Class rdf:about="http://purl.obolibrary.org/obo/GAZ_00005229">
    <rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Vennesla</rdfs:label>
    <rdfs:subClassOf rdf:resource="http://purl.obolibrary.org/obo/GAZ_00002718"/>
    <obo:IAO_0000115 rdf:datatype="http://www.w3.org/2001/XMLSchema#string">A populated place.</obo:IAO_0000115>
    <oboInOwl:hasOBONamespace rdf:datatype="http://www.w3.org/2001/XMLSchema#string">GAZ</oboInOwl:hasOBONamespace>
    <oboInOwl:id rdf:datatype="http://www.w3.org/2001/XMLSchema#string">GAZ:00005229</oboInOwl:id>
</owl:Class>

and...

<!-- http://purl.obolibrary.org/obo/GAZ_00005229 -->

<owl:NamedIndividual rdf:about="http://purl.obolibrary.org/obo/GAZ_00005229">
    <obo:RO_0001025 rdf:resource="http://purl.obolibrary.org/obo/GAZ_00012611"/>
</owl:NamedIndividual>

I'm trying to use robot filter to create a "bucket" of things missing from the country modules, but filter isn't working for these types of terms. We may need to resolve #20 before proceeding with modules.
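To get a list of the affected terms, something like the following could work. It is a standard-library-only sketch that assumes the file is RDF/XML in the shape shown above, with explicit `owl:Class` and `owl:NamedIndividual` elements; it would miss declarations written as `rdf:Description` with an `rdf:type`. The sample document is trimmed from the snippet above, and declaring GAZ_00002718 as a class is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
OWL = "http://www.w3.org/2002/07/owl#"

def find_punned(rdfxml):
    """Return IRIs declared both as owl:Class and as owl:NamedIndividual."""
    root = ET.fromstring(rdfxml)
    about = f"{{{RDF}}}about"
    classes = {el.get(about) for el in root.iter(f"{{{OWL}}}Class")}
    individuals = {el.get(about) for el in root.iter(f"{{{OWL}}}NamedIndividual")}
    # Anonymous classes have no rdf:about and show up as None; drop them.
    return sorted((classes & individuals) - {None})

sample = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:owl="http://www.w3.org/2002/07/owl#"
         xmlns:obo="http://purl.obolibrary.org/obo/">
  <owl:Class rdf:about="http://purl.obolibrary.org/obo/GAZ_00005229"/>
  <owl:NamedIndividual rdf:about="http://purl.obolibrary.org/obo/GAZ_00005229">
    <obo:RO_0001025 rdf:resource="http://purl.obolibrary.org/obo/GAZ_00012611"/>
  </owl:NamedIndividual>
  <owl:Class rdf:about="http://purl.obolibrary.org/obo/GAZ_00002718"/>
</rdf:RDF>"""
punned = find_punned(sample)
```

The resulting list could feed the discussion in #20 about which terms to convert.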

cmungall commented 5 years ago

I agree we should fix the punning first.

My question was more along the lines of what you thought was best for the overall strategy. One possibility would be to maintain the entire ontology as a TSV and generate via robot template. I thought you might be thinking along these lines. There would be some definite advantages here. But it could be awkward editing the relational graph. And having mixed mode TSV and OWL may just add more complexity to what is already likely to turn into quite a complex build.

It may be the case that we don't need to worry about templates just now and can focus on modularizing the OWL (but still, fixing the punning would be good).

beckyjackson commented 5 years ago

My plan was to modularize first, and then determine if we want to move to templates later. So I think we are in agreement there.

I think we should discuss #20 on our next GAZ call and (perhaps) move forward on converting all those into individuals. Then, I could work on building a "bucket" that contains all the terms not in one of the country modules.
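In set terms the bucket is just the complement of the country modules. A minimal sketch, assuming term IDs are available as plain sets (all names below are hypothetical):

```python
def build_bucket(all_terms, modules):
    """Return every term that is not covered by any of the given modules."""
    covered = set().union(*modules.values())
    return all_terms - covered

# Hypothetical term sets; ex:SeaS belongs to no country.
all_terms = {"ex:TownA", "ex:TownB", "ex:SeaS"}
modules = {"country1": {"ex:TownA"}, "country2": {"ex:TownB"}}
bucket = build_bucket(all_terms, modules)

# Sanity check: the modules plus the bucket should cover every term.
complete = set().union(bucket, *modules.values()) == all_terms
```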