CommonCoreOntology / CommonCoreOntologies

The Common Core Ontology Repository holds the current released version of the Common Core Ontology suite.
BSD 3-Clause "New" or "Revised" License

How to model "pure" information content? #59

Closed. tonicebrian closed this issue 2 weeks ago

tonicebrian commented 4 years ago

This is a general modelling question and I wasn't able to find a proper Slack channel or discussion forum for CCO, so sorry for hijacking.

I understand that one of the goals of CCO is integrating several data sources into a common ontology, but what happens when we create data from "within" CCO that has no provenance attached? Let me illustrate with an example. Say we are modelling the Moon landing event:

:MoonLanding cco:occurs_on :1969-07-21

so for me the individual :1969-07-21 is a member of class cco:Day and we all agree on that. But then if I need to access associated data I have several alternatives:

  1. Create a cco:InformationContentEntity and then a cco:InformationBearingEntity and finally use a cco:has_datetime_value. But for me it is unnatural to have a cco:InformationBearingEntity because in my domain there isn't any material object bearing that information. It is a pure fact.
  2. I could use cco:is_tokenized_by on the Information Content Entity, but since this is an annotation I lose the fact that this is a datetime and I cannot reason about it. And I still have the problem of two individuals (the instance of cco:Day and the instance of cco:InformationContentEntity) pointing to the exact same concept in my domain.
  3. I could import the OWL Time ontology and make :1969-07-21 both an instance of cco:Day and time:ProperInterval. Then I solve the problem by having a canonical point in time described using OWL Time but then I worry that by deviating from the semantics of CCO I will face problems in the future.

So my question is: what's the best approach for this modelling problem? Or, more generally, is the pattern Entity -> InformationContentEntity -> InformationBearingEntity -> data properties intended only for consolidating data from different sources, or is it meant for modelling every piece of information in my domain? In which situations should I create a simple owl:DatatypeProperty for storing data related to an individual when working with CCO?

rorudn commented 4 years ago

Toni, thanks for this. It's a great question. The short answer is no, the Entity -> ICE -> IBE -> data value pattern is not intended as being required for every piece of information in a domain but it is the recommended pattern for handling literal values.

Using CCO doesn't require that you designate a temporal region with a timestamp. For that matter, the CCO doesn't require that instance level events are related to temporal regions, but if they are so related, then as you describe it would be accomplished by using an instance of a BFO Temporal Region (or one of its subtypes) as the object of this connection:

event_x occurs_on temporal_region_y.

This might be enough for some applications. For example, you can search for all the events that occur on the day of the moon landing without using a date literal:

SELECT ?event WHERE { ?event_x rdf:type cco:MoonLanding . ?event_x cco:occurs_on ?temporal_region . ?event cco:occurs_on ?temporal_region . }
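
The same lookup can be mimicked on a toy in-memory triple set to make the idea concrete. This is only a minimal sketch: the individuals (apollo11_landing, tv_broadcast, day_1, day_2) are made up for illustration, predicates are plain strings rather than real CCO IRIs, and a real deployment would run the SPARQL query against a triple store instead.

```python
# Toy triple set; all names are illustrative assumptions, not real CCO IRIs.
TRIPLES = {
    ("apollo11_landing", "rdf:type", "MoonLanding"),
    ("apollo11_landing", "occurs_on", "day_1"),
    ("tv_broadcast", "occurs_on", "day_1"),
    ("unrelated_event", "occurs_on", "day_2"),
}


def events_sharing_region(triples, event_type):
    """Events occurring on the same temporal region as any event of event_type."""
    typed = {s for s, p, o in triples if p == "rdf:type" and o == event_type}
    regions = {o for s, p, o in triples if p == "occurs_on" and s in typed}
    # Like the SPARQL query, the result includes the typed event itself.
    return {s for s, p, o in triples if p == "occurs_on" and o in regions}


same_day = events_sharing_region(TRIPLES, "MoonLanding")
print(sorted(same_day))
```

No date literal appears anywhere in the data: the join happens purely on the shared temporal region node.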

If you want to search for the event by date string or use that string as input to a temporal reasoner, then you have to put it somewhere in your knowledge base. Let me try to soften the objections you mention about the options that CCO provides (i.e. IBE and is_tokenized_by). It doesn't seem too unrealistic to suggest that days are designated by ICEs, which are in turn carried by IBEs. After all, if you've chosen a day, then you've chosen a calendar (Gregorian, Julian), and calendars have date IBEs as parts. If the number of "hops" is still bothersome, you could create a property chain to shorten it (temporal_region designated_by ICE generically depends on IBE -> temporal_region identified_by IBE). As for the is_tokenized_by annotation, couldn't you type the annotation value as an xsd:dateTime, which might help keep that fact and allow you to continue to reason about it?
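
To illustrate what the property chain mentioned above buys you, here is a minimal sketch that materializes the shortcut over a toy triple set. In OWL 2 this kind of inference would normally be declared with owl:propertyChainAxiom and left to a reasoner; the Python below only imitates the effect. The individuals (day_1, date_ice, calendar_entry_ibe) and the string predicates are assumptions for illustration, and identified_by is the shortcut name proposed in this comment, not an existing CCO property as far as this sketch is concerned.

```python
def apply_property_chain(triples, first, second, shortcut):
    """Infer (s, shortcut, o) for every path s -first-> m -second-> o."""
    second_hops = {}
    for s, p, o in triples:
        if p == second:
            second_hops.setdefault(s, set()).add(o)
    return {
        (s, shortcut, target)
        for s, p, middle in triples
        if p == first
        for target in second_hops.get(middle, ())
    }


# Individuals below are hypothetical placeholders.
triples = {
    ("day_1", "designated_by", "date_ice"),
    ("date_ice", "generically_depends_on", "calendar_entry_ibe"),
}
inferred = apply_property_chain(
    triples, "designated_by", "generically_depends_on", "identified_by"
)
print(inferred)
```

Queries can then use the single identified_by hop while the full ICE and IBE structure stays in the data.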

If none of these are convincing and you still want to use a datatype property, then my advice is to limit yourself to using datatype properties that have a minimum of implicit content. So while the use of has_datetime_value as in the following is OK:

person_x participates_in birth_event_x has_datetime_value 1969-07-21

to shorten this further by using has_birthdate is not:

person_x has_birthdate 1969-07-21

The reason is that the event has become embedded in the relation and can't be linked to other participants or locations, which makes integration with other data problematic.
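
A small sketch may make the difference tangible. It compares the two shapes on toy triple sets: with an explicit event node, other facts can attach to the birth; with has_birthdate there is nothing to join on. The names occurs_at, hospital_y and midwife_z are hypothetical additions for illustration, and plain strings stand in for real IRIs.

```python
# Explicit event: the birth is a node that other facts can attach to.
# "occurs_at", "hospital_y" and "midwife_z" are hypothetical additions.
explicit = {
    ("person_x", "participates_in", "birth_event_x"),
    ("birth_event_x", "has_datetime_value", "1969-07-21"),
    ("birth_event_x", "occurs_at", "hospital_y"),
    ("midwife_z", "participates_in", "birth_event_x"),
}

# Shortcut: the event is hidden inside the relation name.
shortcut = {("person_x", "has_birthdate", "1969-07-21")}


def co_participants(triples, person):
    """Other participants in any event the person participates in."""
    events = {o for s, p, o in triples if s == person and p == "participates_in"}
    return {
        s for s, p, o in triples
        if p == "participates_in" and o in events and s != person
    }


print(co_participants(explicit, "person_x"))  # finds midwife_z
print(co_participants(shortcut, "person_x"))  # empty: no event node to join on
```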

I hope this was helpful and if not, please post a follow up.

tonicebrian commented 4 years ago

Thanks for the response @rorudn, it is clearer to me now. Just a couple of questions. When you say "...but it is the recommended pattern for handling literal values", isn't every piece of information a primitive literal value (int, string, datetime...) at some point? I understand the concept, but I spend too much time deciding whether a "data chunk" should be modelled directly or through the conceptualisation in CCO.

Regarding "After all if you've chosen a day, then you've chosen a calendar (Gregorian, Julian) and calendars have date IBE's as parts": agreed. In fact this is more or less what the OWL Time ontology does for describing dates (see picture: [image]).

Here I'm thinking more about performance and storage in the triple store, so I'm trying to avoid extra storage or query processing when there is no clear benefit. Even if I can define an IBE that holds the value 1969 denoting the year part of a Gregorian date, it is less useful because there won't be any other IBE holding a value for a year in the Gregorian calendar representing what happened in the year 1969 in the real world. So my approach would be to collapse those indirections into a single individual that is both an ICE and a time:DateTimeDescription. The modelling would then be something like the following (I'm omitting intentionally stating the data properties of the DateTimeDescription in order to keep examples short):

:theDayArmstrongLandedInTheMoon a cco:Day .
:moonLanding a cco:ActOfMotion .
:1969-07-21 a cco:InformationContentEntity, time:DateTimeDescription .

:moonLanding cco:occurs_on :theDayArmstrongLandedInTheMoon .
:theDayArmstrongLandedInTheMoon cco:designated_by :1969-07-21 .

But then I keep thinking about it, and I see that there is room for more compaction. In the modelling I'm doing here there isn't any "Information Content"; what there is is a specific time, and everyone agrees on what "July 21st, 1969 according to the Gregorian calendar" means. So I could have:

:moonLanding a cco:ActOfMotion .
:1969-07-21 a cco:Day, time:DateTimeDescription .

:moonLanding cco:occurs_on :1969-07-21 .

With this modelling I see two benefits: first, the entity :1969-07-21 is embedded in CCO, so I can use the properties in TimeOntology.ttl; second, for displaying information and/or querying specific fields I can use the datatype properties defined for DateTimeDescription in the Time Ontology.

I understand that by removing indirections I constrain myself to not being able to assert facts about ICEs or IBEs, but in this particular situation, working with "pure" facts decreases storage, makes queries easier to write and understand, and alleviates reasoning pressure on the triple store.

So, long story short, do you see any drawbacks to the last approach I've presented for modelling the date on which Armstrong landed on the Moon?

APCox commented 4 years ago

Yes, there are drawbacks. In the end, your approach gives you:

<X> instance_of cco:ActOfMotion cco:occurs_on <Y> instance_of cco:Day (and instance_of time:DateTimeDescription)

In other words, you've either lost or hidden almost all of the "pure facts" as you're calling them. This representation will be semantically equivalent to the representation of every other motion process that occurs on a day in your triple store. The only difference will be the IRIs of the <X>s and <Y>s.

Please note that this semantic loss is not specifically due to the decision to not use ICEs or IBEs; rather, it is primarily due to the fact that you: a) apparently no longer have literal values in your triple store and b) definitely do not have ontological representations of what the specific event is, where it is occurring, who or what is involved, how these participants are involved, etc.

I'm assuming that your plan is to bury some or all of this content in the IRIs of the 2 entities in your triple store. You can certainly do this if you want and you can even make it work for a limited use case; however, it is going to become a nightmare if you attempt to significantly scale up the number of entities you put in your triple store.

Based on your concerns about avoiding bloat in the triple store, I'm assuming that you do in fact intend to populate it with a great many such entities -- perhaps on the order of billions of triples. If this is not the case, and you construct your triples well, you should have minimal to no performance issues when using the recommended more robust CCO representations. If, instead, you are working with a huge number of triples (or even just a modest number) and have chosen to implement the bare-bones approach you describe, you are going to have a terrible time getting meaningful content back out. In particular, you will need to rely on string matching to get content out of the entity IRIs (or out of the literals if you have included them). This can certainly be done in SPARQL, but SPARQL only has limited string matching functionality built in and using REGEX in SPARQL will definitely cause performance issues in a large triple store.
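
The contrast between the two retrieval styles can be sketched on toy data. This is only an illustration under stated assumptions: the ex: IRIs and the day_a/day_b/day_c names are invented, Python's re module stands in for SPARQL REGEX, and nothing here measures real triple store performance. It only shows that content buried in IRIs has to be parsed back out, while typed values can be compared directly.

```python
import re
from datetime import date

# Option A: the date exists only inside the IRI, so it must be parsed back out.
iris = ["ex:1969-07-21", "ex:1969-07-22", "ex:1970-01-01", "ex:moonLanding"]
IRI_DATE = re.compile(r"^ex:(\d{4})-(\d{2})-(\d{2})$")


def july_1969_from_iris(iris):
    hits = []
    for iri in iris:
        m = IRI_DATE.match(iri)
        if m and m.group(1) == "1969" and m.group(2) == "07":
            hits.append(iri)
    return hits


# Option B: the date is stored as a typed value next to an opaque IRI.
day_values = {
    "ex:day_a": date(1969, 7, 21),
    "ex:day_b": date(1969, 7, 22),
    "ex:day_c": date(1970, 1, 1),
}


def days_between(day_values, start, end):
    return sorted(iri for iri, d in day_values.items() if start <= d <= end)


from_iris = july_1969_from_iris(iris)
from_values = days_between(day_values, date(1969, 7, 1), date(1969, 7, 31))
print(from_iris, from_values)
```

Option A also breaks as soon as the IRI naming convention changes, whereas Option B depends only on the stored values.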

Perhaps I'm incorrect in my understanding that your proposed representation includes no literals. In this case, you must be using custom data properties to link the literals to the 2 individuals in your example. This approach is taken by many other graph database users and, while it does in your case improve the semantic content and query-ability of your triple store, it has significant limitations. As Ron mentioned previously, it buries semantic content, limits the expressiveness of your representation, and impedes data integration.

That being said, the specifics of your use case may allow you to use your proposed minimalist representation schema without any drawbacks, provided of course that your use case does not change.

tonicebrian commented 4 years ago

Actually I do plan to use data properties, but when presenting the example I noted "I'm omitting intentionally stating the data properties of the DateTimeDescription in order to keep examples short". For this particular example, a more complete statement of the last approach would be:

:moonLanding a cco:ActOfMotion .
:1969-07-21 a cco:Day, time:DateTimeDescription ;
                      time:year "1969"^^xsd:gYear ;
                      time:month "--07"^^xsd:gMonth ;
                      time:day "---21"^^xsd:gDay ;
                      time:hasTRS <http://www.opengis.net/def/uom/ISO-8601/0/Gregorian> .

:moonLanding cco:occurs_on :1969-07-21 .

or, if I were using time:Instant instead:

:moonLanding a cco:ActOfMotion .
:1969-07-21 a cco:Day, time:Instant ;
                      time:inXSDDateTimeStamp "1969-07-21T02:56:00Z"^^xsd:dateTimeStamp .

:moonLanding cco:occurs_on :1969-07-21 .

How the data properties are actually stored doesn't matter for the question here. So no, there is no string matching on the IRIs, since that would be worse for performance.
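
As a rough check that the two encodings above describe the same calendar day, the literal values can be read back with the Python standard library. This is a minimal sketch, not a general XSD parser: it assumes a positive year, no timezone suffix on the gYear/gMonth/gDay values, and it takes the day of the timestamp in the timestamp's own offset (UTC here). The function names are made up for illustration.

```python
from datetime import date, datetime


def day_from_description(year, month, day):
    """Combine xsd:gYear, xsd:gMonth and xsd:gDay lexical forms into a date.

    Assumes a positive year and no timezone suffix on any of the values.
    """
    return date(int(year), int(month.lstrip("-")), int(day.lstrip("-")))


def day_from_timestamp(stamp):
    """Calendar day (in the stamp's own offset) of an xsd:dateTimeStamp string."""
    # %z accepts a literal "Z" for UTC on Python 3.7+.
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S%z").date()


from_description = day_from_description("1969", "--07", "---21")
from_timestamp = day_from_timestamp("1969-07-21T02:56:00Z")
print(from_description == from_timestamp)
```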

Maybe the word "pure" was poorly chosen here. A longer description might be "concepts in a domain for which everyone in that domain agrees on the unique entity it points to". I would say these are concepts that could be considered "closed world" in a domain: people could only say one thing about them.

In the example I'm presenting here, :1969-07-21 uniquely represents the day in history on which all these events happened, and thus (for me) it is less useful in a domain of historical events to represent it through an ICE (information content of what? a day exists even if no one records it) and through IBEs (even if I create an ICE and attach IBEs to the different parts of the date description, in this domain there will always be a strict 1-to-1 correspondence between a particular IBE and a particular ICE).

For your concern on:

b) definitely do not have ontological representations of what the specific event is, where it is occurring, who or what is involved, how these participants are involved, etc.

I do: the event itself is :moonLanding, and I can attach information about when it occurs (the example here), who the agent was (:NeilArmstrong), etc. The only thing that I'm losing here is that I won't be able to state facts about just the date, like

:1969-07-21 cco:is_a_measurement_of :NASA, :ESA .

because :1969-07-21 won't be an ICE and because in this domain and this particular example I don't care about :NASA and :ESA sanctioning the existence of an astronomical particular day, a day in a historical domain just is.

Just as a side note, it would be interesting if, as part of the CCO project, a complete dataset using CCO from a complex domain could be published, to see how others are fitting data to the conceptualisation provided by CCO.