MIxS as properties versus classes

ramonawalls commented 4 years ago

We discussed this in detail at the call on 4/13/20 (https://docs.google.com/document/d/1rY9yVRsASXthhY2CeBqDOnM4fsRGA4YBs_QHYAlYB1w/edit#). The open question is whether MIxS terms should be classes, object properties, or data properties. Each has some advantages. Regardless of which we choose, many people will still choose to treat MIxS data as key:value pairs, which is more or less how it is treated now.

Classes:

Map to existing classes like in ENVO
Consistent with some ontology models where qualities are classes, such as the model used in BCO. (Need to check out the model being developed by @jamesaoverton.)

Object properties

Objects would have to be specified for different sets of properties, e.g., some would have an instance measurement value as the object, some have text.
Consistent with Darwin Core iri namespace

Data properties

Expect only literals as the value, so doesn't work well with properties where an instance would be expected.
Consistent with Darwin Core terms

Other options include making them annotation properties, or making them a mix of data and object properties.
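
To make the options concrete, here is a minimal Turtle sketch of one depth value under each of the three main modelings. The `mixs:` and `ex:` namespace IRIs, the `mixs:Depth` class name, and the `ex:hasCharacteristic`, `ex:hasValue`, and `ex:hasUnit` properties are placeholders invented for illustration, not agreed terms. The three blocks are alternatives and would not be combined in one graph.

```turtle
@prefix mixs: <https://example.org/mixs/> .   # placeholder namespace (assumption)
@prefix ex:   <https://example.org/data/> .   # placeholder namespace (assumption)
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# Option 1, data property: the value is a literal.
ex:sample1 mixs:depth "0.5"^^xsd:decimal .

# Option 2, object property: the value is a node that can carry a unit.
ex:sample2 mixs:depth [
    ex:hasValue "0.5"^^xsd:decimal ;
    ex:hasUnit  ex:metre
] .

# Option 3, class: the term is a class and the sample points to an instance of it.
ex:sample3 ex:hasCharacteristic [
    a mixs:Depth ;
    ex:hasValue "0.5"^^xsd:decimal ;
    ex:hasUnit  ex:metre
] .
```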

Those on the call were leaning toward classes. Need to follow up with the INSDCs to see if the choice will impact their workflows.

@jdeck88 and Jasper (can't find github name), any thoughts?

jdeck88 commented 4 years ago

I would lean towards data properties over classes. The reason is that the terms, as with DwC terms, have been developed and implemented outside of a logical framework for a long time, so we have a history of use that is more consistent with their application as data properties. My proposal for terms that need to be used as classes would be to elevate a select set of terms into a new namespace where they could be treated as classes. This is somewhat similar to the application of the dwciri: namespace.

jjkoehorst commented 4 years ago

Sorry that I missed the meeting when discussing RDF. I think it's a great idea and I would advise to use mappings when possible, as the number of slightly different spellings when using (string-based) literals can easily explode. It might also be worthwhile to draft a shex schema which can then be used to validate the generated RDF?

only1chunts commented 4 years ago

Surely it has to be a mixture of data and object properties that are parts of various Classes. For example, the term "name" is not going to be a class; it has to be a Data Property of the Class "Person" or some such high level class. I guess to echo @jdeck88, we tend to use everything as data properties currently, so it's effectively already that, with everything in 1 giant class called "bucket". So either elevating or creating some select set of Classes to help structure the data properties would be a sensible thing to do.

cmungall commented 4 years ago

@jdeck88:

I would lean towards data properties over classes

How would you handle a property such as biome, for which the value may be an ontology class? What about altitude? Is the value a string literal ("3m") or is it an object (blank node, has-unit=m, has-value=3.0)?

I think what you are saying is that the data is always so messy that we may as well just store everything as strings. Seems to defeat the point of using a semantic framework to begin with. And if we want to devise a formalism for exchanging properly structured data we have to roll something different. Maybe I misunderstand?

all in favor of either OP or DP or some mix:

What about when values are ontology classes? If p is an OP then this induces punning:

:sample123 :p ENVO:456

is this our intent?

or is the intent to create blank nodes such as

:sample123 :p [a ENVO:456]

@only1chunts:

Surely it has to be a mixture of data and object properties that are parts of various Classes

not sure what you mean by parts of Classes..?

@jjkoehorst:

I think it's a great idea and I would advise to use mappings when possible, as the number of slightly different spellings when using (string-based) literals can easily explode

sorry, not following....

It might also be worthwhile to draft a shex schema which can then be used to validate the generated RDF

great point. I think this discussion should be a bit more driven by use cases like validation. shex makes a lot of sense for validation because it is closed world. rdf/owl is great for open world reasoning, but it isn't going to complain if a minCardinality=1 field doesn't have a value
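
A minimal sketch of the closed-world point. It uses SHACL rather than ShEx, only because SHACL shapes can be written in Turtle like the other snippets in this thread; a ShEx schema would express the same constraint. The `ex:` and `mixs:` namespace IRIs, `ex:Sample`, and the choice of `mixs:depth` as the required field are assumptions made for illustration.

```turtle
@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix mixs: <https://example.org/mixs/> .   # placeholder namespace (assumption)
@prefix ex:   <https://example.org/data/> .   # placeholder namespace (assumption)

# Shape: every ex:Sample must have at least one mixs:depth value.
ex:SampleShape a sh:NodeShape ;
    sh:targetClass ex:Sample ;
    sh:property [
        sh:path mixs:depth ;
        sh:minCount 1
    ] .

# Data: this sample has no depth at all.
ex:sample1 a ex:Sample .
```

A SHACL validator should report a minCount violation for ex:sample1. An OWL reasoner given the equivalent min-cardinality restriction would instead assume that some unstated depth value exists and report nothing.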

cmungall commented 4 years ago

My argument for classes echoes what @ramonawalls says, but also to emphasise some key points

We get to defer on whether a property is best modeled as a native string literal, a datatype literal, an ontology class, an instance, or a string literal that is intended to be parsed and treated as an object.

For example:

mixs:elevation a owl:Class .
...
gold:sample123 :has-characteristic [
  a mixs:elevation ;
  :has-string-value "200m +/- 1m" ;
  :has-measurement-value [
     :magnitude "200"^^xsd:float ;
     :has-unit PATO:nnn
  ] ;
  prov:providedBy "ORNL Identify tool" ;
  dc:date ...
  ...
]

This allows for the direct representation of a cell in a spreadsheet as a string literal, plus an object representation

it also naturally allows for all kinds of provenance and metadata on the assignment of the descriptor value to the sample. with a property this would have to be done with reification

on the negative side it is more verbose than a simple SAMPLE PROPERTY VALUE model. It makes certain kinds of checks and validation harder (but these may be harder than you think with rdf anyway)
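
For comparison, a sketch of what the property-based alternative with reification might look like, using the standard RDF reification vocabulary. `ex:providedBy` and the `mixs:` and `ex:` namespace IRIs are placeholders, not real terms.

```turtle
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix mixs: <https://example.org/mixs/> .   # placeholder namespace (assumption)
@prefix ex:   <https://example.org/data/> .   # placeholder namespace (assumption)

# The plain SAMPLE PROPERTY VALUE triple.
ex:sample123 mixs:elevation "200m +/- 1m" .

# Provenance has to hang off a separate node that describes that triple.
ex:assertion1 a rdf:Statement ;
    rdf:subject   ex:sample123 ;
    rdf:predicate mixs:elevation ;
    rdf:object    "200m +/- 1m" ;
    ex:providedBy "ORNL Identify tool" .
```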

jdeck88 commented 4 years ago

Responding to @cmungall: my point was more that even if we expressed data as literals (data properties), incoming data will be made far cleaner by clarifying our definitions and giving them stable URIs. @wdduncan brought up on the call that we could create two test frameworks where we express data as either Classes or Data properties and see how they behave in the wild, which is a good idea.

All this said, getting stable URIs for terms is paramount.

wdduncan commented 4 years ago

In https://github.com/GenomicsStandardsConsortium/mixs-rdf/tree/master/src/ontology I've created a property version and a classes version:

mixs_package_class.ttl is the class version
mixs_package_dp.ttl is the data properties version.

An object properties version can (of course) be created too.

wdduncan commented 4 years ago

I'm leaning towards classes. Reasons:

(Suggested by Pier) Classes allow developers to create instances (or individual datums) of mixs:classes that are tied to a particular data source. For example, an instance of mixs:depth that is found in spreadsheet X is distinct from the instance of mixs:depth that is found in spreadsheet Y. The same may apply to instances of packages.
Permits us to more easily create relations between a mixs:term and the packages it is found in. E.g.: mixs:depth dc:isPartOf mixs:water-package, mixs:soil-package etc. This example uses dc:isPartOf as an annotation property, but object property relations can be developed too.
More easily allows us to use OWL reasoning to maintain packages. Currently, terms are explicitly asserted as being a subClassOf/subPropertyOf some package (e.g., mixs:depth rdfs:subClassOf mixs:SoilPackage). But suppose we have a property mixs:package. Then we do not need to explicitly make the subtype axioms. The reasoner can figure this out for us. E.g. (an easy example): we can include GCI axioms (something) like:
mixs:term and mixs:package some mixs:SoilPackage => rdfs:subClassOf mixs:SoilPackage
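
One possible Turtle rendering of the GCI above, read literally. It assumes that mixs:term is a class and mixs:package is an object property, and it uses a placeholder IRI for the `mixs:` namespace; none of these are settled.

```turtle
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix mixs: <https://example.org/mixs/> .   # placeholder namespace (assumption)

mixs:package a owl:ObjectProperty .           # assumed to be an object property

# (mixs:term and (mixs:package some mixs:SoilPackage)) SubClassOf mixs:SoilPackage
[ a owl:Class ;
  owl:intersectionOf (
      mixs:term
      [ a owl:Restriction ;
        owl:onProperty mixs:package ;
        owl:someValuesFrom mixs:SoilPackage ]
  )
] rdfs:subClassOf mixs:SoilPackage .
```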

wdduncan commented 4 years ago

Sharing Chris' table that illustrates some advantages and disadvantages:

Criterion                            String DataProp   ObjProp   Class   Instance
Simplicity                           +++               ++        +       +
Ease of validation                   -                 ++        +       -
Allows per-value-instance metadata   -                 +         +       +
OBO-like                             ---               -         +++     +

ramonawalls commented 3 years ago

Based on recent task group calls, the consensus seems to be to use properties.

Regarding object vs. data properties, a recent demonstration of the EBI Biosamples validation process (and other discussions) shows that we need to be able to record both a string and a URL for many of the properties. Some of them will remain free text, so we will always be recording a string. We need a way, as @cmungall said, to "have our cake and eat it too".

wdduncan commented 3 years ago

If we model the terms as properties, I think we should use object properties. This allows us to more easily handle different strings that denote the same thing (i.e., fits with the "things not strings" mantra).
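
A small sketch of the difference, assuming a hypothetical mixs:biome property and a placeholder ex:marineBiome IRI standing in for an ontology term (all namespace IRIs here are invented for illustration):

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix mixs: <https://example.org/mixs/> .   # placeholder namespace (assumption)
@prefix ex:   <https://example.org/data/> .   # placeholder namespace (assumption)

# As strings: two spellings give two unrelated values.
ex:sampleA mixs:biome "marine biome" .
ex:sampleB mixs:biome "Marine Biome" .

# As an object property: both samples point at the same thing,
# and the label is recorded once on that thing.
ex:sampleC mixs:biome ex:marineBiome .
ex:sampleD mixs:biome ex:marineBiome .
ex:marineBiome rdfs:label "marine biome" .
```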

If we are planning to make use of OWL reasoning, we might want to think some more about modeling terms as classes or individuals. E.g., do we want to specify that certain packages are disjoint or that certain terms can only be members of certain packages? As far as I know, those don't seem to be relevant, but please chime in if I am mistaken.

Another advantage of using object properties in conjunction with biolinkml is that we can create json schema templates to do some data validation.

cmungall commented 3 years ago

I agree with @wdduncan that we should use object properties

I have sketched out how this could be done here:

https://docs.google.com/presentation/d/10j2dRtHnYZgspiNytaH9eVkQSi4Su0Pojt394uVItlo/edit#slide=id.p

This is using the mixs rendering in nmdc

https://github.com/microbiomedata/nmdc-metadata/tree/master/schema

ramonawalls commented 3 years ago

During the November meeting we agreed to use object properties, but the issue was never updated. We recognize that some applications will still want to use MIxS terms as data properties, but this can be handled with punning. See @cmungall's comment above for an implementation example.

GenomicsStandardsConsortium / mixs-rdf

MIxS as properties versus classes #9