semantic metadata module/extensions - Githubissues

NCEAS / eml

Ecological Metadata Language (EML)

https://eml.ecoinformatics.org/

GNU General Public License v2.0

semantic metadata module/extensions #25

Closed mbjones closed 5 years ago

mbjones commented 7 years ago

Author Name: Matt Jones (Matt Jones) Original Redmine Issue: 277, https://projects.ecoinformatics.org/ecoinfo/issues/277 Original Date: 2001-08-31 Original Assignee: Matt Jones

Need to extend EML, either by adding a new module or extending the current entity/attribute system, so that semantic metadata can be accommodated. Basically, this means being able to enter terms from an ontology (see bug 274) so that a particular data table attribute can be tied into the ontology. See the KDI proposal on canonical variables for more information.

mbjones commented 7 years ago

Original Redmine Comment Author Name: Matt Jones (Matt Jones) Original Date: 2004-09-02T16:38:17Z

Changing QA contact to the list for all current EML bugs so that people can track what is happening.

mbjones commented 7 years ago

Original Redmine Comment Author Name: Redmine Admin (Redmine Admin) Original Date: 2013-03-27T21:13:50Z

Original Bugzilla ID was 277

mbjones commented 7 years ago

Added new schema file eml-semantics.xsd for providing a new SemanticAnnotation type. Needs to be tested, reviewed, and incorporated into the other schemas.

mbjones commented 7 years ago

@mobb and @mpsaloha I wanted to bring this semantic extension for EML to your attention in particular. I'm just starting to think about how this would work, but for now I committed a new eml-semantics.xsd file with a SemanticAnnotation complexType in commit sha 1dacda89507c344a4e26c27b0f9d7df30b3ab21e in the EML 2.2 branch. My thought is that I will add optional annotation elements using the SemanticAnnotation type in key structures in EML, particularly in the following places:

[x] ResourceGroup group, to get all resource types, including /eml/dataset
[x] EntityGroup group in eml-entity.xsd (covers dataTable, otherEntity, spatialRaster, ...)
[x] AttributeType complexType in eml-attribute.xsd

I would appreciate your thoughts on this. You can view the xsd file in the branch: https://github.com/NCEAS/eml/blob/BRANCH_EML_2_2/xsd/eml-semantics.xsd

mbjones commented 7 years ago

@mobb, @amoeba, @mpsaloha, @csjx, @cboettig -- The new fields for populating semantic annotations are now present in the EML schemas in the BRANCH_EML_2_2, and I have linked them into three locations -- ResourceGroup, EntityGroup, and AttributeType. So, now you can add zero or more annotation fields to each of those structures in EML. We would typically be using the annotation in eml-attribute to attach OBOE-style annotations to the attributes in a data set. But you can also attach more general annotations to, for example, /eml/dataset and /eml/dataset/dataTable, which makes it broadly applicable as a semantic tagging module.

Could you please review, comment, and revise? Including the element and type documentation in the xsd files? Here's an excerpt from the eml-sample.xml document that shows the annotations in use:

<?xml version="1.0"?>
<eml:eml
    packageId="eml.1.1" system="knb"
    xmlns:eml="eml://ecoinformatics.org/eml-2.2.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:stmml="http://www.xml-cml.org/schema/stmml-1.1"
    xsi:schemaLocation="eml://ecoinformatics.org/eml-2.2.0 ../../../xsd/eml.xsd">

<dataset>
  <title>Data from Cedar Creek LTER on productivity and species richness
  for use in a workshop titled "An Analysis of the Relationship between
  Productivity and Diversity using Experimental Results from the Long-Term
  Ecological Research Network" held at NCEAS in September 1996.</title>
  <creator id="clarence.lehman">
    <individualName>
      <salutation>Mr.</salutation>
      <givenName>Clarence</givenName>
      <surName>Lehman</surName>
    </individualName>
    ...
  </creator>
  ...
  <keywordSet>
    <keyword>Old field grassland</keyword>
    <keyword>biomass</keyword>
    <keyword>productivity</keyword>
    <keyword>species-area</keyword>
    <keyword>species richness</keyword>
  </keywordSet>
  <annotation>
      <termURI>http://purl.obolibrary.org/obo/ENVO_01000177</termURI>
      <termLabel>grassland biome</termLabel>
  </annotation>
  <contact>
    <references>clarence.lehman</references>
  </contact>
  <contact>
    <references>richard.inouye</references>
  </contact>
  <dataTable id="xyz">
    <entityName>CDR LTER-patterns among communities.txt</entityName>
    <entityDescription>patterns among communities at CDR</entityDescription>
    <physical>
        ...
    </physical>
    <annotation>
        <termURI>http://purl.obolibrary.org/obo/ENVO_00000260</termURI>
        <termLabel>prairie</termLabel>
    </annotation>
    <attributeList id="at.1">
      ...
      <attribute id="att.12">
        <attributeName>biomass</attributeName>
        <attributeLabel>Biomass</attributeLabel>
        <attributeDefinition>The total biomass measured in this field
        </attributeDefinition>
        <storageType>float</storageType>
        <measurementScale>
          <ratio>
            <unit><customUnit>gramsPerSquareMeter</customUnit></unit>
            <precision>0.01</precision>
            <numericDomain id="nd.6">
              <numberType>real</numberType>
              <bounds>
                <minimum exclusive="true">0</minimum>
              </bounds>
            </numericDomain>
          </ratio>
        </measurementScale>
        <annotation>
            <termURI>http://ecoinformatics.org/oboe/oboe.1.2/oboe-characteristics.owl#Mass</termURI>
            <termLabel>Mass</termLabel>
        </annotation>
        <annotation>
            <termURI>http://ecoinformatics.org/oboe/oboe.1.2/oboe-standards.owl#Kilogram</termURI>
            <termLabel>Kilogram</termLabel>
        </annotation>
        <annotation>
            <termURI>http://example.com/example-vocab-1.owl#PlantSample</termURI>
            <termLabel>Plant Sample</termLabel>
        </annotation>
      </attribute>
...
    </attributeList>
    <caseSensitive>no</caseSensitive>
    <numberOfRecords>22</numberOfRecords>
  </dataTable>
</dataset>
<additionalMetadata>
<metadata>
<stmml:unitList xmlns:stmml="http://www.xml-cml.org/schema/stmml-1.1"
    xsi:schemaLocation="http://www.xml-cml.org/schema/stmml-1.1 ../../../xsd/stmml.xsd">
    <!--note that the unitTypes here are taken from the eml-unitDictionary.xml-->
    <stmml:unit name="gramsPerSquareMeter" unitType="arealMassDensity" id="gramsPerSquareMeter" parentSI="kilogramsPerSquareMeter" multiplierToSI=".001"/>
    <stmml:unit name="speciesPerSquareMeter" unitType="arealDensity" id="speciesPerSquareMeter" parentSI="numberPerSquareMeter" multiplierToSI="1"/>
  </stmml:unitList>
  </metadata>
</additionalMetadata>
</eml:eml>
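
As a rough illustration of how a client might read these annotations, the Python sketch below walks a trimmed-down copy of the excerpt and records each annotation together with the element it is attached to. The `SAMPLE` fragment and the `collect_annotations` helper are assumptions made for illustration, and the `eml:` namespace on the root is dropped to keep the fragment small.

```python
import xml.etree.ElementTree as ET

# Trimmed, un-namespaced stand-in for the excerpt above (illustrative only).
SAMPLE = """
<eml>
  <dataset>
    <annotation>
      <termURI>http://purl.obolibrary.org/obo/ENVO_01000177</termURI>
      <termLabel>grassland biome</termLabel>
    </annotation>
    <dataTable id="xyz">
      <attributeList>
        <attribute id="att.12">
          <annotation>
            <termURI>http://ecoinformatics.org/oboe/oboe.1.2/oboe-characteristics.owl#Mass</termURI>
            <termLabel>Mass</termLabel>
          </annotation>
        </attribute>
      </attributeList>
    </dataTable>
  </dataset>
</eml>
"""


def collect_annotations(xml_text):
    """Return (parent tag, parent id, termURI, termLabel) for each annotation."""
    root = ET.fromstring(xml_text)
    found = []
    # ElementTree has no parent pointers, so visit every element and
    # look only at its direct <annotation> children.
    for parent in root.iter():
        for ann in parent.findall("annotation"):
            found.append((
                parent.tag,
                parent.get("id"),
                ann.findtext("termURI"),
                ann.findtext("termLabel"),
            ))
    return found


for row in collect_annotations(SAMPLE):
    print(row)
```

Keeping the parent tag and id next to each annotation matters because the parent is the thing the annotation is about.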

mbjones commented 6 years ago

Reviewed field definitions with @mobb and @mpsaloha last week; some discussion ensued about the role of termLabel and whether it must be constrained to a value chosen from the labels present in the definition found at termURI. I'm not sure how to accomplish that, given that 1) how labels are resolved will vary for the different types of controlled vocabularies, 2) different vocabularies have different requirements for labels, and 3) labels may be optional in some vocabularies. The argument for termLabel to be drawn from the vocabulary is that it prevents people from minting new URI-label mappings without adding them to the vocabulary. The hard part is that many people may not have write access to the vocabulary, and so changes there may be impossible.

For example, let's take a hypothetical term for soil with a termURI of ex:Soil and an rdfs:label of soil. If the user wants to use Boden, the German word for soil, as their label in EML documents, they could not do so unless they had the ability to modify the vocabulary, which would require a new release of the vocabulary. Is that too limiting? Or would allowing it let people informally redefine or misuse terms? Let's discuss.
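
To make the constrained-label idea concrete, here is a minimal Python sketch of what such a check could look like. The `VOCAB_LABELS` table, the `http://example.com/vocab#Soil` expansion of ex:Soil, and the `label_is_from_vocabulary` helper are all hypothetical; a real check would have to fetch labels from the vocabulary itself, which is exactly the part that varies between vocabularies.

```python
# Hypothetical vocabulary: term URI -> labels keyed by language tag.
VOCAB_LABELS = {
    "http://example.com/vocab#Soil": {"en": "soil"},
}


def label_is_from_vocabulary(term_uri, term_label, vocab=VOCAB_LABELS):
    """True if term_label is one of the labels the vocabulary defines for term_uri."""
    labels = vocab.get(term_uri, {})
    return term_label in labels.values()


# The English label is accepted because the vocabulary defines it.
print(label_is_from_vocabulary("http://example.com/vocab#Soil", "soil"))
# The German label is rejected until someone adds it to the vocabulary.
print(label_is_from_vocabulary("http://example.com/vocab#Soil", "Boden"))
```

Under this strict rule, the Boden case fails until the vocabulary maintainers publish a German label, which is the limitation raised above.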

cboettig commented 6 years ago

So wikidata tries to get around this issue of a language bias in semantic properties by referring to all their "entities" with opaque identifiers that can then be mapped to properties expressed in the user's native language. This can be somewhat cumbersome of course; see this open thread on how to handle this in JSON-LD: https://github.com/schemaorg/schemaorg/issues/1186.

(The codemeta map to wikidata properties is thus technically wrong, since we map to english property names like "operating system", which is really the property https://www.wikidata.org/wiki/Property:P306)

mbjones commented 6 years ago

@cboettig Yeah, that's what we do with the term URIs as well, following the well-established approaches used by the OBO Foundry. Thus, our termURI here should be opaque and non-semantic much of the time, thus the need for a termLabel to help with contextual display. As I outlined in the docs for termURI and termLabel (https://github.com/NCEAS/eml/blob/BRANCH_EML_2_2/xsd/eml-semantics.xsd#L74), clients could and should dereference the termURI to get additional information useful for display, including a wide variety of labels, the term definition, examples, and other metadata. But, I thought it important that the EML document should have at least one human-readable label included just in case the termURI turns out to be non-resolvable sometime in the future. You know, just in case 😉 . Mark is arguing that including the label is a problem and should be excluded, or at least constrained, and thus this discussion. I am arguing that the EML document should be at least moderately self-contained. It boils down to 'is linked data here to stay' and 'will all of these termURI fields be resolvable in 10-20 years'? If yes, we can omit the label. If no, then the termURI is pretty much useless without the label. Your 2 cents appreciated.

cboettig commented 6 years ago

Thanks, I think I see the question better now. It does seem like, as a general principle, linked data documents should nevertheless try to be as self-contained as possible.

Still, it sounds like you're saying that the de-referenced termURI already defines a "wide variety of labels", surely that list is thus the controlled vocabulary of possible labels? Seems like using the URI + one of those acceptable labels would be nice.

I'm not sure that it's any help, but I find the solution proposed for displaying that information in wikidata compelling: it recommends that (compacted) property labels be used in the metadata document and mapped to (fully expanded) identifiers in the context file. In this way, you get a human-readable label, but it does not make the semantic annotation any more verbose, since the label is just an expandable shorthand for the full URI and thus doesn't change the semantics at all, rather than an additional semantic property that may or may not be an accurate / acceptable term.

Note that it doesn't require the URIs to all the individual properties to be resolvable; only the context file itself. Also in this approach, someone can always map their own natural language term to the full URI, but only by explicitly extending their context.
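
A small Python sketch of the context idea follows. It is a simplified stand-in, not the JSON-LD expansion algorithm: the `CONTEXT` dictionary, the `expand` helper, and the "Betriebssystem" key are assumptions for illustration, and only the P306 property URL comes from the discussion above.

```python
P306 = "https://www.wikidata.org/wiki/Property:P306"

# Hypothetical context: compact, human-readable key -> full identifier.
CONTEXT = {"operatingSystem": P306}


def expand(document, context):
    """Replace each compact key with the identifier the context maps it to."""
    expanded = {}
    for key, value in document.items():
        if key not in context:
            # A label that the context does not define has no meaning here.
            raise KeyError(f"no mapping for {key!r} in context")
        expanded[context[key]] = value
    return expanded


print(expand({"operatingSystem": "Linux"}, CONTEXT))

# A user can add a label in their own language only by extending the context.
german_context = {**CONTEXT, "Betriebssystem": P306}
print(expand({"Betriebssystem": "Linux"}, german_context))
```

Both documents expand to the same identifier, so the choice of label changes the display but not the semantics.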

cboettig commented 6 years ago

Speaking of explicit information, the predicate and class are currently only implicit in the URIs too. Consider adding these explicitly, e.g., in <annotation>?

Maybe something like:

<attribute id = "att.12">
       ...
        <annotation>
           <property>oboe:Characteristic</property>
            <termURI>http://ecoinformatics.org/oboe/oboe.1.2/oboe-characteristics.owl#Mass</termURI>
            <termLabel>Mass</termLabel>
            <class>http://ecoinformatics.org/oboe/oboe.1.2/oboe-characteristics.owl#Amount</class>
        </annotation>
        ...

Perhaps everything except the URI can be optional, but would in general be nice to have.

mbjones commented 6 years ago

@cboettig and I had a productive conversation on slack today, which resulted in agreement that we need to add a predicate/property for each annotation. While the subject of the annotation is clear from the context (albeit it may be hard to derive a URI for the subject), the predicate is ambiguous. I think Carl's proposal to add a property element is the right direction, but we may need to use the full property URI rather than the prefixed (and more readable) version (e.g., oboe:Characteristic), as we have no formal way to know how to dereference prefixes in this context. I guess we could also add a namespace declaration that lets one define oboe as a prefix for namespacing, or we could state that such namespaces should be defined as XML namespace prefixes, even though that's technically a different scope. If we did allow prefixes, then presumably they could be applied to termURI as well. So, I guess the question is which of the following is valid:

Option 1: Full URIs only

<attribute id = "att.12">
       ...
        <annotation>
            <propertyURI>http://ecoinformatics.org/oboe/oboe.1.2/oboe-characteristics.owl#Characteristic</propertyURI>
            <termURI>http://ecoinformatics.org/oboe/oboe.1.2/oboe-characteristics.owl#Mass</termURI>
            <termLabel>Mass</termLabel>
        </annotation>
        ...

In this case, propertyURI is added as a required field of type xsd:anyURI.

Option 2: prefixed URIs with XML namespace declaration

<attribute id = "att.12" xmlns:oboe="http://ecoinformatics.org/oboe/oboe.1.2/oboe-characteristics.owl#">
       ...
        <annotation>
            <propertyURI>oboe:Characteristic</propertyURI>
            <termURI>oboe:Mass</termURI>
            <termLabel>Mass</termLabel>
        </annotation>
        ...

In this case, propertyURI is added as a required field of type xsd:anyURI, but the URI is represented in a prefixed form that the XML processor wouldn't truly understand. Of course, the xmlns:oboe namespace could be declared at any scope above the current element, including the root of the XML document, which would make the whole thing more readable. The XML processor would not, however, know that the element was anything more than PCDATA with an xsd:anyURI type -- for example, I wouldn't expect the XML processor to detect that this is a QNAME in the XML sense. So, really the namespace here is declared outside of the XSD document and there's no way to validate with the XSD processor that a proper xmlns was declared. Unless maybe propertyURI can be typed as a QNAME; I'll have to look into that, which is addressed in https://www.w3.org/2001/tag/doc/qnameids-2004-01-14.html .
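
To show what an application would have to do on its own under Option 2, here is a hedged Python sketch that collects the namespace declarations while parsing and then expands the prefixed values by simple concatenation. The `OPTION_2` fragment and the `expand_prefixed` helper are illustrative assumptions, concatenating namespace and local name is only one possible algorithm, and the sketch ignores namespace scoping by treating every declaration as document-wide.

```python
import io
import xml.etree.ElementTree as ET

OPTION_2 = """
<attribute id="att.12"
    xmlns:oboe="http://ecoinformatics.org/oboe/oboe.1.2/oboe-characteristics.owl#">
  <annotation>
    <propertyURI>oboe:Characteristic</propertyURI>
    <termURI>oboe:Mass</termURI>
    <termLabel>Mass</termLabel>
  </annotation>
</attribute>
"""


def expand_prefixed(xml_text):
    """Expand prefix:local element content using declared XML namespaces."""
    prefixes = {}
    expanded = {}
    source = io.BytesIO(xml_text.encode("utf-8"))
    # "start-ns" events report (prefix, uri) pairs before the declaring element.
    for event, payload in ET.iterparse(source, events=("start-ns", "end")):
        if event == "start-ns":
            prefix, uri = payload
            prefixes[prefix] = uri
        elif payload.tag in ("propertyURI", "termURI"):
            prefix, _, local = payload.text.partition(":")
            # Assumption: the full URI is namespace + local name.
            expanded[payload.tag] = prefixes[prefix] + local
    return expanded


print(expand_prefixed(OPTION_2))
```

Nothing in the XSD validation step performs this expansion, so every consumer of the document would need equivalent logic.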

Option 3: prefixed URIs with element for XML namespace declaration

<attribute id = "att.12">
       ...
        <annotation>
            <namespace prefix="oboe">http://ecoinformatics.org/oboe/oboe.1.2/oboe-characteristics.owl#</namespace>
            <propertyURI>oboe:Characteristic</propertyURI>
            <termURI>oboe:Mass</termURI>
            <termLabel>Mass</termLabel>
        </annotation>
        ...

The same issues apply here as in option 2, but it's less convenient than having xmlns declared in the XML document. It has the advantage that namespace can be a required field, and so the document can validate that all of the info needed to form the URI is present. But it means namespace would be declared many times (unless we shifted to some sort of cumbersome key/keyref solution).

I think I lean towards option #2.

Other issues -- the class element

Finally, @cboettig, I'm not sure what you are intending with the class element in your example. Could you elaborate? The class of the termURI is determined by its definition in its ontology, and so I'm unclear what you are intending here (especially because oboe:Mass and oboe:Amount are disjoint classes, so I really don't understand the example).

Call for feedback from @mobb, @mpsaloha, and @csjx among others! Feedback appreciated.

cboettig commented 6 years ago

Thanks Matt for the much better summary. Option 1 certainly feels "safest" and by far the least cumbersome to deal with from a programmatic standpoint, (which to me is more important than looking pretty to a human since I think raw XML is better read by machines than humans...).

Note that NeXML schema has something very similar to this; though they opted for rdfa meta nodes that use namespaces inside attribute values; this can be a pain to work with programmatically https://github.com/ropensci/RNeXML/issues/51.

Sorry that my class example failed; I clearly wasn't able to parse the correct class association for oboe:Mass. I believe our discussion raised the example of dealing with multiple inheritance. Here's a simple schema.org example:

{
  "http://schema.org/copyrightHolder": {
    "@type": "Organization",
    "@id": ...,
    "name": ...
  }
}

The property copyrightHolder could refer to an object that is either an Organization or a Person type; and while technically we might be able to infer this information from the associated URI (@id), I understand best practice here is to declare the type (well, the JSON-LD documentation says we should always declare @type for linked data, because linked data should be self-describing: https://json-ld.org/spec/latest/json-ld-api-best-practices/#typed-objects). Really I'm just trying to riff on those suggestions.

mbjones commented 6 years ago

I've now pushed an implementation of Option 1 to the 2.2 branch in sha d1a8c74ed9c4baa5a7da3c4cd0d4d01520f10e94, and an example document in sha ffba2d9cb7b33333ff31d3a594f5c5ba1683c643. Under this proposal, an example set of annotations in an EML attribute would be:

      <attribute id="attribute01">
        <attributeName>tmpair</attributeName>
        <attributeLabel>Air Temperature</attributeLabel>
        <attributeDefinition>Air temperature at 1m from ground.
        </attributeDefinition>
        <storageType>float</storageType>
        <measurementScale>
          <interval>
            <unit><standardUnit>celsius</standardUnit></unit>
            <precision>0.5</precision>
            <numericDomain id="nd.1">
              <numberType>real</numberType>
            </numericDomain>
          </interval>
        </measurementScale>
        <annotation>
            <propertyURI>http://ecoinformatics.org/oboe/oboe.1.2/oboe-core.owl#ofCharacteristic</propertyURI>
            <propertyLabel>characteristic</propertyLabel>
            <valueURI>http://ecoinformatics.org/oboe/oboe.1.2/oboe-characteristics.owl#Temperature</valueURI>
            <valueLabel>Temperature</valueLabel>
        </annotation>
        <annotation>
            <propertyURI>http://ecoinformatics.org/oboe/oboe.1.2/oboe-core.owl#ofEntity</propertyURI>
            <propertyLabel>entity</propertyLabel>
            <valueURI>http://purl.obolibrary.org/obo/ENVO_00002005</valueURI>
            <valueLabel>air</valueLabel>
        </annotation>
      </attribute>

While I think Option #2 would be more concise and readable, it also is more complicated for people and machines to process. Under option 2, the current URI elements would be typed as QNames, and then a processor would need to understand how to resolve the QName to a {uri, localName} pair, and from that construct the URI for the term. As discussed in the TAG finding on QNames in content, there are multiple ways to construct such a URI, and applications must define their algorithm. Thus, processing an EML document with annotations as QNames would be more complex than using plain URIs for the terms, and so that's where I left it.

Comments @cboettig, @mpsaloha, @mobb, or @csjx?

I am moving this ticket into review, and will plan to close it soon in the absence of comments.

cboettig commented 6 years ago

👍 for option 1, that looks good to me. (As a side note, https://github.com/cboettig/emld/issues/2 comments on what this would look like translated into a JSON-LD representation of EML, which would be more concise and easy to convert back into option 1 when rendering back to XML.)

mobb commented 6 years ago

I think this will work, but we will need some examples that are less arcane, and use classes from sources other than OBOE. It's important that the people who understand EML now can still do so in 2.2 -- that it doesn't take deep ontology knowledge to use.

mbjones commented 6 years ago

@mobb - thanks. So, I think the example I provided is pretty straightforward -- how to say that a column represents a measurement of Temperature of Air. Not sure how much simpler it could get. But it does show how non-intuitive using URIs for both properties and values would be for someone unfamiliar with RDF and semantic terminology. I think we would clearly need to write a primer on how to use this annotation feature effectively. Right now, though, I think the main things we need agreement on are:

1) Does the propertyURI/valueURI approach work generically?
2) Is including a single label for each URI sensible, so that resolution is not needed to produce more human-readable output?
3) Are the places where I allowed annotations adequate (Resource/Dataset, Entity, and Attribute), given that people could add additional annotations through additionalMetadata as desired using the describes element?

Had you noticed that I opened this ticket on 2001-08-31, which is 16 years ago? Yikes! I for one no longer want to let perfect be the enemy of good, and get this out the door.

gastil-buhl commented 6 years ago

Hi Matt,

I'm the guinea pig we used to see if a non-ontology-expert could understand these new annotation elements. I think I kind of understand what it is doing, but I would not be able to annotate attributes myself given just this example. So a primer would be useful, thank you.

My biggest confusion was about propertyLabel. Margaret explained the word label has a special meaning in ontologies. Besides entities and characteristics (which I read as things and stuff things have) are there other propertyLabel values expected? Is it a limited controlled vocabulary of propertyLabel's?

I agree done is better than perfect. Ideally, stuff gets done in a way it can grow logically, not done in a way that makes extension complicated.

Gastil

mbjones commented 6 years ago

@gastil-buhl thanks so much for the comments. I think the challenges you describe are very real, and we'll need to work on good documentation, including a primer. But I think they will be present for any semantically-precise implementation we might choose for EML. But re-reading the documentation I wrote, it's clear that I could be more concrete in describing just how the annotation abstraction (and it is definitely an abstraction) works. It's a very meta-level concept. In short, each annotation asserts some information about a part of an EML document, and that information is expressed as a property and a value, both of which are drawn from controlled vocabularies.

For example, I might want an annotation to say:

variable1 hasStorageType float

In this annotation, variable1 is the EML attribute that is being annotated (i.e., we are saying something about it), the property that we are asserting about variable1 is hasStorageType, which has the value float.

But the words 'hasStorageType' and 'float' are semantically ambiguous, in that there can be multiple definitions of those words. So, rather than using the human readable (and ambiguous) word 'hasStorageType', we instead use the URI for that term that provides a formal definition in its controlled vocabulary (something like http://example.com/vocab1/hasStorageType). So 'hasStorageType' is just a label we use to display the term defined by the URI. Similarly, the word 'float' is just a label used to display the more precise term that it represents (e.g., http://example.com/vocab1/float). So, in reality, the true annotation is expressed using URIs, not labels:

variable1 http://example.com/vocab1/hasStorageType http://example.com/vocab1/float

So the labels are just human readable strings to substitute for the controlled term URI when displaying the information.
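
The same idea as a short Python sketch: the triple is built from the URIs, and the labels are used only to render something readable. The `ATTRIBUTE` fragment reuses the example.com URIs from the explanation above with the propertyURI/valueURI element names from the Option 1 implementation; the `annotation_triples` helper is a hypothetical name.

```python
import xml.etree.ElementTree as ET

ATTRIBUTE = """
<attribute id="variable1">
  <annotation>
    <propertyURI>http://example.com/vocab1/hasStorageType</propertyURI>
    <propertyLabel>hasStorageType</propertyLabel>
    <valueURI>http://example.com/vocab1/float</valueURI>
    <valueLabel>float</valueLabel>
  </annotation>
</attribute>
"""


def annotation_triples(xml_text):
    """Return one record per annotation: the URI triple plus a display string."""
    subject = ET.fromstring(xml_text)
    records = []
    for ann in subject.findall("annotation"):
        records.append({
            # The real assertion: subject id, property URI, value URI.
            "triple": (
                subject.get("id"),
                ann.findtext("propertyURI"),
                ann.findtext("valueURI"),
            ),
            # Labels only stand in for the URIs when showing it to a person.
            "display": " ".join([
                subject.get("id"),
                ann.findtext("propertyLabel"),
                ann.findtext("valueLabel"),
            ]),
        })
    return records


for record in annotation_triples(ATTRIBUTE):
    print(record["display"])
    print(record["triple"])
```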

Maybe this helps? Clearly we'll need help writing clear documentation. The challenge will be in being both clear and concise. A primer will allow us to be more complete than we can be in the EML specification itself.

mbjones commented 6 years ago

Some people have requested that the definition of the annotation type include mention of the ability to include it in the additionalMetadata field of EML. In this case, the describes element would be used to define the subject of the annotation triple. Once I add that documentation, I think this feature is ready for release. We should, however, open another ticket for a primer document.
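
A hedged Python sketch of how that could be read follows. The exact placement of annotation under additionalMetadata/metadata in `DOC` is an assumption made for illustration, not something confirmed in this thread, and the `additional_annotations` helper is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Assumed layout: <describes> names the subject, <metadata> holds the annotation.
DOC = """
<eml>
  <dataset>
    <dataTable id="xyz"/>
  </dataset>
  <additionalMetadata>
    <describes>xyz</describes>
    <metadata>
      <annotation>
        <propertyURI>http://purl.org/dc/elements/1.1/subject</propertyURI>
        <propertyLabel>subject</propertyLabel>
        <valueURI>http://purl.obolibrary.org/obo/ENVO_01000177</valueURI>
        <valueLabel>grassland biome</valueLabel>
      </annotation>
    </metadata>
  </additionalMetadata>
</eml>
"""


def additional_annotations(xml_text):
    """Return (subject id, property URI, value URI) for out-of-line annotations."""
    root = ET.fromstring(xml_text)
    known_ids = {el.get("id") for el in root.iter() if el.get("id")}
    triples = []
    for block in root.findall("additionalMetadata"):
        subjects = [(d.text or "").strip() for d in block.findall("describes")]
        for ann in block.findall("metadata/annotation"):
            for subject in subjects:
                # The subject must be an id that exists elsewhere in the document.
                if subject not in known_ids:
                    raise ValueError(f"describes points at unknown id: {subject}")
                triples.append((
                    subject,
                    ann.findtext("propertyURI"),
                    ann.findtext("valueURI"),
                ))
    return triples


print(additional_annotations(DOC))
```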

gastil-buhl commented 6 years ago

Yes Matt that does help. How you explained it there would be useful text for a guide.

amoeba commented 6 years ago

Hey @mbjones this looks pretty good. A few thoughts came to mind:

Did we ever talk about this XML structure as an alternative?

<annotation>
  <property label="ofCharacteristic">http://ecoinformatics.org/oboe/oboe.1.2/oboe-core.owl#ofCharacteristic</property>
  <value label="Temperature">http://ecoinformatics.org/oboe/oboe.1.2/oboe-characteristics.owl#Temperature</value>
</annotation>

which feels a little more XML-ish. The label attribute could be optional. How it is now is fine with me though. I tend to shy away from XML attributes.

In the examples above, I see different types of things put in the property* and term* elements. I'm not sure what the correct way to use this is. In some examples, property contains a predicate (e.g., oboe:ofCharacteristic) and in others, it uses a class (e.g., oboe:Characteristic). In your most recent example, you put predicates in property which makes the most sense.

In this case, when we annotate an attribute with annotations of ofCharacteristic X and ofEntity Y (as in your above example), are we implying that the attribute is of type oboe:Measurement, but only through induction? Is there any benefit to explicitly stating this through another annotation with an rdf:type property and the class as the value?

In terms of workflows, I wonder about how this fits into the existing EML R package. Currently, a data.frame is used as a universal data structure for working with attributes in R, with each row corresponding to one attribute and each column describing the information about that attribute.
```
# eg
data.frame(attributeName = ..., 
         attributeLabel = ..., 
         measurementScale = ..., 
         domain = ...)
```
Since each attribute can have zero or more annotations, we'd either (1) have to break the one-row-per-attribute model, (2) use list columns, or (3) describe each attribute's annotation in a separate data structure. Options (2) and (3) seem reasonable and achievable given the proposed schema so I don't see any issues. I'd probably go with (3). Just wanted to get that out there.
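
For what it's worth, the shape of option (3) can be sketched outside R as well. The Python below is only an illustration of the data layout, not a proposal for the EML R package API; the field names, the second attribute, and the `annotations_for` helper are made up for the example.

```python
# One row per attribute, as in the existing one-row-per-attribute model.
attributes = [
    {"id": "att.12", "attributeName": "biomass"},
    {"id": "att.13", "attributeName": "richness"},  # hypothetical second attribute
]

# Separate long-format table: one row per annotation, keyed by attribute id.
annotations = [
    {
        "attributeId": "att.12",
        "propertyURI": "http://ecoinformatics.org/oboe/oboe.1.2/oboe-core.owl#ofCharacteristic",
        "valueURI": "http://ecoinformatics.org/oboe/oboe.1.2/oboe-characteristics.owl#Mass",
    },
    {
        "attributeId": "att.12",
        "propertyURI": "http://ecoinformatics.org/oboe/oboe.1.2/oboe-core.owl#ofEntity",
        "valueURI": "http://example.com/example-vocab-1.owl#PlantSample",
    },
]


def annotations_for(attribute_id, table=annotations):
    """Return every annotation row that belongs to one attribute."""
    return [row for row in table if row["attributeId"] == attribute_id]


for attribute in attributes:
    rows = annotations_for(attribute["id"])
    print(attribute["attributeName"], len(rows))
```

The attribute table keeps one row per attribute, and an attribute with zero annotations simply has no rows in the second table.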

Regarding your above points,

does the PropertyURI/valueURI approach work generically

I think so.

is including a single label for each URI sensible so that resolution is not needed to produce more human readable output

Yes, though there is a chance for the label to be out of sync with the URI if a user mis-types the information or if the rdfs:label of a term changes in the ontology.

Are the places that I allowed annotations adequate? 1) Resource/Dataset, 2) Entity, 3) Attribute, given that people could add additional annotations through additionalMetadata as desired using the describes element

I think so. Putting the annotations in-line reduces the need for the user to look in multiple places in the document for the information. And allowing additional annotations to be put into the additionalMetadata sets aside a catch-all place for annotations that don't belong elsewhere. Why did we decide to use a separate element though? It might be nice if I could just run the XPath //annotation to grab all the annotations in a document.

mbjones commented 6 years ago

After discussion with @mpsaloha and @mobb, we agreed that it would be good to make propertyURI and propertyLabel optional to create a simpler case when someone wants to just generically tag a resource, entity, or attribute. I made this change in SHA 733a650e43d4d31ca6ed04d71e1326687209a79b, and provided documentation that indicates the default property in the case one is omitted. For resource and entity subclasses, the default property is Dublin Core Element Set dc:subject, which lets us indicate generally the topic associated with a data set or a data table. For attribute elements, the default property is oboe:MeasurementType, which lets us associate attribute semantics with the variable.

mbjones commented 6 years ago

@amoeba Thanks for the comments.

Your proposal to nest the label in property and value elements is nice and readable and compact. I considered it, and decided initially not to because it's harder to access attributes using XPath. However, seeing it written out the way you did makes the whole block much more readable, so I am inclined to agree this would be good.
I'll have to look through the examples, but they should be consistent in the docs and examples. I agree that property should contain something that is like a predicate, whereas value should contain some class. Feel free to point out or fix specific places where the examples in the EML directory are wrong (the github ticket has earlier versions so I wouldn't worry about that per se).
We'll definitely need to discuss how to handle this in the EML R package functions.

I'm unclear on what your final point is about how we decided to use a separate element? They are all of the same type, and //annotation should work fine, although it loses the resource/entity/attribute parent that is critical to knowing the subject of the triple to be generated.

amoeba commented 6 years ago

I coulda made that comment more clear. I was referring to this:

through additionalMetadata as desired using the describes element

This means my XPath to grab all annotations has to look for both the annotation and the describes tag unless I misunderstand something.
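
A small sketch of the point being made here, using Python's standard `xml.etree.ElementTree`. The document and ids are made up for illustration, and the element layout follows the proposal in this thread rather than a released schema. A descendant search does find every `annotation`, but for the one under `additionalMetadata` the parent is just `metadata`, so the subject still has to be looked up through `describes`.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one inline annotation, one in additionalMetadata.
EML = """<eml packageId="example.1">
  <dataset id="ds1">
    <annotation>
      <propertyURI>http://purl.org/dc/elements/1.1/subject</propertyURI>
      <valueURI>http://purl.obolibrary.org/obo/ENVO_01000177</valueURI>
    </annotation>
  </dataset>
  <additionalMetadata>
    <describes>ds1</describes>
    <metadata>
      <annotation>
        <propertyURI>https://schema.org/memberOf</propertyURI>
        <valueURI>https://doi.org/10.17616/R37P4C</valueURI>
      </annotation>
    </metadata>
  </additionalMetadata>
</eml>"""


def annotation_contexts(root):
    """Return the parent tag of every annotation, in document order."""
    # ElementTree elements carry no parent pointer, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    return [parent_of[a].tag for a in root.iter("annotation")]


contexts = annotation_contexts(ET.fromstring(EML))
print(contexts)
```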

cboettig commented 6 years ago

I might just not be following things correctly at this stage, but I think I'm not in favor of this new proposal. I don't like not having a property URI, or having a property URI that is only defined by some implicit convention (let alone two separate default conventions depending on context). I think it's fine if user-facing tooling wants to permit a default property to make it 'easy' to tag an entity or attribute, but I think the property URI should be written explicitly into the EML. I think the EML schema itself is not the best way to establish this kind of implicit or default property (partly because I don't see where that definition exists, other than in documentation), and I think it is asking for trouble at some stage.

I would like to be able to treat any EML node as the subject (the parent node of the annotation, as in turtle or JSON-LD, or RDFa), and always have predicate/property URI and object/value of the triple clearly stated.

(To me, RDFa still seems like the most obvious way to add semantics to XML, and permits existing technology (any RDFa parser) to extract the semantic annotations with minimum fuss, though I don't particularly like RDFa notation). Unrelated issue I probably should have asked earlier, but I admit I'm also lost as to why you enforce that the value is a URI at all -- why not permit Literal valued objects?

csjx commented 6 years ago

You know, after catching up a bit on this thread, I do wonder why we are limiting annotations to resources, entities, and attributes, other than the very practical reasons that it limits the scope which limits the implementation changes. I certainly get that.

I think I agree with @cboettig here that it would be nice to apply annotations to any element in the EML. I like that @mbjones has put the time into defining the eml-annotation.xsd module, and that seems like how we can validate annotation syntax. But I've always thought they would be handled similarly to customUnits that we drop into /eml/additionalMetadata. But, to avoid the pain that is xs:any content, I'm wondering about adding a top level optional /eml/annotations element, which would be a list of 0..n annotation elements however we define them per the discussion above. The annotation so far provides the predicate and object of the triple, and so I'm thinking that we could define the subject as a references element that points to the id of the element being annotated. Something like:

```
<eml packageId="4cdb6dd6-66c2-478a-af66-9969a3142813" ...>
    <dataset>
        <title>We heart data science</title>
        <creator id="12345">
            <individualName><surName id="54321">Mecum</surName></individualName>
        </creator>
        <contact><references>12345</references></contact>
    </dataset>
    <annotations>
        <annotation>
            <references>54321</references>
            <propertyURI label="a">http://www.w3.org/1999/02/22-rdf-syntax-ns#type</propertyURI>
            <valueURI label="familyName">http://xmlns.com/foaf/0.1/familyName</valueURI>
        </annotation>
    </annotations>
</eml>
```

The crux here is to allow for the XML id attribute on effectively any [or almost any] element defined in the schema. As I understand it, we have only added an id attribute to certain elements for referencing within the instance documents, but this would be a uniform, optional, backward-compatible change (I think).

One advantage of doing this is that we might be able to convert these non-compliant triple statements into RDF-compliant statements by concatenating some dereference-able URI and the id attribute to create a subject URI, like:

https://cn.dataone.org/cn/v2/resolve/4cdb6dd6-66c2-478a-af66-9969a3142813#54321

Hmm, now that I look at that, we'd have a problem interpreting the pid from the document fragment reference. But you get what I mean. The ids are all unique references into the XML document as anchors.
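
One reading of the concern above is that a pid may itself contain characters such as `/`, `:` or `#`, which would make the boundary between pid and fragment ambiguous. The sketch below assumes that reading and is not a DataONE API: `subject_uri` and `split_subject` are hypothetical helpers that percent-encode the pid before appending the fragment, so the two parts can be separated again with `urllib.parse.urldefrag`. Fragments are also never sent to the server in an HTTP request, so a resolver would only ever see the part before `#`.

```python
from urllib.parse import quote, unquote, urldefrag

RESOLVE_BASE = "https://cn.dataone.org/cn/v2/resolve/"


def subject_uri(package_id, element_id):
    """Build a subject URI: resolve base + encoded pid + '#' + element id."""
    # safe="" also encodes '/', so nothing inside the pid reads as URI structure.
    return f"{RESOLVE_BASE}{quote(package_id, safe='')}#{quote(element_id, safe='')}"


def split_subject(uri):
    """Recover (pid, element id) from a URI built by subject_uri."""
    doc_url, fragment = urldefrag(uri)
    return unquote(doc_url[len(RESOLVE_BASE):]), unquote(fragment)


uri = subject_uri("4cdb6dd6-66c2-478a-af66-9969a3142813", "54321")
print(uri)

# A made-up pid containing ':', '/' and '#' still round-trips.
tricky = subject_uri("doi:10.5063/F1#v2", "54321")
print(split_subject(tricky))
```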

Well, food for thought.

amoeba commented 6 years ago

@cboettig wrote:

but I think the property URI should be written explicitly into the EML.

👍 I didn't notice that, if this is actually the case, when I read over the proposal.

I would like to be able to treat any EML node as the subject (the parent node of the annotation, as in turtle or JSON-LD, or RDFa), and always have predicate/property URI and object/value of the triple clearly stated.

👍 Though what, in this case, is the URI of the Subject? i.e., can we export all of the annotations in an EML document into a triple store?

cboettig commented 6 years ago

Love @csjx's suggestion of having an option for an id attribute on every element (at least every complex type). That also addresses @amoeba's second question, since the id of the parent node is the subject URI.

The NeXML schema in phylogenetics works this way via RDFa meta elements with the about attribute to refer to any node in the XML document, where all nodes can have an id.

I see what Chris's example is trying to say, but I find it confusing to think of foaf:familyName as a valueURI instead of a propertyURI, and I can't quite figure out the resulting triples. Basically I don't think this makes sense for simple types, which take values rather than nodes as their argument (i.e. in JSON-LD, you cannot have both @id and @value).

I imagine something like:

```
<?xml version="1.0" encoding="UTF-8"?>
<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 eml.xsd"
    packageId="http://dataone.org/abc123">
  <title>Sample Dataset Description</title>
  <creator id="23445" scope="document">
    <annotation>
      <propertyURI>https://schema.org/birthDate</propertyURI>
      <valueLiteral typeOf="xs:Date">1980-02-02</valueLiteral>
    </annotation>
    <individualName>
      <surName>Smith</surName>
    </individualName>
  </creator>
  <contact>
    <references>23445</references>
  </contact>
</eml:eml>
```

which would contain the single triple

<http://dataone.org/abc123#23445> <https://schema.org/birthDate> "1980-02-02"^^Date

Note the above <annotation> is clearly equivalent to

<meta property="https://schema.org/birthDate" content= "1980-02-02" typeOf="xs:Date">

which has the advantage that the triple could be extracted by any existing RDFa->RDF stylesheet and doesn't involve creating any new syntax.

(Though actually I think it makes more sense to interpret the whole document as triples, like this)

Okay, maybe I'm way off the deep end now, feel free to pull me back. 🏊

csjx commented 6 years ago

Ah, I see what you mean. @cboettig wrote:

I see what Chris's example is trying to say, but I find it confusing to think of foaf:familyName as a valueURI instead of a propertyURI

I guess I was trying to disambiguate the EML surName term and assert that it is of rdf:type foaf:familyName, but that raises the question of how you reference the "Mecum" value in order to annotate it. I like your idea that the parent element with an attached id references the value contained within that element. Yeah, I don't see any other way to identify the content uniquely.

Regarding your example of putting the annotation element as a child of the creator: that would assume we would change the content model of all complexTypes in EML and add in an optional annotation element. This can be done, but to me it has more overall impact across all modules (I guess in a visual sense). I suggested that we consolidate all annotations inside of /eml/annotations (somewhat less obtrusive), so I think this is a point to raise with others to see what people like. I do in fact like the annotation co-occurring right with the element it is annotating. In my example it is a step removed, so may be harder to grok when perusing the EML, which I'm sure we all like to do on a Saturday evening. 😜

mbjones commented 6 years ago

Thanks for all of the input, @cboettig, @csjx, and @amoeba. Good stuff.

Regarding the optionality of the property field: I agree, but @mpsaloha and @mobb felt strongly in the other direction, so I was trying to accommodate their desire to have a default property. I agree with you that having the property be explicit is important and far more manageable within the context of EML. I'll wait for some feedback from the others, but I think I will plan to move it back to having property be required. Getting more voices on this issue would be helpful.

Regarding serialization, I like Bryce's suggestion of embedding the property and value labels as attributes in the parent element, and will plan on making that change in the next revision.

Regarding the use of annotations in additionalMetadata, that was what I was intending all along. That is functionally equivalent to what @csjx proposed with the <annotations> element that spans the document. The nice thing about the <annotations> element is that it indicates explicitly in the EML schema that providing annotations on any element with an id is intended behavior, whereas it is only implicit in additionalMetadata. This is the same reason why I think it is helpful to provide an explicit annotation element for the major places where we will really look for semantic clarifications, mainly the dataset, entity, and attribute elements. By making the possibility of the annotation clear, I think people will be more likely to provide them, which is particularly important for attribute elements. So, what I take from this is that we should allow annotations in 3 types of places, and harvest them all up at the time of schema parsing:

in attribute, entity, and dataset (or other resource) elements
in an /eml/annotations root element
in /eml/additionalMetadata

Finally, @cboettig brought up a really new example with his use of a typed literal as the value of one of his example statements. I had been specifically avoiding the use of literals, as once we go down that slope we are really just re-inventing the RDF model within EML. The reason for the annotation element, in my mind, is to clarify the semantics of the existing literal values in an EML document. So, when we have an attribute with the literal attributeName=littermass, we can semantically clarify that the property measured might be Biomass. In contrast, Carl's example adds a whole new literal value (birthdate) to the ResponsibleParty type without extending the EML -- it basically would mean that people could add any literal to any EML element, and at that point we might as well just eliminate the XML serialization for EML and move to a full RDF serialization, which would be far easier to process than a mixed model. If we are to stick with an XSD schema for EML, I think the literal values should in general be modeled as extensions to the EML types. This is why I wrote the value element as having type xsd:anyURI, which precludes it from being a literal. I'm sure this will engender discussion.

As these issues are getting complex, and this conversation is dragging out, I think we should schedule a call to discuss the merits of the various proposals for annotation and come to some decisions so we can move forward with this. I will try to find a time this week on the EML slack channel (available on https://slack.nceas.ucsb.edu).

cboettig commented 6 years ago

@mbjones great points all around, and agree this would be nice to hash out in a more real-time discussion on the slack channel. Meanwhile I'm going to take the liberty of jotting a few notes here just in case scheduling on slack doesn't work out for me.

👍 from me for keeping property explicit; though I'd love to hear the perspectives from others on that. I feel there are some really important points being made there, but also that they are better addressed at a user tooling level rather than in the schema itself.
👍 on @amoeba label as attribute notation.
👍 on consolidating all the annotations
[ ] Would love to hear more discussion on @csjx's proposal of expanding the use of an id element to more complexTypes. I think this opens the door to more semantic annotation use cases we might not be thinking of right now (maybe not everyone sees that as a good thing), but more generally I think the ability to use id to reference nodes is a very useful feature.

Okay, bigger issues (in which I'm probably jousting at windmills, but anyway)...

Right, I appreciate the objective of the semantic extensions, as envisioned here, is really to provide some more precise semantics to existing literals such as measurements and not to open Pandora's box to arbitrary RDF statements. I'm not entirely clear that using semantic annotations and restricting those annotations to URI types is the best way to accomplish that though -- there's still a lot that can be expressed as URIs that go outside of this scope, and it still means that EML is inherently a mixed-type model from which I can neither easily extract generic RDF statements that have much meaning nor predict all of the properties. If the goal is narrowly to, say, define attributes in terms of OBOE properties, I wonder if we shouldn't be XML-izing OBOE and extending EML explicitly with those terms rather than adding arbitrary URIs? (That sounds complicated and personally I don't advocate for that path, but just throwing it out there as a thought experiment).

I see/agree that adding literals and permitting annotations on arbitrary nodes would allow arbitrary semantic extensions without extending EML. Likewise, I agree that in such case, it makes sense to treat all EML as RDF, and not just a few random semantic nodes (e.g. in my example, it makes little sense to extend creator with a birthDate as a semantic annotation if the rest of the creator metadata is not also accessible as the obvious triples). However, I don't think this means abandoning the XSD schema.

I see that if any random RDF is suddenly valid EML that we've pretty much lost any advantage in having a well-defined schema and we're pretty much back in a mess where you don't know where to find the title of the dataset (is it dc:title or schema:name?) let alone anything more complex, and I'm not advocating for that at all. I currently think of EML as functionally equivalent to JSON-LD modulo some syntax: defining nested objects in a predictable structure in a well-defined context (i.e. the EML namespace). (Since JSON-LD has 1:1 map to RDF, this is semantic, but also has a pretty obvious 1:1 map to XML and maintains the notion of schema validity). I think of any semantic annotations on top of EML as necessarily being outside of the EML @context (in the JSON-LD sense), i.e. that a tool should always be able to ignore these and still get a meaningful picture, but that a particular family of tools could also be defined to work on an extended @context, e.g. EML + oboe extensions. I think this allows meaningful semantic extensions that:

(a) can see the EML in which they are embedded semantically,
(b) obey all the standard syntax and rules of semantic markup anywhere else, no arbitrary gotchas of default types or URI-values only
(c) encourages that extensions are clearly scoped and defined in an 'extended' context
(d) Anything outside of the base EML context can easily be stripped off and what remains can be validated against the EML schema

I think such an approach makes it more obvious to developers and consumers how to interact with a semantic layer in EML, or more generally, how to interact in EML semantically, without making arbitrary semantic extensions into first-class citizens that any parser must suddenly be able to deal with.

Anyway, treating all of EML as JSON-LD or RDF is pretty far from the topic here, so like I've said before, forge ahead with the practical, but I do think it provides a nice illustration of something that is both extensible and flexible but doesn't lose any of the power we gained in the first place from a rigidly-defined schema -- after all you can always transform into the XSD-valid schema representation. I've mentioned this before and pestered @amoeba with a proof-of-principle to transform between RDF, JSON, and EML-valid XML: https://github.com/cboettig/emld. (Really this is just the back-end for a re-write of the EML package that uses lists instead of S4, which provides what I hope will be a lot more intuitive interface for most R users or developers. https://github.com/cboettig/eml2)

okay, probably lost everyone now, so better turn in for the night. :moon: 🛌

mpsaloha commented 6 years ago

Hi, I'm not an XML maven like most of you here, so I don't have opinions or insight on the specifics of serialization solutions from the XML guts of EML. However, Matt asked me to look over the latest comments on semanticizing EML and here are my thoughts, fwiw:

the proposed "valueURI" should probably be renamed as "objectURI" as it appears consistently to be used as an object for an RDF triple-- and as such it can contain either literals (i.e. values) or URI's -- unlike subjects and predicates that can only be represented by URI's.

Carl's concern that "foaf:name" doesn't seem like a "value" (triple object) is reasonable, since that foaf term is indeed a property (predicate). Looking at csjx suggestion about using the EML 'id' (comment from 13 days ago), I think a desired triplification {s,p,o} would be:

\\\<"Mecum"> or \https://cn.dataone.org/cn/v2/resolve/4cdb6dd6-66c2-478a-af66-9969a3142813#54321 \ \<"Mecum">

(where the challenge may be minting that httpURI in the subject position?)

and we'd hope that rather than \="Mecum", things will eventually devolve to

\https://cn.dataone.org/cn/v2/resolve/4cdb6dd6-66c2-478a-af66-9969a3142813#54321 \\http://orcid.org/0000-0002-0381-3766

(above isn't quite correct since foaf doesn't offer appropriate property "foaf:orcidID", although it has "foaf:skypeID" and "foaf:icqchatID" etc.!)

For many of our Use Cases, however, I think we frequently "simply" (ha!) want to mint an http URI that points to the specific element in EML for which we want to add semantics. So, going back to Matt's early comments from Jan. 4

<"variable1"> <http://example.com/vocab1/hasStorageType> <http://example.com/vocab1/float>

what most excites me is being able to frequently assert: \<"variable1"> \ \

(...although note again that we'd need to mint an httpURI in the subject position.)

Also, this is why in discussion with Matt and Margaret, I had suggested that, at least for attribute metadata, a default propertyURI could be "rdf:type", simply asserting class membership of some URI-specified instance ("variable1" in this case), as a member of Class "measurementTypeXXX".

That "measurementTypeXXX" would be defined in our ECSO ontology and accessible with its PIRI GUID (PURL, that is) --with, e.g. an rdfs:label or skos:prefLabel of "air temperature"-- and appropriate axioms about what characteristic ("temperature"), what entity ("air"), and potentially dimensions ("degrees Celsius") etc. describe that MeasurementTypeXXX. But all that additional information would/could be garnered from dereferencing the ECSO URI in the object position of the triple.

Well, hope this isn't too muddled or trivial...it's kind of turtle-ish.

mbjones commented 6 years ago

After several offline conversations, we have reached consensus on implementing annotations using just property and value URIs, which in turn can be located in 5 locations in the EML document:

in attribute, entity, and dataset (or other resource) elements
in an /eml/annotations root element
in /eml/additionalMetadata

We've also agreed to embed the label in the element for readability. So a typical annotation would look like:

```
<annotation>
    <propertyURI label="uses unit">http://ecoinformatics.org/oboe/oboe.1.2/oboe-core.owl#usesStandard</propertyURI>
    <valueURI label="Kilogram">http://ecoinformatics.org/oboe/oboe.1.2/oboe-standards.owl#Kilogram</valueURI>
</annotation>
```

In that case, the annotation is embedded in a containing EML attribute element, and so the annotation's subject is that attribute. Constructing a URI for the subject can be done by appending the element identifier onto the document URI with a fragment identifier.
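
As a rough illustration of that rule, the sketch below turns an inline annotation into a triple whose subject is the document URI plus a fragment for the containing element. `DOC_URI`, the `att.7` id and the `inline_triples` helper are assumptions made up for this example; only the property and value URIs come from the annotation above.

```python
import xml.etree.ElementTree as ET

DOC_URI = "https://example.org/datasets/example.1"  # hypothetical document URI

ATTRIBUTE = """<attribute id="att.7">
  <attributeName>littermass</attributeName>
  <annotation>
    <propertyURI label="uses unit">http://ecoinformatics.org/oboe/oboe.1.2/oboe-core.owl#usesStandard</propertyURI>
    <valueURI label="Kilogram">http://ecoinformatics.org/oboe/oboe.1.2/oboe-standards.owl#Kilogram</valueURI>
  </annotation>
</attribute>"""


def inline_triples(element, doc_uri):
    """Return (subject, predicate, object) for annotations directly under element."""
    element_id = element.get("id")
    if element_id is None:
        raise ValueError(f"<{element.tag}> has annotations but no id to use as subject")
    subject = f"{doc_uri}#{element_id}"
    return [
        (subject, ann.findtext("propertyURI").strip(), ann.findtext("valueURI").strip())
        for ann in element.findall("annotation")
    ]


triples = inline_triples(ET.fromstring(ATTRIBUTE), DOC_URI)
for s, p, o in triples:
    print(f"<{s}> <{p}> <{o}> .")
```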

For annotations in /eml/annotations, the subject of the annotation is established using a references attribute that points at the id of the subject of the annotation. In working through the implementation of the 'annotations' element at the top level EML module, I decided it's cleaner to treat references as an attribute, so that the annotations list ends up like this:

```
<annotations>
    <annotation references="CDR-biodiv-table">
        <propertyURI label="Subject">http://purl.org/dc/elements/1.1/subject</propertyURI>
        <valueURI label="grassland biome">http://purl.obolibrary.org/obo/ENVO_01000177</valueURI>
    </annotation>
    <annotation references="adam.shepherd">
        <propertyURI label="is a">http://www.w3.org/1999/02/22-rdf-syntax-ns#type</propertyURI>
        <valueURI label="Person">https://schema.org/Person</valueURI>
    </annotation>
    <annotation references="adam.shepherd">
        <propertyURI label="member of">https://schema.org/memberOf</propertyURI>
        <valueURI label="BCO-DMO">https://doi.org/10.17616/R37P4C</valueURI>
    </annotation>
</annotations>
```
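
To show how a consumer might read that list, here is a minimal sketch that groups the annotations by their `references` attribute. The `by_subject` function is a hypothetical helper, not part of any EML tooling; the XML is the example above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

ANNOTATIONS = """<annotations>
  <annotation references="CDR-biodiv-table">
    <propertyURI label="Subject">http://purl.org/dc/elements/1.1/subject</propertyURI>
    <valueURI label="grassland biome">http://purl.obolibrary.org/obo/ENVO_01000177</valueURI>
  </annotation>
  <annotation references="adam.shepherd">
    <propertyURI label="is a">http://www.w3.org/1999/02/22-rdf-syntax-ns#type</propertyURI>
    <valueURI label="Person">https://schema.org/Person</valueURI>
  </annotation>
  <annotation references="adam.shepherd">
    <propertyURI label="member of">https://schema.org/memberOf</propertyURI>
    <valueURI label="BCO-DMO">https://doi.org/10.17616/R37P4C</valueURI>
  </annotation>
</annotations>"""


def by_subject(annotations_xml):
    """Group (property, value) pairs by the id named in @references."""
    grouped = defaultdict(list)
    for ann in ET.fromstring(annotations_xml).findall("annotation"):
        pair = (ann.findtext("propertyURI"), ann.findtext("valueURI"))
        grouped[ann.get("references")].append(pair)
    return dict(grouped)


grouped = by_subject(ANNOTATIONS)
for subject_id, pairs in grouped.items():
    print(subject_id, len(pairs))
```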

For annotations in /eml/additionalMetadata, the subject is the element that has the id listed within the associated describes element:

```
<additionalMetadata>
    <describes>adam.shepherd</describes>
    <metadata>
        <annotation>
            <propertyURI label="member of">https://schema.org/memberOf</propertyURI>
            <valueURI label="BCO-DMO">https://doi.org/10.17616/R37P4C</valueURI>
        </annotation>
    </metadata>
</additionalMetadata>
```
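
A sketch of how the subject could be resolved in this case, assuming a surrounding document in which some element carries the id `adam.shepherd`. The `creator` element and its surname are invented for the example, and `additional_metadata_annotations` is a hypothetical helper. It pairs each annotation with every `describes` id in the same block and reports whether that id actually exists in the document.

```python
import xml.etree.ElementTree as ET

EML = """<eml packageId="example.1">
  <dataset>
    <creator id="adam.shepherd">
      <individualName><surName>Shepherd</surName></individualName>
    </creator>
  </dataset>
  <additionalMetadata>
    <describes>adam.shepherd</describes>
    <metadata>
      <annotation>
        <propertyURI label="member of">https://schema.org/memberOf</propertyURI>
        <valueURI label="BCO-DMO">https://doi.org/10.17616/R37P4C</valueURI>
      </annotation>
    </metadata>
  </additionalMetadata>
</eml>"""


def additional_metadata_annotations(root):
    """Return (subject id, id exists?, property, value) for each annotation."""
    known_ids = {el.get("id") for el in root.iter() if el.get("id")}
    results = []
    for block in root.findall("additionalMetadata"):
        subjects = [(d.text or "").strip() for d in block.findall("describes")]
        for ann in block.findall("metadata/annotation"):
            for subject in subjects:
                results.append((subject, subject in known_ids,
                                ann.findtext("propertyURI"), ann.findtext("valueURI")))
    return results


results = additional_metadata_annotations(ET.fromstring(EML))
print(results)
```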

That should wrap up implementation of the annotation field. Merge commit is SHA fbafee0a2a8c45f056551254e90c3a8e5478501c.

mbjones commented 6 years ago

After discussion, we agreed to add language to conditionally require the use of the id on elements that contain annotation elements with an implied subject. The id would then be used to construct a subject URI based on the document's base URI plus a fragment identifier, such as https://dataone.org/datasets/{dataset-identifier}#element-id

We decided that making id mandatory everywhere would be backwards incompatible and therefore undesirable, despite the benefits of having unique ids for referencing document elements.

This requires an addition to EML Parser.
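
The kind of check this implies might look like the sketch below. It is not the EML Parser implementation (which is a separate codebase); `annotations_missing_subject` and the sample document are assumptions for illustration. It flags any element that directly contains an `annotation` but has no `id`, skipping the `annotations` and `metadata` containers, whose subjects come from `references` and `describes` instead.

```python
import xml.etree.ElementTree as ET

# Containers whose annotations name their subject explicitly.
EXPLICIT_SUBJECT_CONTAINERS = ("annotations", "metadata")

EML = """<eml packageId="example.1">
  <dataset id="ds1">
    <annotation/>
    <dataTable id="t1">
      <attributeList>
        <attribute>
          <attributeName>littermass</attributeName>
          <annotation/>
        </attribute>
      </attributeList>
    </dataTable>
  </dataset>
  <annotations>
    <annotation references="ds1"/>
  </annotations>
</eml>"""


def annotations_missing_subject(root):
    """Return tags of elements that hold an inline annotation but lack an id."""
    problems = []
    for parent in root.iter():
        if parent.tag in EXPLICIT_SUBJECT_CONTAINERS:
            continue
        if parent.find("annotation") is not None and not parent.get("id"):
            problems.append(parent.tag)
    return problems


problems = annotations_missing_subject(ET.fromstring(EML))
print(problems)
```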

Reopening until I can update this documentation and EMLParser.

mpsaloha commented 6 years ago

This sounds good to me, but I think we need to clarify the constraints, if any, on the contents of the “value=” fields for the two URI elements. Are these free text or {should | must} these be populated by an rdfs:label or skos:prefLabel if such exist? Although we have (I believe) encountered cases where there is no helpful Annotation Property of this sort and some natural language semantics is implied by the URI itself...

mobb commented 6 years ago

For the two label fields, my opinion is that they {should} be populated by an rdfs:label or skos:prefLabel if one exists.

Mainly, because to say {must} would mean that we ought to be able to confirm the label is correct, which is not practical. Communities may want to do their own checking however, which would be tied to specific vocabularies.
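
A community-level check of this sort could be as small as the sketch below, which returns warnings rather than errors, in keeping with {should}. Everything here is an assumption: `PREFERRED_LABELS` is a hand-written stand-in for labels a real tool would load from the vocabularies it cares about, and `label_warnings` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical label table; URIs that are not listed are simply not checked.
PREFERRED_LABELS = {
    "http://purl.org/dc/elements/1.1/subject": "Subject",
    "http://purl.obolibrary.org/obo/ENVO_01000177": "grassland biome",
}


def label_warnings(annotation, preferred=PREFERRED_LABELS):
    """Return warning strings for labels that differ from the known label."""
    warnings = []
    for el in annotation:  # the propertyURI and valueURI children
        uri = (el.text or "").strip()
        expected = preferred.get(uri)
        label = el.get("label")
        if expected is not None and label != expected:
            warnings.append(f"{el.tag}: label {label!r} differs from {expected!r} for {uri}")
    return warnings


sample = ET.fromstring(
    '<annotation>'
    '<propertyURI label="Subject">http://purl.org/dc/elements/1.1/subject</propertyURI>'
    '<valueURI label="grasslands">http://purl.obolibrary.org/obo/ENVO_01000177</valueURI>'
    '</annotation>'
)
found = label_warnings(sample)
print(found)
```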

mpsaloha commented 5 years ago

At the LTER ASM breakout discussion on vocabularies, there was great interest in how to use/substitute formally defined (i.e. by specifying a dereferenceable GUID from a term in an (approved) thesaurus or ontology) terms as EML KEYWORDS. Some discussion ensued that semantic annotation at the level of dataset and entity essentially constitute EML KEYWORDS describing the object at that level. We and potentially the LTER Community need to agree on best practices in this process. Clearly having well-conceived EML KEYWORDS will be a major boon (and possibly opens up some interesting uses for Object Properties).