BiodiversityOntologies / bco

Biological Collections Ontology
Creative Commons Zero v1.0 Universal

Update Darwin Core import #107

Closed ramonawalls closed 3 years ago

ramonawalls commented 4 years ago

It has been years, and I am sure DwC has changed. We need to regenerate the OWL file from DwC as RDF. @tucotuco, do you remember how we did that?

dr-shorthair commented 4 years ago

http://rs.tdwg.org/dump/iri.ttl (thanks @baskaufs )

dr-shorthair commented 4 years ago

There is no axiomatization though - not even RDFS. I started building the obvious parts here: https://github.com/dr-shorthair/dwc/blob/master/rdf/axioms.ttl

ramonawalls commented 4 years ago

Thanks, @dr-shorthair! I just saw your comments. I see that you have also started to address issue #10, by sorting out which DwC terms are object versus data properties. I'll be working on this over the next few days, and will run the outcomes by you and @tucotuco before releasing anything. Would also be great to get feedback from @baskaufs.

baskaufs commented 4 years ago

It would be great if the object property/datatype property sorting that you do were consistent with the recommendations in the normative term reference section of the RDF Guide, which should sort all DwC properties in the way you want. @tucotuco and I sat down once for about an hour and sorted them all out, but it is quite possible that we didn't get them all right, so additional eyes and opinions would be great. If some properties are mis-sorted, we should change the guide.

In general, every dwc: property is a datatype property and every dwciri: property is an object property. So the issue is really whether there are some dwc: properties that should have dwciri: analogs but don't. I'll try to write a SPARQL query to give you that list. This is not something that should need to be done manually now that all of the TDWG term data is available in machine-readable form.
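
As a rough illustration of that rule of thumb, the following Python sketch sorts a property IRI by namespace alone. The function name `property_kind` is hypothetical, and the two namespace IRIs are the ones declared as prefixes in the queries in this thread. It assumes the IRI it is given is a property (the dwc: namespace also contains classes) and it does not consult the RDF Guide, so any exceptions recorded there would need separate handling.

```python
DWC = "http://rs.tdwg.org/dwc/terms/"
DWCIRI = "http://rs.tdwg.org/dwc/iri/"


def property_kind(term_iri: str) -> str:
    """Classify a Darwin Core property by namespace only (rule of thumb).

    Assumption: the caller already knows term_iri is a property, not a class.
    """
    if term_iri.startswith(DWCIRI):
        return "owl:ObjectProperty"    # dwciri: values are IRIs
    if term_iri.startswith(DWC):
        return "owl:DatatypeProperty"  # dwc: values are literals
    return "unknown"                   # borrowed or non-DwC term


print(property_kind(DWC + "recordedBy"))
print(property_kind(DWCIRI + "recordedBy"))
```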

baskaufs commented 4 years ago

OK, you can go to https://sparql.vanderbilt.edu/ and run the following queries to get various categories of terms. That SPARQL endpoint has all of the TDWG term/vocabulary/standards metadata loaded in the graph http://rs.tdwg.org/ .

all current Darwin Core-defined (non-borrowed) properties (221 results)

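prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix owl: <http://www.w3.org/2002/07/owl#>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dwc: <http://rs.tdwg.org/dwc/terms/>
prefix dwciri: <http://rs.tdwg.org/dwc/iri/>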
select distinct ?dwcproperty
from <http://rs.tdwg.org/>
where {
  ?dwcproperty rdf:type rdf:Property.
  {  ?dwcproperty rdfs:isDefinedBy dwc:.}
  union
  {  ?dwcproperty rdfs:isDefinedBy dwciri:.}
  minus
  {  ?dwcproperty owl:deprecated "true"^^xsd:boolean.}
}
order by ?dwcproperty

dwciri: properties (43 results. Note: this includes the seven dwciri: properties having no dwc: analogs)

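prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix owl: <http://www.w3.org/2002/07/owl#>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dwc: <http://rs.tdwg.org/dwc/terms/>
prefix dwciri: <http://rs.tdwg.org/dwc/iri/>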
select distinct ?iriterm
from <http://rs.tdwg.org/>
where {
  ?iriterm rdfs:isDefinedBy dwciri:.
  minus
  {  ?iriterm owl:deprecated "true"^^xsd:boolean.}
}
order by ?iriterm

dwc: properties that have dwciri: analogs (36 results)

prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix owl: <http://www.w3.org/2002/07/owl#>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dwc: <http://rs.tdwg.org/dwc/terms/>
prefix dwciri: <http://rs.tdwg.org/dwc/iri/>
select distinct ?dwcterm
from <http://rs.tdwg.org/>
where {
  ?iriterm rdfs:isDefinedBy dwciri:.
  ?dwcterm rdfs:isDefinedBy dwc:.
  ?dwcterm rdf:type rdf:Property.
  bind (str(?iriterm) as ?string)
  bind (substr(?string,28,strlen(?string)-27) as ?localname)
  filter (strends(str(?dwcterm),?localname))
  minus
  {  ?dwcterm owl:deprecated "true"^^xsd:boolean.}
}
order by ?dwcterm
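
The matching step in this query (`substr` from position 28, then `strends`) can be mirrored in a few lines of Python, which may make it easier to see what counts as an "analog". This is only a sketch: `local_name` and `split_by_analog` are hypothetical helper names, and the sample terms are hand-picked for illustration rather than taken from the endpoint's results.

```python
DWC = "http://rs.tdwg.org/dwc/terms/"
DWCIRI = "http://rs.tdwg.org/dwc/iri/"


def local_name(iri_term: str) -> str:
    # Mirrors substr(?string, 28, ...): SPARQL substr is 1-based and the
    # dwciri: namespace is 27 characters long, so the local name starts at 28.
    return iri_term[len(DWCIRI):]


def split_by_analog(dwc_terms, iri_terms):
    """Split dwc: terms into those with and without a dwciri: analog."""
    names = [local_name(t) for t in iri_terms]
    # Mirrors filter(strends(str(?dwcterm), ?localname)): a suffix match,
    # not an exact comparison of local names.
    with_analog = sorted(t for t in dwc_terms if any(t.endswith(n) for n in names))
    without_analog = sorted(set(dwc_terms) - set(with_analog))
    return with_analog, without_analog


# Hand-picked sample, not output from the SPARQL endpoint.
dwc_sample = [DWC + "recordedBy", DWC + "catalogNumber"]
iri_sample = [DWCIRI + "recordedBy", DWCIRI + "inDescribedPlace"]
with_analog, without_analog = split_by_analog(dwc_sample, iri_sample)
print(with_analog)
print(without_analog)
```

Because the comparison is a suffix match, a dwc: term whose IRI merely ends with a dwciri: local name would also be counted as an analog, both here and in the query.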

dwc: terms without dwciri: analogs (142 results)

prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix owl: <http://www.w3.org/2002/07/owl#>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dwc: <http://rs.tdwg.org/dwc/terms/>
prefix dwciri: <http://rs.tdwg.org/dwc/iri/>
select distinct ?dwcterm
from <http://rs.tdwg.org/>
where {
  ?dwcterm rdfs:isDefinedBy dwc:.
  ?dwcterm rdf:type rdf:Property.
    minus {
    ?iriterm rdfs:isDefinedBy dwciri:.
    ?dwcterm rdfs:isDefinedBy dwc:.
    bind (str(?iriterm) as ?string)
    bind (substr(?string,28,strlen(?string)-27) as ?localname)
    filter (strends(str(?dwcterm),?localname))
  }
   minus
   {  ?dwcterm owl:deprecated "true"^^xsd:boolean.}
}
order by ?dwcterm

The advantage of using these query results rather than sorting by hand is not only that you avoid typos and other human errors, but also that you can re-run the queries any time DwC is updated. The Vanderbilt SPARQL endpoint is loaded with this script whenever the TDWG metadata is updated, and anyone could use that script to load their own triplestore.
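
Re-running a query could be scripted along the following lines. This is a sketch under assumptions, not a tested client for this service: the endpoint path `https://sparql.vanderbilt.edu/sparql` is assumed (only the site URL is given above), `build_request` and `run_select` are hypothetical names, and the code relies only on the standard SPARQL protocol convention of a `query` parameter on a GET request with a JSON results media type.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed endpoint path; only https://sparql.vanderbilt.edu/ is named in the thread.
ENDPOINT = "https://sparql.vanderbilt.edu/sparql"


def build_request(query: str, endpoint: str = ENDPOINT) -> Request:
    """Build a SPARQL protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def run_select(query: str, endpoint: str = ENDPOINT) -> list:
    """Send a SELECT query and return one {variable: value} dict per row."""
    with urlopen(build_request(query, endpoint)) as response:
        results = json.load(response)
    return [
        {name: cell["value"] for name, cell in row.items()}
        for row in results["results"]["bindings"]
    ]


# Usage (needs network access and a live endpoint):
# rows = run_select(open("dwc_without_analogs.rq").read())
# print(len(rows))
```

The file name in the usage comment is a placeholder for wherever one of the queries above has been saved.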

baskaufs commented 4 years ago

I also just wanted to clarify one thing. The lack of "axiomatization" of DwC isn't an oversight; it's by design. In 2010 there was a very lengthy discussion about how or whether DwC should be turned into an ontology (as well as many other related issues). There's a summary of the discussion here. One opinion that was expressed repeatedly (see this, this, this, this, and this post for examples) was that it was probably a good idea to keep the essence of Darwin Core free from complex assertions and then build more complex layers on top of that basic layer. That approach was taken from the start with Audubon Core, which was billed as a "representation agnostic" basic set of terms, with the potential to explain later, via best-practices guides, how to use it with RDF and other technologies.

This approach was codified in Section 4.4.2.2 of the TDWG Standards Documentation Specification, which directed that term properties that generate machine-computable entailments should be asserted in a separate document from the one that asserts the basic properties. So far, no one has done this and gotten a higher layer ratified as an addition to any TDWG Vocabulary. Thus, when you acquire the machine-readable metadata for TDWG terms, no axioms are there.

The TDWG Vocabulary Maintenance Specification outlines the process for developing such "enhancements" in Section 4. In a nutshell, before a proposed enhancement (such as an ontology add-on or application profile) can be adopted, the proposers must produce a feature report (such as a list of competency questions) that describes what the enhancement should accomplish. After the creation of the enhancement, the proposers must produce an implementation experience report showing which of the proposed features were successfully implemented. This is similar to the processes required for standards development by our aspirational peer standards organizations like the W3C and IETF. So far, no one in TDWG has gone through this process, but several current task groups are in various stages of the process.

I mention these requirements in this context because it seems that people who want to communicate how RDF should be used frequently do so by creating ontologies. For example, if they want to say that a particular property should have a subject that is a member of a certain class, they use a domain assertion. But when an ontology user asserts a subject that's of the wrong class, the ontology doesn't prevent that; it just entails that the subject resource is also a member of another class (the one specified by the domain declaration). There are ways to be tricky using disjoint statements so that a graph is rendered inconsistent when people use the ontology in the wrong way. We did that in Darwin-SW. But we wrote Darwin-SW before the Shape Expressions (ShEx) language was developed, and I now feel that ShEx is a much more appropriate way to try to "control" how people use RDF.
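
The difference between entailment and validation can be shown with a toy Python sketch. Everything here is hypothetical: the `ex:` terms are made up, `apply_domain_entailment` implements only the RDFS domain rule (rdfs2) over plain tuples, and `check_subject_class` is a stand-in for the kind of constraint a shape language expresses, not ShEx itself.

```python
RDF_TYPE = "rdf:type"
RDFS_DOMAIN = "rdfs:domain"


def apply_domain_entailment(triples):
    """RDFS rule rdfs2: (p rdfs:domain C) and (s p o) entail (s rdf:type C)."""
    domains = {(s, o) for s, p, o in triples if p == RDFS_DOMAIN}
    inferred = set(triples)
    for s, p, _ in triples:
        for prop, cls in domains:
            if p == prop:
                inferred.add((s, RDF_TYPE, cls))
    return inferred


def check_subject_class(triples, prop, expected_class):
    """Shape-style check: list subjects of prop not typed as expected_class."""
    typed = {s for s, p, o in triples if p == RDF_TYPE and o == expected_class}
    return sorted({s for s, p, _ in triples if p == prop and s not in typed})


# Made-up data: an image is (wrongly) given a property meant for specimens.
data = {
    ("ex:collectedBy", RDFS_DOMAIN, "ex:Specimen"),
    ("ex:photo1", RDF_TYPE, "ex:Image"),
    ("ex:photo1", "ex:collectedBy", "ex:alice"),
}

# The domain assertion raises no error; it just adds a second type.
print(("ex:photo1", RDF_TYPE, "ex:Specimen") in apply_domain_entailment(data))

# A validation-style check on the asserted data flags the subject instead.
print(check_subject_class(data, "ex:collectedBy", "ex:Specimen"))
```

Note that the check is run on the asserted triples; run after entailment, it would pass, which is exactly why a domain assertion cannot act as a constraint.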

My point is that the appropriate tool for solving a problem depends on the definition of the problem, hence the requirement in the VMS for a feature report. I'm not sure exactly what the ultimate goals of BCO are, but I'd encourage you to lay out a clear set of use cases if you want it to be used more widely in the TDWG community. You may have already done that, but I haven't prowled around in the BCO documentation recently to look for it.

ramonawalls commented 4 years ago

Thanks for the links, @baskaufs. They are exactly what I needed.

I fully understand about the lack of axiomatization in DwC. I suppose it is useful to have that information here, since you never know who is likely to read this post.

We have multiple active use cases for BCO with DwC, primarily for integrating continuous data from multiple sources, for which we use DwC terms as data properties. I admit this is not well documented, since we have been focusing on making the pipeline work internally. However, we are now working on a new manuscript that will describe the full process, in case anyone else wants to use it.

The relative merits of ShEx/SHACL versus ontologies are WAY outside the scope of this issue.

ramonawalls commented 4 years ago

http://rs.tdwg.org/dump/terms.ttl
http://rs.tdwg.org/dump/iri.ttl

ramonawalls commented 3 years ago

Imports have been added, and I moved the code to this repository, but it still needs some work. I will close this issue and create new ones for the code.