dcmi / usage

DCMI Usage Board - meeting record and decisions

Broken ranges? #32

Closed tombaker closed 6 years ago

tombaker commented 6 years ago

In 2016, Steven Anderson argued on DC-ARCHITECTURE that the following properties had assigned ranges different from ranges used in practice:

In response, Makx pointed out that people do whatever they want, which should be okay in closed environments, "but if you want your stuff to make sense in a more open environment, you'd better do things the way they were supposed to be done".

Antoine reported with regard to Europeana data:

We need the flexibility, as our data is mixed: we sometimes receive literals, sometimes URIs. And it seemed very bad to have to use two different properties. And there was no "created" in the DC 1.1 namespace, which doesn't have formal ranges. Maybe putting this formal range on dcterms:created was not the best decision at the time it was created. Note that the properties at schema.org also have a softer approach to 'ranges'. After some discussion they ended up using the notion of 'expected types', which is much more flexible [1]. I'm not saying that control is bad per se. On the contrary, in many cases it is crucial; see the ongoing work in the DCMI RDF AP Task Group [2] (and we do control some things in Europeana). But having elements controlled too strictly at the level of the vocabulary/ontology may be counterproductive (in the Semantic Web community some use the notion of 'semantic overcommitment' to refer to this issue).

osma commented 6 years ago

I've had my share of these issues (e.g. creating the application profile for HealthFinland metadata circa 12 years ago - my first SW project!) and often had to resort to DC 1.1 because the DCTerms ranges seemed too strict.

I'm a fan of schema.org expected types. This is what they say on the Getting Started page:

Expected types vs text. When browsing the schema.org types, you will notice that many properties have "expected types". This means that the value of the property can itself be an embedded item (see section 1d: embedded items). But this is not a requirement—it's fine to include just regular text or a URL. In addition, whenever an expected type is specified, it is also fine to embed an item that is a child type of the expected type. For example, if the expected type is Place, it's also OK to embed a LocalBusiness.

I wonder if we could do something similar with DC property ranges?

kcoyle commented 6 years ago

Osma, I think if we want to remain RDF-compliant we would not assign property ranges, and the "expected type" would be a note. It looks like schema.org does this by having only sub-/superclass relationships in the ontology, while the "expected types" are part of the documentation files. The documentation refers to them as "notes" [1]. As usual, the actual RDF files are well hidden, and I gave up after a few minutes of searching.

[1] http://schema.org/docs/gs.html#schemaorg_expected

osma commented 6 years ago

@kcoyle schema.org uses domainIncludes and rangeIncludes properties, so it's not just notes. And this is RDF compliant, but not really RDFS and certainly not OWL. But other than that you're right. This would be more of a recommendation for the range, not a rule that absolutely must be followed.

kcoyle commented 6 years ago

Thanks, Osma - now I see the properties. They are under the schema.org namespace. I'm a bit nervous about adding the schema.org namespace to the DCMI terms ontology. I realize it has great traction, but I'm concerned about its proprietary nature, and that it hasn't gone through one of the more open standards processes. What other options do we have today? Notes?

osma commented 6 years ago

First of all we need to decide between three directions:

  1. Making the current property ranges stricter (e.g. making them owl:ObjectProperty instances)
  2. Keeping them defined as they are, just adjusting notes
  3. Loosening the definitions to match actual usage, perhaps similar to how schema.org does it

I like option 3 as noted above. I think option 1 is a no-go for many reasons so I just added it for completeness.

If we go for option 3, then I think using something like schema:rangeIncludes is better than just documentary notes, because it provides at least a little bit more machine-readability. We may not want to use the schema.org properties directly (I wouldn't object).
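To make the contrast concrete, here is a minimal Turtle sketch of what option 3 could look like for dcterms:created, borrowing schema:rangeIncludes. This is purely illustrative, not a proposal; the listed types (including edm:TimeSpan, per the Europeana usage mentioned in the opening post) are my own choice of example.

```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix schema:  <http://schema.org/> .
@prefix edm:     <http://www.europeana.eu/schemas/edm/> .

# Illustrative only: a loosened declaration for dcterms:created
dcterms:created a rdf:Property ;
    rdfs:label "Date Created"@en ;
    # instead of the hard constraint: rdfs:range rdfs:Literal
    schema:rangeIncludes rdfs:Literal, edm:TimeSpan .
```

A consumer reading this knows a literal or a timespan resource is expected, but nothing is formally entailed if some other value appears.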

tombaker commented 6 years ago

On Sun, Jun 03, 2018 at 10:24:13AM -0700, Osma Suominen wrote:

First of all we need to decide between three directions:

  1. Making the current property ranges stricter (e.g. making them owl:ObjectProperty instances)
  2. Keeping them defined as they are, just adjusting notes
  3. Loosening the definitions to match actual usage, perhaps similar to how schema.org does it

I like option 3 as noted above. I think option 1 is a no-go for many reasons so I just added it for completeness.

I also prefer option 3. As I see it, we would need to do the following:

What story would we tell? Something that starts with: "Our understanding of Web semantics has evolved over the past decade. Experience shows that..."?

If we go for option 3, then I think using something like schema:rangeIncludes is better than just documentary notes, because it provides at least a little bit more machine-readability. We may not want to use the schema.org properties directly (I wouldn't object).

Using a Google namespace could raise issues with the ISO process; I'd have to ask. On the other hand, the ISO process does not require us to use URIs for things like name and label, so we could perhaps simply define them in the ISO draft with words, on our own authority.

I see no reason we couldn't create equivalents in /terms/ and map them to Schema.org, though I have some hesitation about adding them to /terms/ and not, for example, /dcam/ or some other DCMI namespace, as they would be the only two properties in /terms/ with domains of Property.

osma commented 6 years ago

Like @tombaker I don't think that domainIncludes and rangeIncludes would fit under /terms/. Using another namespace would be preferable. They are on a different abstraction level.

I checked on LOV whether there are any other vocabularies that define a rangeIncludes property, and discovered OLCA (Ontology Loose Coupling Annotation), a small ontology originally created to describe the Lingvoj linguistic data set. Unfortunately there are problems with URIs not resolving, apparently due to lingvoj.org having moved under another domain, linkedvocabs.org. After some educated guessing I found the definition of the OLCA properties here: http://linkedvocabs.org/lingvoj/olca_v1.0.ttl

The OLCA vocabulary/ontology is:

A vocabulary defining annotations enabling loose coupling between classes and properties in ontologies. Those annotations define with some accuracy the expected use of properties, in particular across vocabularies, without the formal constraints entailed by the use of OWL or RDFS constructions

The olca:rangeIncludes property is described like this:

A loose coupling of a property to possible or expected values. This annotation is to be used when one does not want to enforce formally the coupling by rdfs:range or some owl:Restriction constraint.

Do you think adapting or adopting OLCA (instead of schema.org) would make sense? We would have to work with the authors to fix the URI resolution issues.
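For comparison, the same loosening expressed with OLCA might look like the sketch below. Note the olca: prefix binding here is my guess based on the ttl file above; given the URI resolution problems, the actual namespace would need to be confirmed with the authors.

```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix edm:     <http://www.europeana.eu/schemas/edm/> .
@prefix olca:    <http://www.lingvoj.org/olca#> .  # binding uncertain, see URI issues above

# Illustrative only: loose coupling of dcterms:created to expected value types
dcterms:created olca:rangeIncludes rdfs:Literal, edm:TimeSpan .
```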

osma commented 6 years ago

A blog post from 2013 about OLCA also mentions it as a possible solution for the DC range issues.

kcoyle commented 6 years ago

Tom, is there a specific problem with the Google namespace? Or is the problem with using something that is not in the DCterms namespace?

tombaker commented 6 years ago

@kcoyle I'm wondering if ISO would accept it if we were to reference a Google namespace. On the other hand, we do not provide URIs for other elements that are descriptive of the terms in ISO 15836, such as "name" and "label", so for the purposes of the ISO standard, I think we could simply define them in words ("on our own authority"). That still leaves the question of what URIs we would want to use when publishing DCMI Metadata Terms in RDF - Schema.org, OLCA, or some DCMI namespace, all three of which are possible.

osma commented 6 years ago

Agree - I don't think the particular URIs we choose are of concern to ISO, as they would only appear in the RDF data published by DCMI, not in the ISO standard itself. We could use something like "expected type" in the documentation, as schema.org does.

tombaker commented 6 years ago

@osma Nice blog post! I'm not sure I like the word "enforce" in the definition for olca:rangeIncludes. Is there a more formal vocabulary definition somewhere? @kcoyle - you did not find a formal vocabulary definition for schema:rangeIncludes, correct?

osma commented 6 years ago

schema:rangeIncludes is defined here. The embedded triples are:

schema:rangeIncludes
    schema:domainIncludes schema:Property ;
    schema:isPartOf <http://meta.schema.org/> ;
    schema:rangeIncludes schema:Class ;
    a rdf:Property ;
    rdfs:comment "Relates a property to a class that constitutes (one of) the expected type(s) for values of the property." ;
    rdfs:label "rangeIncludes" .

There's a bit of a chicken-and-egg situation here where the rangeIncludes definition uses the property itself to relate it to the intended range (schema:Class).

osma commented 6 years ago

Here is the original proposal to define schema:domainIncludes and schema:rangeIncludes properties, which explains the idea in a bit more detail. (This was also linked from the blog post mentioned above)

It appears that the schema.org properties were intended just for defining the schema.org vocabulary itself, not necessarily for use by the wider world, unlike OLCA which aspires to be a vocabulary that anyone could use to define their property domains and ranges.

tombaker commented 6 years ago

@osma @kcoyle For rangeIncludes, it looks like a choice between:

I slightly prefer the more straightforward Schema.org definition but could go either way. What if anything do we know about the persistence plans for the OLCA vocabulary? Would the proposal be to replace all rdfs:range declarations with rangeIncludes (and likewise for domain declarations for the collection-related properties)?

If so, the conversion would not be completely mechanical. In the case of dct:coverage, for example, we could take the opportunity to deprecate the hideous dct:LocationPeriodOrJurisdiction class and replace it as a range with three separate rangeIncludes: Location, PeriodOfTime, and Jurisdiction.

In addition, I would argue for replacing dct:subject rdfs:range rdfs:Class with dct:subject x:rangeIncludes skos:Concept.
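Sketched in Turtle, with x: standing in for whichever rangeIncludes namespace we settle on (the x: binding below is a placeholder, not a real namespace), the two changes above would read roughly:

```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix skos:    <http://www.w3.org/2004/02/skos/core#> .
@prefix x:       <http://example.org/ns#> .  # placeholder for schema:, olca:, or a DCMI namespace

# replacing: dcterms:coverage rdfs:range dcterms:LocationPeriodOrJurisdiction
dcterms:coverage x:rangeIncludes dcterms:Location, dcterms:PeriodOfTime, dcterms:Jurisdiction .

# replacing: dcterms:subject rdfs:range rdfs:Class
dcterms:subject x:rangeIncludes skos:Concept .
```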

If we get a few more expressions of support in this thread, we could start to formulate this as a proposal.

osma commented 6 years ago

@tombaker The third choice would be coining a new property, e.g. dcam:rangeIncludes. While I'd prefer reusing already coined properties, neither of the existing rangeIncludes properties looks like an obvious winner as both have their issues.

I don't know much about plans for OLCA. It seems not to be very widely used, and the URL issues indicate that it is perhaps not being very actively maintained. I contacted the person mentioned at the top of the Lingvoj home page about the URL issues - no reply yet.

+1 for the proposal ideas!

tombaker commented 6 years ago

@tombaker The third choice would be coining a new property, e.g. dcam:rangeIncludes. While I'd prefer reusing already coined properties, neither of the existing rangeIncludes properties looks like an obvious winner as both have their issues.

I don't know much about plans for OLCA. It seems not to be very widely used, and the URL issues indicate that it is perhaps not being very actively maintained. I contacted the person mentioned at the top of the Lingvoj home page about the URL issues - no reply yet.

@osma If that is the case, and if the Schema.org properties are not actually promoted for wider use, I see no problem with coining two dcam: properties (domainIncludes and rangeIncludes). We could declare them to be owl:equivalentProperty to schema:rangeIncludes and olca:rangeIncludes. We would use these for publishing DCMI Metadata Terms in RDF, but for the ISO standard we would simply define them with words -- to be precise, in Section 3.2 on page 10 of the latest ISO 15836-2 draft. In DCMIMT and in ISO 15836-2, would we want to label these "expected range" and "expected domain" or "range includes" and "domain includes"?
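A sketch of what the coined dcam: properties could look like; the labels and the exact set of equivalences are among the open questions above (and the olca: binding is uncertain, as noted earlier):

```turtle
@prefix dcam:   <http://purl.org/dc/dcam/> .
@prefix owl:    <http://www.w3.org/2002/07/owl#> .
@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix schema: <http://schema.org/> .
@prefix olca:   <http://www.lingvoj.org/olca#> .  # binding uncertain

dcam:rangeIncludes a rdf:Property ;
    rdfs:label "Range Includes"@en ;  # or "Expected Range" -- undecided
    owl:equivalentProperty schema:rangeIncludes, olca:rangeIncludes .

dcam:domainIncludes a rdf:Property ;
    rdfs:label "Domain Includes"@en ;  # or "Expected Domain" -- undecided
    owl:equivalentProperty schema:domainIncludes .
```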

In addition, we would need to formulate both a rationale specifically for changing the ranges and an amendment to the Namespace Policy. For example, can we claim that after ten years, we understand a bit better how ranges are used (or not used)? Have we learned from the example of Schema.org (and if so, what specifically have we learned)? Are the ranges for DCMI Metadata Terms currently defined too tightly? Would we want to allow semantics to be loosened in general? We do not need to write an essay -- just enough to document our thinking, explain the decision in the news feed, and clarify the principles by which we now and in the future can justify semantic changes that involve "changes of meaning ... likely to have substantial impact on either machine processing of DCMI terms or the functional semantics of the terms" without triggering a change of URI.

danbri commented 6 years ago

Ok, a bunch of things going on here.

On the basic idea

First off, I remember @rdaniel at (I think) DC-7 in Finland, the week RDF was announced by W3C, warning us of the dangers of prematurely creating an overly strict types-and-properties model for DC. Seems he was right, even if we waited a few more years before doing so.

I wholeheartedly support the general idea of making the type/property associations for DC terms more flexible, mix-and-match, loosely coupled, or however else we choose to phrase it.

On the RDFS design

I don't think anyone has written down how we ended up with the current rdfs:domain and rdfs:range design, so here's an attempt.

I was involved (as issue list maintainer and spec editor) in the W3C RDFS group from October 1997 until it fizzled out around 1999-2000, when W3C management lost confidence in RDF due to the general industry and Advisory Committee preference for XML. During that period the design went back and forth between the current rdfs:domain and rdfs:range and variations on that approach. The WG discussions at the time were restricted to W3C members, i.e. pay-to-play, and the WG archives remain secret afaik until the end of time. I'll summarize here.

e.g. in August 1998 draft, https://www.w3.org/TR/1998/WD-rdf-schema-19980814/

That the value of a property should be a resource of a designated class. This is expressed by the range property type. For example, a range constraint applying to the 'author' property type may express that the value of an 'author' property must be a resource of class 'Person'.

That a property type may only be used on resources of a certain class. For example, that a property type of 'author' could only originate from a resource that was an instance of class 'Book'. This is expressed using the domain property type.

In the earlier April 1998 draft, https://www.w3.org/TR/1998/WD-rdf-schema-19980409/, by contrast, there was an awkward attempt to be more "object oriented", and instead of "domain" we named the property in the opposite direction and said:

That the resource may have properties of a given property type. For example, that a resource of class 'Book' may have a property of type 'author'. This is expressed using the allowedPropertyType property type. This constraint allows one -- in effect -- to implement domain-constraints for property types.

This was always an awkward fit with the open world RDF model, since we wanted types to also support unanticipated properties defined by others (taking on board DC's "Warwick Framework" concerns). The resulting compromise within the RDFS WG was that rdfs:range in 1999/2000 ended up implying type membership, while rdfs:domain was left as indicating only a weak hint of a type/property association.

The March 2000 Candidate Recommendation, https://www.w3.org/TR/2000/CR-rdf-schema-20000327/#s3.1.3, said:

[...] used to indicate the class(es) on whose members a property can be used. A property may have zero, one, or more than one class as its domain. If there is no domain property, it may be used with any resource. If there is exactly one domain property, it may only be used on instances of that class (which is the value of the domain property). If there is more than one domain property, the constrained property can be used with instances of any of the classes (that are values of those domain properties).

We explicitly noted the asymmetry:

Note: This specification does not constrain the number of rdfs:domain properties that a property may have. If there is no domain property, we know nothing about the classes with which the property is used. If there is more than one rdfs:domain property, the constrained property can be used with resources that are members of any of the indicated classes. Note that unlike range this is a very weak constraint.

This design was in place in the "Proposed Recommendation" of March 1999, https://www.w3.org/TR/1999/PR-rdf-schema-19990303/#constraints ("If there is more than one domain property, the constrained property can be used with instances of any of the classes (that are values of those domain properties).").

In May 1999, the expectation was that this design was going to be blessed by W3C as a Recommendation. In fact I flew to Toronto for the WWW8 conference with the expectation that we would see that announced during the conference. However... that was not to be. Members of the newish XML Schema WG had raised an alarm that RDFS was about to be made a REC, and their objections led to RDFS being put on hold while the relationship between RDF and XML was reconsidered. W3C Members can read the archived threads.

The result of this (other than the "Cambridge Communiqué") was that W3C effectively dropped out of WG-track work on RDF for about 2 years, and the effort was kept alive largely by contributors in the public W3C Interest Group. The March 2000 "Candidate Recommendation" was put together after the RDFS WG had effectively stopped meeting, and things went on hold. It summarized the final state of the RDFS design and encouraged implementor feedback. We then set up a public issue tracking page, which Brian McBride from the HP Jena team played a leading role in coordinating. When we finally got another W3C WG for RDF in March 2001, Brian became the lead chair of the RDF Core WG, which we (I co-chaired, and had joined the W3C team as a visiting engineer by this point) chartered to

[...] address questions and issues raised on the public comments feedback list and the RDF Interest Group list during the Candidate Recommendation period and will produce an updated W3C specification.

In practice, our strongest feedback on the rdfs:domain matter came from TimBL who objected to the 1999 design on the basis that the property as designed was strictly meaningless because it did not license any inferences. This view bubbled out of the W3C MIT SWAD discussions into public space, originally via Ralph, and then via TimBL. I was in @timbl's MIT office when these things were discussed around a whiteboard. At the time, Semantic Web was strongly identified with inference and logic, and so the weaker association of "this property kinda sorta goes with this type" was frowned upon:

Ralph:

At best, the domain property we've defined permits determination that no known constraints have been violated. This is what the Working Group intended as far as I can tell, largely at my own recommendation. But I'm having second thoughts.

I haven't had a chance to examine other implementation work in detail to see how people have used rdfs:domain. At a minimum, it might be appropriate to change its name so that it is more clearly distinguished from rdfs:range which does allow inferencing.

TimBL:

I would like to reinforce Ralph's mild comments more strongly. The current wording implies that the subject of a property can be in any class for which rdfscr:domain(p,s) applies. What can one tell from the assertion rdfscs:domain(p,a)? Nothing.
You know that the real:domain of p is some superset of s. In other words, there is some class t where subset(s,t) and real:domain(p,t). However, the universal set is always a superset of s and is also always a real:domain of p, in the sense that anything which is the subject of p must be in the universal set.

So this condition is always true, so we have learned nothing.

Put another way, "may be" does not translate into logic.

Tim BL

This was how we ended up with RDF Core's final RDFS REC having the same approach to rdfs:domain and rdfs:range, rather than strict semantics only for the latter.

Meanwhile, we had also had, within the RDFS WG, a thread of discussion around the idea of class-specific constraints. See the (W3C Member-only) issue. In the original 1997-1999 RDFS WG I think we backed away from this in the interests of stripping the design down to something that had WG consensus. In the later 2001-2004 RDF Core WG we avoided the topic because it would tread on the anticipated toes of the newer Web Ontology (OWL) WG, who wanted to make a more powerful framework for defining RDF vocabularies.

Why mention all this?

FWIW the original W3C RDF and RDFS design was heavily based on Guha's older MCF work, which via Netscape was submitted to W3C in 1997 (see the data model overview). MCF-in-XML, as far as I can see, treated range and domain the same way as the eventual RDF Core design.

On using schema.org's domainIncludes and rangeIncludes properties

There are a few reasons you see weakened expectations around type/property associations in Schema.org. One is that in 2011 at launch, there was zero Schema.org markup in the public Web, and a long history of RDF projects failing. The design therefore tilted strongly towards making things easy for publishers even if it made a bit more work for consumers. That accounts for the tone and wording you see in the getting started page.

Secondly, Schema.org is both fairly large, and very cross domain. Consequently we can't afford to add into this large set of terms anything that is purely motivated by technology artifacts, i.e. being forced to create a fairly bogus ThingThatHasDuration type, just so rdfs:domain can imply that movies, events, music releases, etc are all acceptable places to see a duration property appear.

Thirdly, the vocabulary is a living changing thing. We keep improving it and cannot always be 100% sure where we will tweak things. As such it seems more appropriate to set weak expectations about inferring types from properties.

Besides, inferring a type from the presence of a property is generally pretty boring. It is much more interesting to infer things based on identity reasoning. Rather than concluding "oh, X must be a ThingThatHasDuration" it is much more fruitful to try to conclude "ah, these two descriptions are referring to the same entity".

I wouldn't worry about using schema.org's domainIncludes and rangeIncludes properties; there is not a great deal of value in re-use here, although it would be harmless enough to do so. The only reason we have them (and have them hidden away in our "meta" area), is because the whole schema.org site is mechanically generated from RDF/RDFS(ish) definitions, and so we needed something that could be used in the definition files.

On the concern that "domainIncludes" / "rangeIncludes" aren't standard

Karen wrote,

I realize it has great traction but I'm concerned about the proprietary nature of it, and that it hasn't gone through one of the more open standards processes.

Since we don't need to use schema.org/domainIncludes etc here, this is just an aside, but since you raise it:

Having been involved in almost all aspects of this story since 1997 (MCF implementor; RDFS issue list and spec editor; RDF Core co-chair; RDFIG chair; Schema.org dogsbody, ...), I can hand-on-heart say that the design discussions we've had at Schema.org have been substantially more inclusive, consultative, responsive to implementors of all kinds, than the series of historical accidents that led to the current and somewhat arbitrary rdfs:domain and rdfs:range design (or for that matter FOAF, which was a bunch of friends on a mailing list trying to build something interesting). We did our best in the old RDF/RDFS and RDF Core groups but I wouldn't put them on a pedestal above the more public-participation efforts at DC, FOAF and Schema.org. In many cases it's the same people involved anyway...

rdaniel commented 6 years ago

@danbri said:

I remember @rdaniel at (I think) DC-7 in Finland, the week RDF was announced by W3C, warning us of the dangers of prematurely creating an overly strict types-and-properties model for DC.

Dan's memory is better than mine, but it sounds like something I would have done. Certainly, books and authors is my favorite example of bad domain and range constraints since not all authors are persons and not all authored things are books.

I'll also echo his suggestion that we not put the RDF committee on a pedestal. If only I knew then … ¯\_(ツ)_/¯

Best regards,

Ron Daniel Jr., Ph.D., Director, Elsevier Labs

[2018-06-05: @tombaker edited post below this line to delete redundant copy of @danbri posting.]

kaiec commented 6 years ago

Wow, I really learned a lot from this thread, thank you. I also support the solution of coining dcam:domainIncludes/rangeIncludes. I think it fits perfectly into the DCAM namespace, and if we extend and redefine DCAM so that it actually matches the practical use of dcterms, it might even get some traction after all (at least we are not the only ones who have problems with the strictness of RDFS/OWL).

makxdekkers commented 6 years ago

If I may, I'd like to bring up what I think is the consequence of a decision to relax the range declarations of properties. One possible, and I think likely, result is that a lot of data providers will stop bothering about strings vs. things. For example, I am not sure that anyone will still bother to identify people and organisations with URIs; after all, a publisher might have a name at hand and would have to look up (or even worse, create) a URI, which costs time and money. That's back to how things were done in XML and even before that, and it brings back the usual problems of spelling errors, transliteration and clashes.

I agree with Dan's characterisation that it makes things easier for publishers at the expense of making things harder for consumers of data. In the case of schema.org, the set of intended consumers is small (the main search engines) and they have an interest in making the best of the data out there. But other consumers will also need to include branches in their software to cope with two options in the data they consume. This raises the question of how to match incoming data: for example, what does a consuming application need to do to find out who "Tom Baker" is -- https://en.wikipedia.org/wiki/Tom_Baker, https://en.wikipedia.org/wiki/Tom_Baker_(American_actor), etc.? This was the problem that identification with URIs was supposed to solve.

Also, it throws away the whole idea of follow-your-nose, because if most of the data are strings, there's nothing to follow. Is it then that we're really taking the view that we can do away with the "semantic" in the Semantic Web and the "linked" in Linked Data? I'm not arguing it's "wrong" to abandon these ideas, just that we might want to be explicit about it.

kcoyle commented 6 years ago

My gut feeling is that people provide URIs because they want to do something with them, not because of how a property is defined. I agree, though, that defining an object property is a "push" toward the use of identifiers.

However, Makx brings up an important question which is: what is the likely impact of such a change? Do we assume that when the property is an object property that consuming software will reject/ignore any non-IRI values? What do we expect users who do not have identifiers for their desired values to do? Should we recommend that they only use DC1.1? This is where definitions and notes could make a big difference. Another possibility is a re-write of the Guidelines document, originally written by Diane Hillmann and probably considerably out of date at this point. It could be written like a primer.

osma commented 6 years ago

@makxdekkers Point taken. I think the problem is that even the current ranges are not respected, and in many cases the range declarations are rather awkward, as the opening post explains. Do you have any dct:SizeOrDuration instances in your system?

I'm not arguing that we should abandon URIs, follow your nose etc. The rangeIncludes statements are meant to say what the intended type of value is, but without forcing inferences that would cause problems if it's actually not an instance of that type. I seriously don't think what DCterms says has much impact on what data providers do w.r.t. identifying things by URI or not. A lot of supposedly Dublin Core metadata is not expressed as RDF at all; I think that's why we are getting many literal example values from the ISO side that we have to deal with (in other issues in this github repo such as #21 and #22).

makxdekkers commented 6 years ago

@osma OK, I think you're arguing that it's fine for people to use strings -- because that's what people already do. I acknowledge that there is data out there that does not respect the rules. But it feels to me like abandoning traffic lights because a lot of people jump the red light. It feels like giving up on Linked Data.

osma commented 6 years ago

@makxdekkers This cuts both ways. From the OP:

dct:created: range of literal, but used by Europeana and UC Santa Barbara with EDM:Timespan objects.

So should we now complain to Europeana and UC Santa Barbara that they use things when they are supposed to use strings?

With all due respect, I don't think the rules for many of the DC properties were very good in the first place (unlike traffic lights). RDFS is an awkward language for data validation purposes (and OWL is much worse!) because of the way domains and ranges work: they mostly infer new statements instead of providing helpful validation functionality. So if you "violate" a domain or range you will often just get nonsensical inferences but no errors per se. This makes domain/range statements dangerous: you never know what will happen if you violate them, so better not to try -- maybe find another property or coin your own just to be safe.
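To spell out why a range "violation" yields nonsense rather than an error, consider dcterms:created (declared range rdfs:Literal) under plain RDFS entailment; the ex: resources below are made up for illustration:

```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:      <http://example.org/> .

# Published declaration:
dcterms:created rdfs:range rdfs:Literal .

# Data that "violates" the range by supplying a thing instead of a string:
ex:record dcterms:created ex:timespan1 .

# An RDFS reasoner raises no error here. Instead, the range entailment
# rule (rdfs3) simply infers:
#   ex:timespan1 a rdfs:Literal .
# i.e. a resource is concluded to be a literal -- nonsensical, but
# perfectly valid RDFS inference.
```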

I think we should encourage data publishers (and especially AP authors) to use a model that works for themselves and, as much as possible, to the rest of the world. Identifying things by URIs is obviously a part of that. But the current domains and ranges, specified with RDFS, are not helpful. I've personally several times resorted to dc1.1 properties or coined my own because the dcterms ranges were too strict for the purpose.

makxdekkers commented 6 years ago

@osma Respect gracefully taken ;-) And duly returned!

Yes, it cuts both ways. Using a timespan object for the creation date is wrong according to the current specification, and it will confuse consuming applications that have no way of handling such values.

I agree it is really a case of being accommodating toward data publishers and putting the burden on the consumers -- maybe from the perspective that, if consumers are really interested in the data, they should bear the burden of making sense of it.

It's somehow like a reverse Postel's law.

I understand the argument. I am just saying that DCMI may want to be explicit about abandoning one of the leading principles that led to the development of dcterms alongside the original Dublin Core Metadata Element Set.

danbri commented 6 years ago

On @makxdekkers 's point,

For example, I am not sure that anyone will still bother to identify people and organisations with URIs; after all, a publisher might have a name at hand and would have to look up (or even worse, create) a URI, which costs time and money. [...] I agree with Dan's characterisation that it makes things easier for publishers at the expense of making things harder for consumers of data. In the case of schema.org, the set of intended consumers is small (the main search engines) and they have an interest in making the best of data out there.

From the schema.org perspective, we would certainly like more consumers, and have over time gradually moved away from the initial "anything goes" perspective, introducing somewhat more reliable structures. Eventually, though, we have taken an approach that aligns to the messy reality of the data that is out there, rather than the data that we wish were out there. Dublin Core, also, always positioned itself as a pragmatic and realistic common representation. I learned the phrase "what you do in your own database is your own business" at a DC conference. If those databases contain entity-oriented models and well-known identifiers, absolutely we should provide a good-practice recommendation for encoding that (in both/either DC and Schema.org).

On this point: "...would have to look up (or even worse, create) a URI, which costs time and money", I think @makxdekkers expresses well the opposite point to that apparently intended. It is not our job here to use the technical tools of schema definition to force metadata publishers to spend time and money. We ought to gently encourage improvements in data publication, curation and cleanup, rather than shut out those parties who are custodians of messier or more ambiguous data and who can't reach the higher standards of ideal Linked Data practices.

More carrot, less stick, I'd suggest. With the likes of ShEx and SHACL available, it ought to be possible to document preferred data patterns (as "application profiles" etc.), while the underlying dictionary of terms allows for more varied forms of expression. If I can be excused another culturally specific metaphor, this "double decker" approach to schema specification allows neats and scruffies to go on the same journey. Scruffy data can get value, inclusiveness and extensibility from common dictionaries of terms (RDF/RDFS); neater data can be policed more vigorously using the emerging RDF validation languages. More than this, we ought to be able to use those new formats to document specific concrete/practical incentives for metadata publishers to spend that time and money on cleanup.
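
As a rough sketch of what the "upper deck" of such a double-decker approach might look like, here is a hypothetical SHACL application profile (the shape and class names are invented for illustration, not taken from any published profile). The vocabulary itself stays permissive; only communities that opt into this shape get the stricter policing:

```turtle
@prefix sh:      <http://www.w3.org/ns/shacl#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:      <http://example.org/profile#> .

# A hypothetical community profile: the "neat" layer on top of
# a loosely specified vocabulary of terms.
ex:StrictDocumentShape
    a sh:NodeShape ;
    sh:targetClass ex:Document ;
    sh:property [
        sh:path dcterms:created ;
        sh:datatype xsd:date ;      # neat data: typed literals only
        sh:maxCount 1 ;
    ] ;
    sh:property [
        sh:path dcterms:creator ;
        sh:nodeKind sh:IRI ;        # neat data: things, not strings
    ] .
```

Scruffier data would still be valid RDF against the bare vocabulary; it would simply not conform to this particular profile.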

I don't think this is giving up on the Linked Data idea. There always was a pragmatic side to the RDF community, prior to the "Linked Data" slogan: Dublin Core, FOAF, SKOS etc were the ancestors to the Linked Data reboot of Semantic Web. There was hypertext RDF before it got called Linked Data. Unfortunately, the later Linked Data community picked up their own kind of religiously strict thinking, and it became taboo to publish mentions of entities without manually tying them to DBPedia or similar. I suggest that this side of Linked Data has had value but also made publishing needlessly costly, and for as yet under-demonstrated benefits. I'm arguing only for a return to RDF's original pragmatism.

The first talk I ever gave on RDF was at @makxdekkers's invitation; it showed bnodes for e.g. person entities while encouraging a move towards common URIs where they make sense and can be agreed. I think we're still heading in that direction, and that we won't get there any faster by disallowing URI-less data to be expressed via DC terms.

makxdekkers commented 6 years ago

Allow me to get a bit excited about this issue. Discussions with @danbri often do that to me, something I greatly enjoy.

What I'd like to bring into the discussion is the issue of requirements and objectives.

In my mind, if you have an environment where there are many data publishers, completely independent and all over the place, with varying levels of semantic commitment in the data they publish, and a central organisation that is willing to consume the heterogeneous data collection and make sense of it based on the volume of data collected, it's in everybody's interest to work with a very lightweight commitment. The consumer (e.g. a search engine) is happy to harvest as much stuff as possible, and the publishers are happy that someone out there tries to use their data and maybe even enrich it. In addition, the central consumers may not necessarily be interested in the "best" result in a semantic sense -- for example, if they can't resolve all the "Tom Baker"s, they may still be able to provide a useful service based on a best guess.

If, on the other hand, you have an environment where data publishers know their stuff, e.g. they know which Tom Baker is being referenced, and where consumers are interested in high-quality data that is interoperable across a group of publishers, you may want to require a certain level of semantic commitment in order for a particular kind of service to be possible. For example, getting things identified by URIs makes it possible to connect data without having to rely on analysing context.

I think that DC -- and related vocabularies like DCAT -- work in the second type of environment that supports interoperability, while schema.org is very successful in the first environment that supports SEO.

My worry is that if DCMI decides to remove, or at least very much reduce, the recommended level of semantic commitment in DC, it might become more useful in the first environment, at the expense of being less useful in the second, sacrificing interoperability to gain visibility in search engines.

I have always thought that DC was mainly about interoperability, which in my mind requires some level of commitment -- including distinguishing between strings and things and relying on unique identifiers for things. If DC is no longer primarily about interoperability, I agree it could make sense to relax the rules to allow DC to play better in the "general" web.

My question then becomes: if DC ends up specifying a similar level of semantic commitment as schema.org, why maintain both?

danbri commented 6 years ago

@makxdekkers - perhaps some of this discussion would be better conducted over beers at the DC conference? In the meantime I'm afraid I should say that this characterization feels to me to do a disservice to those many data publishers who, as professionals, absolutely know their stuff, but still carry the burden of having incredibly complex, heterogeneous, legacy, or otherwise problematic metadata collections. All the expertise in the world isn't going to fix their data. The issue isn't "SEO harvesting" versus high-quality interoperability; it's about the appropriate levels at which to characterize useful data patterns.

Dublin Core always was about the general problem and the general Web. As a community, we may not have succeeded in our original ambitions, and have instead found deeper adoption amongst public sector and cultural heritage informaticians, but it feels inaccurate to say that Dublin Core never tried. "Better To Try And Fail Than Never To Try At All", etc.

1.2 Scope The size and complexity of the resource description problem required limiting the scope of deliberations. Given that the majority of current networked information objects are recognizably "documents", and that the metadata records are immediately needed to facilitate resource discovery on the Internet, the proposed set of metadata elements (The Dublin Core) is intended to describe the essential features of electronic documents that support resource discovery. Other important metadata elements, such as those describing cost accounting or archiving information, were excluded from consideration. It was recognized that these elements might be included in a more complete record that could be derived from the Dublin Core by a well-defined extension.

1.3 The Intended Niche The Dublin Core is not intended to supplant other resource descriptions, but rather to complement them. There are currently two types of resource descriptions for networked electronic documents: automatically generated indexes used by locator services such as Lycos and WebCrawler; and cataloging records, such as MARC, created by professional information providers. Automatically generated records often contain too little information to be useful, while manually generated records are too costly to create and maintain for the large number of electronic documents currently available on the Internet. Records created from the Dublin Core are intended to mediate these extremes, affording a simple structured record that may be enhanced or mapped to more complex records as called for, either by direct extension or by a link to a more elaborate record.

makxdekkers commented 6 years ago

Beer sounds good. However, I would still want to see further discussion of the requirements and objectives on a public platform, to make sure there is a wider consensus and understanding of the consequences.

kcoyle commented 6 years ago

@makxdekkers I don't think that we can assume that DC terms (used widely and in vastly different situations) can themselves meet specific objectives, and in particular can meet those objectives through RDF/RDFS ontology definitions. Those definitions only provide inferencing axioms for domains and ranges. The RDF definition of DC terms cannot tell you important information about metadata usage such as cardinality of properties, validity rules for values, etc. While it may be advantageous to a consumer of metadata to expect a URI value, that doesn't help you if you receive an entire dataset with "http://example.com/" as values. RDF domains and ranges are not sufficient for data validation. The only use of domains that I observe in actual data is sub-classing to facilitate SPARQL queries. Note that classes can be assigned in instance data, and therefore do not need to be baked into the vocabulary.

I think that the RDF domain/range is of little utility for us because it only speaks to inferencing, which few in our community make use of, AFAIK. Domains and ranges have few practical consequences. If we want to document expectations of use, then we need application profiles. As we don't have a schema for APs at this time, I would advise either expanding the use of notes (they are quite terse at the moment) and/or providing a separate guidance document to help people learn "best practice" usage.

tombaker commented 6 years ago

@kcoyle

I would advise either expanding the use of notes (they are quite terse at the moment) and/or providing a separate guidance document to help people learn "best practice" usage.

Now that the website has been migrated to a static site generator, using Markdown files, which is making it easier for us to restructure the whole website (the results of which will become visible before DC-2018), Paul and I want to set up the site to generate HTML and RDF representations not just for DCMI Metadata Terms as a whole, but also for individual properties and classes, one term per page. We are exploring ways to crowd-source usage pages (e.g., through pull requests), which would be merged into the individual term pages every time the site is built. A separate guidance document may also be needed, but that document, too, could potentially be generated from (the same) separately maintained pieces.

makxdekkers commented 6 years ago

@kcoyle I am not talking about inferencing, which I agree is not used much. I am just wondering where the added value of DC over schema.org lies. Looking at it from my own perspective: if a data publisher asked whether to use DC or schema.org, and both required a similar level of semantic commitment, I might be inclined to suggest schema.org, as it has the additional benefit of being recognised by search engines. My earlier question was: why would we need both if they are very similar? I have always used the argument that the additional semantic commitments in DC are a benefit for interoperability, but that argument would be voided if DC lowered the bar on commitments.

kcoyle commented 6 years ago

@makxdekkers If your goal is discoverability by search engines with web pages, then definitely you would use schema.org. DC, AFAIK, is not given much weight by search engines. If you are defining your own metadata, not HTML markup, then you have a choice, but for some things in the GLAM area schema.org has significant problems (e.g. creative works are the only works; authors are the only creators). So for that I would choose DC.

@tombaker I do not consider a listing, even a nice listing, of terms to be sufficient documentation. Users need an overview, a logical map, some information about motivation and choices. Terms lists are just term lists - like dictionaries, they are great for looking things up, but not much beyond that.

tombaker commented 6 years ago

@kcoyle I certainly take your point, and if you have specific ideas about how we can improve on the current DCMI Metadata Terms document, I'd love to hear them because it is fairly high on our list to re-design how the terms are presented.

kcoyle commented 6 years ago

I found the LOD4all statistics which can give us some information on the usage of DC terms. I haven't the time this week to give this a more thorough analysis, but if you look at a specific term, e.g. subject you can see that it has been defined in some vocabularies as an object property, in others as an OWL annotation property, and you can sometimes see definitions of the type:

A term describing the topic covered by the BusinessObject or resource. This is provided as free text in an annotation label or as an identifier pointing to a term in a classification scheme.

I looked briefly at dct:contributor and found:

So if we spend some time looking at these stats I think we can get some concrete evidence about actual usage. I don't know if it is possible to get to instance data through this site. Didn't dig down that far. In addition, all of the ontologies using purl.org got an error message from the Internet Archive (which now hosts purl.org). I'll try to remember to check later and see if things come back.

danbri commented 6 years ago

The owl:AnnotationProperty phenomenon is an artifact of OWL being uptight.

e.g. in the FOAF schema file we had to mark it this way to stop Protege complaining.

  <!--  DC terms are NOT annotation properties in general, so we consider the following 
    claims scoped to this document. They may be removed in future revisions if
    OWL tools become more flexible. -->
  <owl:AnnotationProperty rdf:about="http://purl.org/dc/elements/1.1/description"/>
  <owl:AnnotationProperty rdf:about="http://purl.org/dc/elements/1.1/title"/>
  <owl:AnnotationProperty rdf:about="http://purl.org/dc/elements/1.1/date"/>
  <owl:Class rdf:about="http://www.w3.org/2000/01/rdf-schema#Class"/>
kcoyle commented 6 years ago

@danbri I agree that OWL makes people say things they really don't want to say. It would be ideal to see some actual data and learn if people are using strings or things with these properties. The question posed here is whether DC terms should be given object ranges or not. My approach is: we should look at how people are actually using them, because declaring a range is not a guarantee of obedience, and may be a hindrance to use. I'd go for dropping ranges from the terms ontology but using something like rangeIncludes to give people helpful hints about usage without implying that what they need to do is "wrong". However, it appears that is a bigger change than we may be considering in this immediate project.
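
For concreteness, here is a hypothetical Turtle sketch of what the rangeIncludes approach could look like, borrowing schema.org's `schema:rangeIncludes` meta-property. This is purely illustrative, not a proposed change to the published vocabulary:

```turtle
@prefix schema:  <http://schema.org/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .

# Instead of the hard constraint
#   dcterms:created rdfs:range rdfs:Literal .
# a pair of non-binding hints in the style of schema.org:
dcterms:created schema:rangeIncludes rdfs:Literal ;
                schema:rangeIncludes schema:Date .
```

Because `schema:rangeIncludes` carries no entailment rules, data using other value types is merely unanticipated, not "wrong".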

osma commented 6 years ago

@kcoyle Checking actual usage is a great idea! I think the best way to do it would be to examine the LOD-a-lot data set, which aggregates 28B triples from the LOD cloud. LOD4all that you mentioned would be another option - they have a SPARQL endpoint - but I suspect that running a statistical query on that would be too slow and just time out. LOD-a-lot is available as a LDF endpoint which should make it easier to run a heavy query.
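
Whatever the endpoint, the shape of the statistics query would be roughly the following SPARQL (a sketch only; property and counts chosen for illustration, and a plain SPARQL endpoint or TPF client would be needed to evaluate it):

```sparql
PREFIX dcterms: <http://purl.org/dc/terms/>

# Count how often dcterms:created is used with a literal value
# versus a resource (URI or blank node) value.
SELECT (SUM(IF(isLiteral(?o), 1, 0)) AS ?literals)
       (SUM(IF(isIRI(?o) || isBlank(?o), 1, 0)) AS ?things)
WHERE { ?s dcterms:created ?o }
```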

I don't have time to do this right now - currently traveling with my family - but after 25th June I could take a look if someone else doesn't do it first (wink wink).

stuartasutton commented 6 years ago

In addition to seeing how they have been used, consideration has to be given to whether changes to domains/ranges that loosen the semantics will break existing systems that have relied on them. That may be a question that cannot be answered solely through input from the Usage Committee.

danbri commented 6 years ago

Maybe http://lodlaundromat.org/about/ could be interested in collaborating on this?


danbri commented 6 years ago

I see we just both suggested the same project, more or less. I have dropped them a quick note with a pointer to this thread.


aisaac commented 6 years ago

About the 'risks' of losing semantic richness if we drop the formal domains/ranges (and voluntarily skipping at this stage the debate on whether RDFS was the formalism we needed in the first place).

At Europeana we've made the choice of moving back from DCTerms to DC1.1 for the properties that had formal ranges. This was motivated by the fact that we didn't want to play badly with the DC specifications, but also by the recognition that the ranges wouldn't have helped us get richer data. If there's nothing rich in the source, well, there's nothing rich anyway.

We could have asked providers to provide bnodes with labels instead, or 'fake' resources with only one label statement (something that is sometimes recommended in the context of Schema.org) but that wouldn't have made our message easier to get through.

So we've embarked on the long journey of promoting notions that could be called 'URIfication' or 'semantic enrichment' in our network, to convince providers of the value of more structured data. We're still in the middle of that, and will be at it for a while... but there is definite progress.

In this perspective, having one unique spec forcing the use of resources did not help. In a way we benefited from the (probably unintended) flexibility of DCMI having both the unconstrained 1.1 (to make data processing work) and the more rigid DCTerms (to show the ideal situation). But honestly I think that if we had had one DCTerms spec with softer recommendations (even better, recommendations with usage notes about what they're good for), this would work just as well for our fight to promote LOD across our domain.

danbri commented 6 years ago

"If there's nothing rich in the source, well, there's nothing rich anyway"

This is exactly the point I was trying to make earlier. Even the most sophisticated curators are often managing rather thinly-described records. Demanding structure they don't have is a reliable way to end up with either less data or poor, over-complex data.


aisaac commented 6 years ago

For clarification for @danbri and others, I am not claiming my message raises original points. There were so many points that I could agree with in this mammoth thicket ;-) so I've given up and just tried to express some Europeana-related views, without making all due connections, sorry for that!

tombaker commented 6 years ago

Moving discussion (with a proposal) to Issue #43.