canonical vs lexical names

RDARegistry / RDA-Vocabularies

http://www.rdaregistry.info

canonical vs lexical names #38

Closed VladimirAlexiev closed 10 years ago

VladimirAlexiev commented 10 years ago

(This is a hard one, and I expect it will be rejected, but I still want to voice my concern.)

If I understand correctly, RDA uses canonical (numeric) and lexical (readable) properties. After #26 equivalentProperty is fixed, it'll include statements like

<http://rdaregistry.info/Elements/m/P30137> owl:equivalentProperty rdam:noteOnManifestation

The problem is that under OWL2 semantics, this will double the number of inferrable properties. And if such inference is not performed, you can't query using the properties of your choice, you can query only using the properties used in the particular dataset. Which may cause some datasets to use some props and others to use the other props.
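
To make the doubling concrete, here is a minimal Python sketch of what a reasoner does when it materializes owl:equivalentProperty. It is a toy forward-chaining step, not RDA tooling or a real OWL reasoner; the ex: subjects, the note texts, and the function name are invented for illustration.

```python
# Toy sketch: materialize statements implied by owl:equivalentProperty.
# Subjects (ex:book1, ex:book2) and literal values are hypothetical.
EQUIV = "owl:equivalentProperty"
CANONICAL = "http://rdaregistry.info/Elements/m/P30137"
LEXICAL = "rdam:noteOnManifestation"


def materialize_equivalents(triples):
    """Return the input triples plus every statement implied by equivalences.

    A single pass like this handles direct pairs only; it does not compute
    a transitive closure over chains of equivalent properties.
    """
    triples = set(triples)
    equiv = {}
    for s, p, o in triples:
        if p == EQUIV:
            # The relation is symmetric, so record both directions.
            equiv.setdefault(s, set()).add(o)
            equiv.setdefault(o, set()).add(s)
    inferred = set(triples)
    for s, p, o in triples:
        for other in equiv.get(p, ()):
            inferred.add((s, other, o))
    return inferred


data = [
    (CANONICAL, EQUIV, LEXICAL),
    ("ex:book1", CANONICAL, "Signed by the author"),
    ("ex:book2", LEXICAL, "Limited edition"),
]
result = materialize_equivalents(data)
data_statements = [t for t in result if t[1] != EQUIV]
print(len(data_statements))  # 2 asserted statements become 4
```

With inference, a query on either property finds both books; without it, a query on the canonical property misses ex:book2.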

This comment (https://github.com/RDARegistry/RDA-Vocabularies/issues/24#issuecomment-37457351) quotes Jon Phipps (Jan 23): "Our coining and inclusion of multilingual (eventually) lexical URIs based on the label is a concession to developers who feel that they can't effectively 'use' the vocabularies unless they can read the URIs."

I think this is an understatement. I feel strongly that readable vocabulary URIs are an important feature of any ontology.

In my experience, having to read turtle or jsonld (or god forbid RDF XML) in their "raw" form is part of the daily life of an ontologist or sem web developer. We want to read names, not (only) numbers!
IMHO a big part of the "MARC must die" movement is due to the use of unreadable codes in MARC.
Give me ONE example of a successful ontology that doesn't use readable URIs.
Early variants of CIDOC CRM had numeric-only, but nobody used them, so now they use numeric & English.
I feel that if RDA uses numeric URIs, it will lose out to BibFrame, despite its modeling flaws.

I think you should follow the same approach as CRM and Getty associative relations (http://vocab.getty.edu/doc/#Relationship_Representation; eg gvp:aat2811_preceded for styles, gvp:tgn3412_predecessor_of for nations): use a single RDA namespace, and include both the number and lexical name:

rda:P30137_noteOnManifestation
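
As a rough illustration of the proposed naming scheme, a combined local name could be derived from the element number and an English label as sketched below. The function name and the input label "note on manifestation" are assumptions; the thread only gives the resulting name.

```python
import re


def combined_local_name(number, label):
    """Build a local name such as 'P30137_noteOnManifestation'.

    Hypothetical helper: joins the opaque number with a lowerCamelCase
    form of the label. Not part of any RDA or Getty tooling.
    """
    words = re.findall(r"[A-Za-z0-9]+", label)
    if not words:
        return number
    camel = words[0].lower() + "".join(
        w[:1].upper() + w[1:].lower() for w in words[1:]
    )
    return f"{number}_{camel}"


print(combined_local_name("P30137", "note on manifestation"))
# P30137_noteOnManifestation
```

The number keeps the name unique even if the label part is later shortened or reworded.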

It's a very bad idea to have multi-language variants of the URIs. Translate the labels, but not the URIs.

I understand the big problem is variation of the lexical labels. But hasn't RDA stabilized enough to stop such variation?

Consider using less particular names, then you'll prevent a lot of variation. Eg rda:P30137_note serves just as well, since its domain says Manifestation, and the number distinguishes it from eg "note on instance".
(BTW, what's wrong with using dc:description... but let's not start this discussion)

Cheers!

kcoyle commented 10 years ago

First, I am glad to see that these properties and classes will no longer be defined as "owl:sameAs". At the same time, I too feel that the use of a property to provide an alternate name is a heavy solution.

Doing a quick run through a reasoner, I get 17,406 axioms out of RDA. That is out of 9 classes and 1,190 object properties and 3,499 individuals (the difference is annotation properties). That's about 5 axioms per property. RDA is already "axiom-heavy" because of the sub-classing from un-constrained to constrained properties. It would take a little fiddling (which I don't have time for today) to calculate the number of axioms minus the equivalent property names, but just as a ball-park it should be possible to eliminate around 7K of those axioms by removing the equivalent properties. (Does this make a significant difference when using the vocabulary? I dunno.)

While I agree that working "by hand" with property names like "P200003" is difficult, tools like Protege and TopBraid are able to use the label instead of the property name, so perhaps the need to provide an alternate property is waning as tools improve. Increasingly one only needs human-readable labels -- which can also be safely provided in multiple languages. There are some dangers in using natural language terms for property names, and as time goes on we should all rely more on labels than reading IRIs. That said, I agree that human-readable property names are the norm today - which doesn't mean it's the best solution.

jonphipps commented 10 years ago

Let's make sure we're talking about the actual problem, and ignore the triviality of the potential of millions of essentially redundant URIs (it really is relatively trivial).

To restate the problem:

RDA (in the context of this discussion, where we're not talking about the rules), is not an 'English' ontology, it is an abstract data model with entity attributes and concepts.

Let's pretend for a moment that these attributes and concepts are labeled and defined in semantically consistent (that's important) Chinese, Korean, Pashto, Arabic, Hebrew, Danish, Spanish, French, Russian, Italian, etc. (apologies to the thousands of people reading this comment who feel left out of the list -- trust me the list is as random as the rest of my brain). And then let's pretend that English isn't on the list at all, but a significant number of developers creating software that might use the data model only speak English and have to learn one of the languages in which it is available or have someone translate for them (this mirrors the experience of the non-English-speaking world). What would you use as a URI for these attributes and concepts?

I am by the way very tired of discussing this issue with English-speakers, especially the ones that say 'English in the URIs or I'll use something else'. Please offer an effective non-English solution, because English in the URIs is not an option that's on the table. I'm personally willing to listen.

There are a number of ways to make the use of lexical aliases (we're not calling them URIs at the moment) less 'painful'. The first is of course to ignore their existence and only use the canonical URIs in your data. Language-specific versions of the ontologies will be available, as the English version is now, that only contain canonical URIs and lexical aliases (as well as labels and definitions) for that particular language. The canonical version will contain all available languages. If you have a need for a multilingual approach with specific languages, you can reference the language-specific ontologies in your data and just load the ones you need. All will include the same canonical URIs, along with labels and definitions in the selected languages.

Dereferencing a lexical alias through the RDA vocab server will return a 308 redirect to the canonical URI. Well-written clients should recognize a permanent redirect and never request the lexical URI again, internally substituting the canonical URI for the lexical.
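
The client behaviour described here can be sketched as follows, assuming a pluggable fetch function in place of real HTTP. The class name, the example.org alias URI, and the fake server are all hypothetical; only the canonical URI and the 308 status come from this thread.

```python
# Sketch of a client that remembers permanent redirects, so a lexical alias
# is requested at most once and the canonical URI is used afterwards.
CANONICAL = "http://rdaregistry.info/Elements/m/P30137"
LEXICAL = "http://example.org/lexical/noteOnManifestation"  # hypothetical alias URI


class AliasResolver:
    def __init__(self, fetch):
        # fetch(uri) -> (status_code, location_or_None); stands in for HTTP.
        self._fetch = fetch
        self._permanent = {}

    def resolve(self, uri):
        """Return the canonical URI for uri, caching permanent redirects."""
        if uri in self._permanent:
            return self._permanent[uri]  # no request needed
        status, location = self._fetch(uri)
        if status in (301, 308) and location:  # both are permanent redirects
            self._permanent[uri] = location
            return location
        return uri


calls = []


def fake_fetch(uri):
    """Pretend server: the alias redirects permanently to the canonical URI."""
    calls.append(uri)
    if uri == LEXICAL:
        return 308, CANONICAL
    return 200, None


resolver = AliasResolver(fake_fetch)
first = resolver.resolve(LEXICAL)
second = resolver.resolve(LEXICAL)
print(first == second == CANONICAL, len(calls))
```

The second lookup is answered from the cache, which is the "never request the lexical URI again" behaviour.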

As @kcoyle points out, the best practice would be to use tools that display labels and definitions for humans while using the URIs in the data for machines, and ignore the lexical aliases entirely.
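
A display layer of the kind suggested here might look like the sketch below: the data keeps the canonical URI and a label is chosen per language at presentation time. The label texts and the fallback policy are illustrative assumptions, not actual RDA Registry content.

```python
# Sketch: pick a human-readable label for an opaque URI at display time.
# The labels below are made up for illustration.
LABELS = {
    "http://rdaregistry.info/Elements/m/P30137": {
        "en": "note on manifestation",
        "fr": "note sur la manifestation",
    },
}


def display_name(uri, lang, labels=LABELS, fallback="en"):
    """Return the label in lang, else in the fallback language, else the URI."""
    by_lang = labels.get(uri, {})
    return by_lang.get(lang) or by_lang.get(fallback) or uri


uri = "http://rdaregistry.info/Elements/m/P30137"
print(display_name(uri, "fr"))
print(display_name(uri, "de"))  # no German label, falls back to English
```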

@VladimirAlexiev

I think this is an understatement. I feel strongly that readable vocabulary URIs are an important feature of any ontology.

'readable' by whom?

In my experience, having to read turtle or jsonld (or god forbid RDF XML) in their "raw" form is part of the daily life of an ontologist or sem web developer. We want to read names, not (only) numbers!

True! Then read the labels in turtle or jsonld and ignore the aliases.

IMHO a big part of the "Marc must die" movement is due to the use of unreadable codes in Marc

I don't agree. At all.

Give me ONE example of a successful ontology that doesn't use readable URIs.

MARC21. Define 'successful'. Again, 'readable' by whom? That really is the key issue, and I'm not trying to be obtuse or argumentative.

Early variants of CIDOC CRM had numeric-only, but nobody used them, so now they use numeric & English.

Could you maybe provide a citation that supports that assertion?

I feel that if RDA uses numeric URIs, it will lose out to BibFrame, despite its modeling flaws.

RDA is far more ambitious than BibFrame and much less well-funded and if 'there can be only one' and the bib metadata world is divided into winners and losers, then I'd be surprised (but pleased) if RDA 'won'. I sincerely doubt that BibFrame will prove to be as broadly useful as RDA because its 'modeling flaws' are fairly fundamental. It really boils down to whether anyone wants to use a FRBR-based model (BibFrame and schema.org talk the FRBR talk, but don't walk the walk). I also don't personally think that MARC must die, or that MARC has anything at all to do with BibFrame, or that BibFrame is an adequate 'replacement' for MARC, or that the RDA rules can be used to create non-FRBR (non-RDA) metadata, but really life's far too short and everyone's limited time is too valuable for that discussion.

VladimirAlexiev commented 10 years ago

ignore the triviality of the potential of millions of essentially redundant URIs (it really is relatively trivial). Dereferencing a lexical alias through the RDA vocab server will return a 308 redirect to the canonical URI. Well-written clients should recognize a permanent redirect and never request the lexical URI again, internally substituting the canonical URI for the lexical.

If you think of several billion statements in a repository, that may change your perspective. Using equivalent properties or classes multiplies the number of statements (inferred or inferrable), and thus makes the job of reasoners and repositories considerably harder.

Language-specific versions of the ontologies will be available..contain canonical URIs and lexical aliases ...
All will include the same canonical URIs, along with labels and definitions in the selected languages.

Multilingual labels and definitions are good. But please DON’T define URIs in different languages. This will either lead to creation of semantically incompatible data, or make the work of reasoners and repositories harder.

Let's pretend for a moment that these attributes and concepts are labeled and defined in semantically consistent (that's important) Chinese, Korean, Pashto, Arabic, Hebrew, Danish, Spanish, French, Russian, Italian, etc. What would you use as a URI for these attributes and concepts?

Let's pretend English is not the dominant language in IT. Let's pretend there is a widely used computer language that's not based on English. Let's pretend there is a widely used computer library whose classes and methods are not based on English. Let's not.

I am by the way very tired of 'English in the URIs or I'll use something else'.

I'll use whatever the client requests, or whatever is agreed by a particular consortium. It'd just be much easier for me to use English instead of numeric URLs.

readable vocabulary URIs are an important feature of any ontology. 'readable' by whom? As @kcoyle points out, the best practice would be to use tools that display labels and definitions for humans while using the URIs in the data for machines, and ignore the lexical aliases entirely.

As a practicing ontologist/semantic developer, I read Turtle daily. (BTW, the best tools for converting to Turtle are rdf2rdf and rdfcat). I know other people who read JSON-LD daily. And I have to write SPARQL, lots of it, daily. I don’t know a SPARQL tool that will let me search & autocomplete using labels of properties, while YASGUI lets me do that with URLs of properties.

Maybe I'm too backwards in my usage of tools? But maybe I'm just more practicing than some of the other people here?

We want to read names, not (only) numbers! True! Then read the labels in turtle or jsonld and ignore the aliases.

Khmh? You mean using a tool that will fetch the ontology labels for me?

Give me ONE example of a successful ontology that doesn't use readable URIs. MARC21.

I thought that's an XML schema. I think one of the first papers discussing an RDF representation of MARC21 is http://events.linkeddata.org/ldow2008/papers/02-styles-ayers-semantic-marc.pdf In there they talk of STRAIGHT-FORWARD REPRESENTATION versus READABLE REPRESENTATION :-)

Define 'successful'.

Very good point. Guess I mean "used widely and with enjoyment" :-)

Early variants of CIDOC CRM had numeric-only, but nobody used them, so now they use numeric & English. Could you maybe provide a citation that supports that assertion?

http://cidoc-crm.org/official_release_cidoc.html

December 2010 version: http://cidoc-crm.org/rdfs/cidoc_crm_v5.0.2.rdfs Uses numeric URLs. AFAIK, there are NO datasets using this.
December 2011 version: http://cidoc-crm.org/rdfs/cidoc_crm_v5.0.4_english_label.rdfs Uses e.g. "E1.CRM_Entity", which is not Turtle 1.0 friendly (cannot use "." in local name).
December 2012 version: http://cidoc-crm.org/rdfs/cidoc_crm_v5.0.4_official_release.rdfs Uses e.g. "E1_CRM_Entity". Based on the same spec version 5.0.4. It took only a year to convince them that "." in local names is a bad idea. The above convention was adopted end of 2011 (from Erlangen CRM).
current version: http://cidoc-crm.org/rdfs/cidoc_crm_v5.1-draft-2014March.rdfs Uses e.g. "E1_CRM_Entity"
The largest CRM datasets (British Museum, Polish Digital Library) use Erlangen CRM, mostly because work on them started before Dec 2012. The slow reaction of the CRM SIG to change a measly "." to "_" has caused this choice of an alternative representation.
The Italian dati.culturaitalia.it uses Erlangen CRM with a dated namespace (an unfortunate choice).
A much smaller dataset (CLAROS) uses E1.CRM_Entity, but not the official namespace.
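
The "." versus "_" issue in the list above can be illustrated with a small check. The regular expression is a deliberately conservative approximation of what is safe in a prefixed name, not the full Turtle grammar, and both function names are hypothetical.

```python
import re

# Conservative approximation: letters, digits, "_" and "-" only.
# This is stricter than the real Turtle grammar, by design.
SIMPLE_LOCAL_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_-]*")


def is_simple_local_name(name):
    """True if name is safe to write after a prefix without escaping."""
    return SIMPLE_LOCAL_NAME.fullmatch(name) is not None


def underscore_local_name(name):
    """Replace '.' with '_' (the E1.CRM_Entity -> E1_CRM_Entity change)."""
    return name.replace(".", "_")


print(is_simple_local_name("E1.CRM_Entity"))   # False
print(underscore_local_name("E1.CRM_Entity"))  # E1_CRM_Entity
```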

Cheers!

jonphipps commented 10 years ago

@VladimirAlexiev Maybe we should back up just a bit...

Having URIs with no attachment to any single language is an RDA requirement.

I see many reasons, both practical and political why this is so, but realistically it's a decision that has been made by the RDA community and isn't going to be reconsidered. I suspect that other communities will come to a similar conclusion and I expect non-semantic, or non-lexical, or opaque URIs (whatever we wish to call them) will become ever more common. I expect this to be wildly unpopular with developers and that eventually tool developers will begin to support label-based code completion and other helpers to ease the aggravation. I hope that the architecture of data on the Global Web of Data won't continue to be defined by the limitations of the tools. I hope that effective knowledge representation isn't limited by simple popularity and developer comfort.

But at the moment, I completely agree with you. I'm old enough to remember when every single stored byte mattered and compact property and attribute encoding (like MARC, but less clever) was an absolute necessity, and a huge pain in the ass. I also am well aware of how difficult it is to write ad hoc SPARQL queries with URIs that make no sense, can't be remembered, and are useless for code completion. It's hard and debilitating to try to 'play' with RDF data encoded with opaque URIs. But the RDA ontology wasn't intentionally designed to support ad hoc SPARQL queries. It was designed (and not by me) for multicultural, multilingual, bibliographic knowledge transfer.

I'm also aware of the added complications of processing OWL-full entailed by the inclusion of a large number of property relations. But there's a tension that must be resolved between play and production, between low barriers of entry and ease of access by English-speaking developers and a broader, more global community. The lexical aliases are intended to be an entirely optional way of supporting play, exploration, experimentation. If they're ultimately used in production data and published to the open web, that's an easy enough thing to fix after-the-fact (or should be).
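
One way the after-the-fact fix might work is to rewrite alias predicates to their canonical form before publishing, as in this sketch. The mapping, the ex: subjects, and the function name are assumptions for illustration.

```python
# Sketch: rewrite lexical-alias predicates to canonical URIs in a dataset.
CANONICAL = "http://rdaregistry.info/Elements/m/P30137"
ALIAS_TO_CANONICAL = {"rdam:noteOnManifestation": CANONICAL}


def canonicalize(triples, alias_to_canonical):
    """Replace alias predicates and drop duplicates, keeping input order."""
    rewritten = ((s, alias_to_canonical.get(p, p), o) for s, p, o in triples)
    return list(dict.fromkeys(rewritten))


published = canonicalize(
    [
        ("ex:book1", "rdam:noteOnManifestation", "Signed by the author"),
        ("ex:book1", CANONICAL, "Signed by the author"),  # same fact, canonical form
        ("ex:book2", "rdam:noteOnManifestation", "Limited edition"),
    ],
    ALIAS_TO_CANONICAL,
)
print(len(published))  # 2
```

Normalizing the data this way avoids the need to materialize equivalent-property inferences at query time.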

There are lots of ways to see the careful, optional inclusion of the lexical aliases as an opportunity rather than a threat. For instance, I see a distinct similarity between opaque URIs with lexical aliases and 'opaque' IP addresses with domain names. And as I continue to repeat, we're happy to discuss alternatives to the lexical aliases that don't involve a single canonical URI that embeds a culture-specific, semi-semantic label in the URI. That's a non-starter.

kcoyle commented 10 years ago

A thought about rdfs:label:

If we do rely on labels to give us some human meaning shorthand while working with URIs, we have the advantage that labels can be in multiple languages, but we have the disadvantage that there is no requirement that labels be unique within a definable boundary.

This tells me that we may wish to make use of skos:prefLabel, of which the rules state that there can be only one per language.
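
The one-per-language rule could be checked with something like the sketch below, which groups preferred labels by resource and language and reports conflicts. The input shape (uri, lang, text) and the sample labels are assumptions.

```python
from collections import defaultdict


def duplicate_pref_labels(labels):
    """Return {(uri, lang): [texts]} for resources with more than one
    distinct preferred label in the same language.

    labels: iterable of (uri, lang, text) tuples (hypothetical input shape).
    """
    grouped = defaultdict(set)
    for uri, lang, text in labels:
        grouped[(uri, lang)].add(text)
    return {key: sorted(texts) for key, texts in grouped.items() if len(texts) > 1}


sample = [
    ("rdam:P30137", "en", "note on manifestation"),
    ("rdam:P30137", "en", "manifestation note"),  # conflicting English label
    ("rdam:P30137", "fr", "note sur la manifestation"),
]
conflicts = duplicate_pref_labels(sample)
print(conflicts)
```

Note that this checks uniqueness per resource only; it does not guarantee that two different resources never share the same label.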

VladimirAlexiev commented 10 years ago

Thanks everyone for the thoughtful discussion! I expected this outcome, but I still think documenting the reasons in this discussion is useful.

On @kcoyle's last suggestion: an additional problem with rdfs:label is that both skos:prefLabel and skos:altLabel infer it, so it becomes even more non-unique. But I think that using rdfs:label for property and class names is pretty much ingrained in defining ontologies, so let's leave it as is.
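
A small sketch of that inference: because skos:prefLabel and skos:altLabel are sub-properties of rdfs:label, a reasoner derives an rdfs:label for each of them. The triples and label texts here are invented for illustration.

```python
# Sketch: sub-property inference from SKOS label properties to rdfs:label.
LABEL_SUBPROPERTIES = {"skos:prefLabel", "skos:altLabel"}


def infer_rdfs_labels(triples):
    """Return the input triples plus the inferred rdfs:label statements."""
    inferred = set(triples)
    for s, p, o in triples:
        if p in LABEL_SUBPROPERTIES:
            inferred.add((s, "rdfs:label", o))
    return inferred


triples = [
    ("rdam:P30137", "skos:prefLabel", "note on manifestation"),
    ("rdam:P30137", "skos:altLabel", "manifestation note"),
]
rdfs_labels = sorted(
    o for s, p, o in infer_rdfs_labels(triples) if p == "rdfs:label"
)
print(rdfs_labels)  # one preferred label, but two rdfs:label values
```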

VladimirAlexiev commented 8 years ago

See what Robert Sanderson has to say about it (one of the few positive things that can be said about BibFrame :-) https://docs.google.com/document/d/1dIy-FgQsH67Ay0T0O0ulhyRiKjpf_I0AVQ9v8FLmPNo/edit#heading=h.wdueeer7z0xc

adamretter commented 4 years ago

So fast-forward 5 or 6 years, and I have rediscovered RDA via Matterhorn RDF.

Unfortunately I think even after all this time the points made by @VladimirAlexiev stand true.

We are currently evaluating the RDA Ontologies for use as part of our Catalogue Description at a National Archive. Whilst RDA looks very useful and should be adopted, unfortunately the canonical naming of the entities and properties is unusable by humans/developers. For us this is a serious consideration against using RDA.

Could someone, perhaps @kcoyle or @jonphipps , tell me if any progress has been made on this since 2014? I have seen the Lexical form of the URIs, but it is not clear to me if it is sensible/sane to actually use them in our RDF instead of the canonical properties. We value interoperability highly, can we get that with the lexical form?

It may be that RDA's design requirement of language-agnostic URIs doesn't match our requirements. Which is of course totally fine... but I didn't want to discount RDA without first enquiring on the latest status.

kcoyle commented 4 years ago

@adamretter I'm not aware of any changes in how RDA names its entities and properties. I will note, however, that Wikidata, which is highly popular at the moment (great potential, not yet mainstream in terms of use), also uses non-lexical identifiers similar to RDA, e.g. https://www.wikidata.org/wiki/Q12807. That community has accepted this type of naming so it might be interesting to understand how they came to this agreement and how people feel about working with it. I took a quick look at the discussion archives but couldn't find anything in the early discussions. It might be worth pinging some of the WD folks to get a picture of how folks like/dislike working with non-lexical identifiers. In terms of identifiers alone, the "mnemonic" v "opaque" is a long-standing discussion without a clear agreement (as we see here).

The Open Library developed a URL display that included the language label, and I think that was a display-only function - the label was not actually part of the URL, but included in browser display "on the fly":

https://openlibrary.org/works/OL15331606W/Il_nome_della_rosa
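
To show how such a URL separates the opaque identifier from the readable part, here is a small parsing sketch. It only splits the path; whether the server actually ignores the trailing label is kcoyle's recollection above, not something this code verifies. The function name is hypothetical.

```python
from urllib.parse import urlparse


def split_work_url(url):
    """Split a works URL into (identifier, readable_slug_or_None)."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    identifier = parts[1] if len(parts) > 1 else None
    slug = parts[2] if len(parts) > 2 else None
    return identifier, slug


identifier, slug = split_work_url(
    "https://openlibrary.org/works/OL15331606W/Il_nome_della_rosa"
)
print(identifier, slug)
```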

If needed, I know who could explain how that was done.

That doesn't make it easier for developers necessarily, because the ugly identifier is still there, but it does seem to provide some comfort level for the human user.

VladimirAlexiev commented 4 years ago

Wikidata has great support for the unreadable IDs in the SPARQL editor: autocompletion based on label and pop-up readout. There is also auto-generated "comment" text in SHEX. How that came to be: nobody asked :-) . But people use it because there is tool support and tons of data.
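
Label-based autocompletion of this kind can be approximated with a simple prefix search over a label index, as sketched below. This is not how the Wikidata editor is implemented; the index contents are illustrative, and the placeholder IDs P00000 and P00001 are invented.

```python
# Sketch: complete opaque IDs by typing the start of a label.
# Only P30137 appears in this thread; the other IDs are placeholders.
LABEL_INDEX = {
    "note on manifestation": "P30137",
    "note on instance": "P00000",
    "title": "P00001",
}


def complete(prefix, label_index=LABEL_INDEX, limit=5):
    """Return up to limit (label, id) pairs whose label starts with prefix."""
    prefix = prefix.casefold()
    hits = sorted(
        (label, ident)
        for label, ident in label_index.items()
        if label.casefold().startswith(prefix)
    )
    return hits[:limit]


suggestions = complete("Note")
print(suggestions)
```

The user types a readable prefix and the editor inserts the opaque ID, so queries stay language-neutral while remaining writable by hand.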

@kcoyle as for WD not being mainstream: it has 15-20M scientific articles. Can you point to an RDA repo of a similar size?

kcoyle commented 4 years ago

@VladimirAlexiev By mainstream I meant in use by folks who aren't primarily focused on WD. It isn't a question of numbers but of diffusion. An example of mainstream would be Google search. Or Wikipedia. I think it would be great if Wikidata became so usable that it became part of our everyday lives, but it isn't there yet. I do think, though, because it has an active hacker community, unlike RDA which is a closed system, that its solutions will spread. So here's to WD going mainstream!

VladimirAlexiev commented 4 years ago

@kcoyle We are largely in agreement, but:

Surely the Google KG uses Wikidata. When it shows you a KG info-card and links to Wikipedia, most of the data actually comes from Wikidata.
Wikidata is the most active and fastest-growing MediaWiki project, perhaps second only to English Wikipedia.