Literal example for dct:language

tombaker commented 6 years ago

The comment for http://purl.org/dc/terms/language currently reads:

Recommended best practice is to use a controlled vocabulary such as RFC
4646.

The ISO WG proposes two comments and some examples:

Recommended practice is to use a controlled vocabulary
such as ISO 639-2 or ISO 639-3. 

Best current practice 47 [BCP 47] should be used if the
metadata will be used mainly in the Internet.  

EXAMPLES    eng (ISO 639-2)
            en-US (BCP 47)
            http://catalogue.bnf.fr/ark:/12148/cb119308987 (English language in Rameau)

tombaker commented 6 years ago

Unfortunately, the range of dct:language is already ambiguous:

its formal range is dct:LinguisticSystem (as also defined in 15836-2, Section 3.4) which implies an object, not a literal.
The usage comment recommends use of RFC 4646, which is a set of tags (i.e., string literals), not URIs.

In 2015 Bernard Vatant posted to dc-architecture the following points:

The LOV community wanted to recommend that languages be declared using URIs such as http://lexvo.org/id/iso639-3/eng or http://id.loc.gov/vocabulary/iso639-1/en but there was no clear standard for language URIs.
"Against all arguments that everything should be identified by a URI", he concluded that language tags are actually a better solution than URIs for identifying languages.

osma commented 6 years ago

Bernard Vatant ends his post (linked above) with this conclusion:

Note that the current language URIs do not provide URIs for localized languages, such as pt-BR or pt-PO or other tags extension for scripts like zh-Hant-HK for Traditional Chinese as used in Hong Kong. So in the current state of affairs, tags are standardized, well documented, widely used and implemented, and highly flexible.

I tend to agree. The current URI sources for languages are problematic - there are too many of them (Lexvo, Lingvoj, LOC, Rameau...), they are not necessarily being maintained, and they do not support variant tags such as localized languages and scripts (e.g. the ones mentioned above). In my view, DCTerms is about sharing metadata mainly on the Internet, so recommending BCP 47 (the successor of RFC 4646) language tags would be the most sensible choice here and also probably the most flexible system available.
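To illustrate what tags such as pt-BR or zh-Hant-HK carry beyond a bare language code, here is a minimal Python sketch that splits a tag into language, script and region subtags. The pattern is a deliberate simplification and an assumption of this example: it ignores variants, extensions, private-use and grandfathered tags, all of which the full BCP 47 grammar allows. The function name `split_tag` is hypothetical.

```python
import re

# Simplified pattern: language[-script][-region].
# The full BCP 47 grammar (variants, extensions, private use) is NOT covered.
_TAG = re.compile(
    r"(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
)

def split_tag(tag):
    """Return a dict of subtags, or None if the tag does not fit the simplified pattern."""
    m = _TAG.fullmatch(tag)
    if m is None:
        return None
    parts = {k: v for k, v in m.groupdict().items() if v is not None}
    # Conventional casing: language lower case, script title case, region upper case.
    parts["language"] = parts["language"].lower()
    if "script" in parts:
        parts["script"] = parts["script"].title()
    if "region" in parts:
        parts["region"] = parts["region"].upper()
    return parts

print(split_tag("zh-Hant-HK"))
print(split_tag("pt-BR"))
print(split_tag("French"))
```

A language name such as "French" does not fit the pattern and yields None, which is one cheap way to tell tags apart from free-text names.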

aisaac commented 6 years ago

I still like the idea of encouraging use of resources like the Language NAL from the European Publications Office: http://publications.europa.eu/mdr/authority/language/index.html

osma commented 6 years ago

@aisaac Oh, another language code list? This is starting to resemble xkcd 927 :)

The funny thing here is that pretty much all of these are based on the ISO 639 code lists (well, except the MARC code list for languages which predates them), but everyone coins their own URIs and probably uses an ISO 639 snapshot from a slightly different point in time and has a different update frequency (or none at all).

That doesn't mean we couldn't encourage their use, it's just that I have a hard time seeing any advantages in using URI sets (especially many different ones) for languages when a perfectly good string based coding scheme exists (BCP 47), and those strings may be used to look up additional information from resources like Lexvo when necessary.

osma commented 6 years ago

Notes from examining LOD-a-lot w.r.t usage of dc:language and dct:language properties:

For dc:language, 99.9% of values are literals, most often (apparently) ISO 639-1 codes such as en and ru, but sometimes also en-us or (curiously) "en"@en or "it"@it, i.e. a literal language tag that claims to be in the language it represents! For IRI values, the most common namespaces seem to be MusicBrainz and Lexvo.
For dct:language, 99.7% of values are IRIs. Most common namespaces are LOC ISO639-2 codes and Lexvo. For literal values, the most common are (apparently) ISO 639-1 codes as plain literals. I did find one with a dct:RFC4646 datatype in my small sample though.

I also checked Openlink, it's quite different.

For dc:language, 87.6% of values are literals.
For dct:language, 60.3% of values are literals and 39.7% are IRIs.
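A tally of this kind could be sketched roughly as follows in Python. This is not the procedure actually used on LOD-a-lot or Openlink; the sample values are made up for illustration, and the classification assumes values written in N-Triples-like syntax (IRIs in angle brackets, literals starting with a double quote).

```python
from collections import Counter

def kind(value):
    """Crude classification of an object value written in N-Triples-like syntax."""
    if value.startswith("<") and value.endswith(">"):
        return "IRI"
    if value.startswith('"'):
        return "literal"
    return "other"

# Hypothetical sample values, NOT taken from LOD-a-lot or Openlink.
sample = [
    '"en"',
    '"ru"',
    '"en-us"',
    '"en"@en',
    "<http://lexvo.org/id/iso639-3/eng>",
]

counts = Counter(kind(v) for v in sample)
total = sum(counts.values())
for name, n in counts.most_common():
    print(f"{name}: {n / total:.1%}")
```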

Some conclusions:

People seem to prefer dc:language with literal values and dct:language with IRIs. This is especially clear in the LOD-a-lot data.
Conventions vary quite a lot.
With language tags expressed as literal values, plain literals are most often used, which means it's impossible to know for sure what coding system was used.

I'd like to avoid pushing people to use the old dc:language (DC Elements 1.1) property because they "only" have a literal language tag (which in some cases is more expressive than any existing IRI) and encourage use of dct:language. I will try to formulate a suggestion that encompasses both the use of language entities from a controlled vocabulary and sane use of language tags (especially BCP 47) as literal values.

osma commented 6 years ago

PLEASE VOTE THUMBS UP / DOWN

My proposal, taking into account the points above:

Recommended best practice is to use either a non-literal value representing
a language from a controlled vocabulary such as ISO 639-2 or ISO 639-3, 
or a literal value consisting of an IETF Best Current Practice 47 [BCP 47] language tag.

EXAMPLES      http://id.loc.gov/vocabulary/iso639-2/eng
                (English language in the Library of Congress vocabulary of ISO 639-2 language codes)
              http://lexvo.org/id/iso639-3/spa
                (Spanish language in the Lexvo vocabulary of ISO 639-3 language codes)
              http://catalogue.bnf.fr/ark:/12148/cb119308987 (English language in Rameau)
              "en-US"      (BCP 47 tag for United States English)
              "zh-Hant-HK" (BCP 47 tag for Hong Kong Chinese in Traditional script)

osma commented 6 years ago

The proposal above ignores the issue of data types. DCTerms defines the data types dct:ISO639-2, dct:ISO639-3, dct:RFC3066, dct:RFC4646 and dct:RFC5646 which, in principle, could be used to express which coding system was used in a literal language tag. They appear to be little used based on my experience and the sampling of large data sets above.

RFC5646 obsoletes RFC4646 (which obsoleted RFC3066) but newer RFCs should be backwards compatible with the older ones. BCP47 is a pointer to RFC5646 currently but in the future may be changed to point to a newer (presumably backwards compatible) RFC. I see little point in expressing a specific RFC using a data type on a language tag because of the backward compatibility guarantees (both explicit and implicit). BCP47 is ubiquitous in web technology (used in e.g. HTML and XML).

For ISO 639-X codes, I think using an IRI to refer to an entity from a controlled vocabulary (e.g. LOC or Lexvo) corresponding to that code is a better alternative than a literal language tag with a data type such as dct:ISO639-2.
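As a rough sketch of that alternative, the following Python turns a three-letter code into an IRI using the LOC and Lexvo URI patterns quoted earlier in this thread. The function name `code_to_iri` is hypothetical, and the check is purely syntactic: it does not verify that the code is actually assigned in ISO 639-2 or ISO 639-3, nor that the resulting IRI resolves.

```python
# Namespace prefixes as they appear in the examples in this thread.
LOC_ISO639_2 = "http://id.loc.gov/vocabulary/iso639-2/"
LEXVO_ISO639_3 = "http://lexvo.org/id/iso639-3/"

def code_to_iri(code, base):
    """Build an IRI from a three-letter code; syntactic check only."""
    code = code.strip().lower()
    if not (len(code) == 3 and code.isascii() and code.isalpha()):
        raise ValueError(f"not a three-letter code: {code!r}")
    return base + code

print(code_to_iri("eng", LOC_ISO639_2))
print(code_to_iri("SPA", LEXVO_ISO639_3))
```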

aisaac commented 6 years ago

I voted -1 not because I think the form is bad (it is good) but because I think it sends a message that only codes are accepted, or vocabularies derived from or based on codes. The examples suggested by ISO include an example of a controlled vocabulary that is not code-focused (or code-derived): http://catalogue.bnf.fr/ark:/12148/cb119308987 (English language in Rameau). I think this message should be preserved, as it is friendly to a range of scenarios that port legacy data to Linked Data, and it clearly cannot be said to be bad practice.

I don't think the message needs to be included in the main sentence ("Recommended best practice [...]"): keeping one representative example should be fine for users to understand this is allowed.

kcoyle commented 6 years ago

+1 to not sending people to dc1.1 because of ranges.

+1 to giving one example of a language name (in a language)

?1 to giving an example of a language name + "@xx" ("French"@en)

osma commented 6 years ago

@aisaac Good point. I don't oppose using Rameau as an example. I edited the proposal, since only three people (including myself) have voted so far - I hope that's fine with you @kcoyle !

Regarding @kcoyle's suggestion of using a language name such as "French" or "French"@en as an example, I'm somewhat uncomfortable with that. I feel that if we suggest "French" as a valid value for dct:language, then we should also accept e.g. "allemand", "sveitsinsaksa", "võro", "läti", "finlandssvenska", "Арабскый" and many others. I'd hate to be the one who has to aggregate metadata that uses names of languages like this, even though I personally happen to know what these are. Lexvo alone lists some 72k names of languages, or nearly 90k if you include variant names. I know there is a tension between machine readability and ease of creating metadata, but it's not that difficult to use a code list or other vocabulary of languages, or if you really only have the literal name, look up the corresponding entity from an authoritative list.
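The lookup mentioned at the end could look something like the Python sketch below. The three-entry table is hand-made for illustration only; a real table would have to be derived from an authoritative source such as Lexvo or the ISO 639-3 code tables, and the function name `name_to_iri` is hypothetical.

```python
# Tiny hand-made table for illustration only (assumption of this sketch).
# Keys are case-folded language names; values are ISO 639-3 codes.
NAME_TO_ISO639_3 = {
    "french": "fra",
    "allemand": "deu",  # French name for German
    "läti": "lav",      # Estonian name for Latvian
}

def name_to_iri(name):
    """Map a free-text language name to a Lexvo IRI, or None if unknown."""
    code = NAME_TO_ISO639_3.get(name.strip().casefold())
    if code is None:
        return None
    return "http://lexvo.org/id/iso639-3/" + code

print(name_to_iri("French"))
print(name_to_iri("allemand"))
print(name_to_iri("Klingon"))
```

Names missing from the table come back as None, so an aggregator can flag them for manual review instead of guessing.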

kcoyle commented 6 years ago

@osma @aisaac I may have misinterpreted Antoine's "I think it sends a message that only codes are accepted, or vocabularies derived/based on codes." I thought he was suggesting natural language versions. I'm fine with limiting to known lists, and the property has a range of LinguisticSystem so it should be an IRI. If we loosen the ranges it could be a plain literal. But I'm not opposed to NOT adding a plain language example - I just thought that's what had been suggested. I'm sure someone will use them in their data, but we don't have to encourage it.

aisaac commented 6 years ago

Hmmm, I think Europeana is one of these unfortunate organizations which have to handle values of dct:language expressed as literals, actually :-/ We are going to try to normalize them soon, by using ISO codes and also the EC publication NAL that I've mentioned before. As this is the data we're getting, and we can't require providers with limited resources to change their data, we have to accept it as legit. But of course I can only encourage a recommended best practice NOT to mention them ;-)

tombaker commented 6 years ago

I upvoted this because it is close enough. However, we should perhaps be consistent as to whether comments refer to "non-literal values" (here) or more specifically to URIs (see #30).

I'm thinking we should refer to URIs because RDF users would already know that a blank node would suffice, whereas users not familiar with RDF might find "non-literal value" a bit confusing. I do realize that deciding for "URI" would require us to look back at other places where "non-literal values" are mentioned...

kcoyle commented 6 years ago

I agree that "non-literal value", as it is used here, is not a widely shared definition. If we mean URI we should say URI.

osma commented 6 years ago

@tombaker There is another issue #41 opened by @jneubert about using "URI" vs. "non-literal value" consistently. Of course this text can be amended accordingly if we can decide on that first.

@aisaac I added the Rameau example, is that enough for you to change your vote to +1?

juhahakala commented 6 years ago

I like the tone of Osma's proposal, but would like to modify it into:

"Recommended best practice is to use a non-literal value representing a language from ISO 639-2 or ISO 639-3 or an ISO 639-based language code from IETF Best Current Practice (BCP) 47".

ISO 639 examples should take the user to the sites hosted by the ISO 639-2 and ISO 639-3 maintenance agencies (the Library of Congress and SIL). For Spanish, that would mean

http://id.loc.gov/vocabulary/iso639-2/spa
https://iso639-3.sil.org/code/spa

The BCP 47 language code for Spanish as spoken in Spain is es-ES, consisting of a two-letter code from ISO 639-1 and a country code. There are many other options, such as es-MX and es-AR, which are not included in ISO 639 (Spanish is not regarded as a macrolanguage).

Allowing both ISO 639 and BCP 47 is a bit confusing but unavoidable. But I don't think that any other options should be discussed / offered. And since we have well maintained code lists, using language names should be discouraged.

osma commented 6 years ago

OK, so @juhahakala suggests that only ISO 639-2 and ISO 639-3 codes (expressed as either LOC or SIL URIs? or plain literals?) and BCP 47 language tags should be offered, while @aisaac suggested (in https://github.com/dcmi/usage/issues/22#issuecomment-401047844) that e.g. Rameau languages should be accepted (and I added a Rameau language example). We obviously cannot follow both suggestions at the same time, so do we have to have a separate vote on this?

juhahakala commented 6 years ago

The ISO draft has been edited so that it uses Osma's version, with some minor changes (the ISO 639-3 link points to the web site of the standard's maintenance organization, SIL; the term United States English has been replaced with American English to accommodate English-speaking Canadians as well).

Usage of other controlled vocabularies such as Rameau (which is included as an example) is OK. Parsing the data in dct:language may be non-trivial if a large number of controlled vocabularies are used, but it is probably unrealistic to assume that people would use ISO 639 and BCP 47 only.

Note 1 to entry: Recommended practice is to use either a non-literal value representing a language from a controlled vocabulary such as ISO 639-2 or ISO 639-3, or a literal value consisting of an IETF Best Current Practice 47 [BCP 47] language tag.

EXAMPLES      http://id.loc.gov/vocabulary/iso639-2/eng
                (English language in the Library of Congress vocabulary of ISO 639-2 languages)
              http://iso639-3.sil.org/code/eng
                (English language in the SIL International vocabulary of ISO 639-3 languages)
              en-US (BCP 47 tag for American English)
              zh-Hant-HK (BCP 47 tag for Hong Kong Chinese in Traditional script)
              http://catalogue.bnf.fr/ark:/12148/cb119308987 (English language in Rameau)

tombaker commented 6 years ago

It looks to me like we are only disagreeing on details of the usage recommendation, which we can handle in usage sections in the DCMI terms documentation. Closing for now.

aisaac commented 6 years ago

@osma I've upvoted your new proposal, sorry for the delay.

I think in the end @juhahakala 's comment on (and re-use of) @osma 's version means we all agree.

But I would like to re-open the issue, just to make sure we just have one version we all vote for. If @juhahakala changed something, even a reference, I don't see why we wouldn't enshrine this, rather than coming back to it after we've forgotten the discussion.

tombaker commented 6 years ago

APPROVED

The ISO draft currently says:

Note 1 to entry: Recommended practice is to use either a non-literal value
representing a language from a controlled vocabulary such as ISO 639-2 or ISO
639-3, or a literal value consisting of an IETF Best Current Practice
47 [BCP 47] language tag.

EXAMPLES  http://id.loc.gov/vocabulary/iso639-2/eng
    (English language in the Library of Congress vocabulary of ISO 639-2 languages)
      http://iso639-3.sil.org/code/eng
    (English language in the SIL International vocabulary of ISO 639-3 languages)
      en-US (BCP 47 tag for American English)
      zh-Hant-HK (BCP 47 tag for Hong Kong Chinese in Traditional script)
      http://catalogue.bnf.fr/ark:/12148/cb119308987 (English language in Rameau)
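To make the two recommended forms concrete, the Python sketch below writes one dct:language statement of each kind as an N-Triples line: one with a non-literal (IRI) value and one with a plain literal BCP 47 tag. The subject IRI http://example.org/book/1 and the helper name `language_triple` are hypothetical; the property IRI and the values come from the text above.

```python
DCT_LANGUAGE = "http://purl.org/dc/terms/language"

def language_triple(subject, value, is_iri):
    """Serialize one dct:language statement as an N-Triples line."""
    if is_iri:
        obj = f"<{value}>"
    else:
        # Escape backslashes and double quotes inside the literal.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        obj = f'"{escaped}"'
    return f"<{subject}> <{DCT_LANGUAGE}> {obj} ."

# Hypothetical subject IRI, used only for illustration.
book = "http://example.org/book/1"

# Non-literal value: a language entity from a controlled vocabulary.
print(language_triple(book, "http://id.loc.gov/vocabulary/iso639-2/eng", is_iri=True))
# Literal value: a BCP 47 language tag.
print(language_triple(book, "en-US", is_iri=False))
```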

This follows Osma's proposal, which has been upvoted by Osma, Tom, Karen, and Antoine. A few more votes are needed for approval.

tombaker commented 6 years ago

Re-opened in order to close

dcmi / usage

Literal example for dct:language #22
