abdelazer / openpub

Automatically exported from code.google.com/p/openpub

Define the exact value-space of dc:language #35

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Unless you define the expected value-space of dc:language, you will get a mix
of values such as:

<dc:language>de</dc:language>
<dc:language>DEU</dc:language>
<dc:language>ger</dc:language>
<dc:language>Deutsch</dc:language>
<dc:language>German</dc:language>

And for multi-language publications you will get:

<dc:language>German, Englisch; Spanish / French</dc:language>

And of course many many more variations...

Original issue reported on code.google.com by siehea...@googlemail.com on 26 May 2010 at 8:57

GoogleCodeExporter commented 9 years ago
There is a recommended best practice in the DCMI spec:
"Comment: Recommended best practice is to use a controlled vocabulary such as
RFC 4646 [RFC4646]."
[http://dublincore.org/documents/dcmi-terms/#terms-language]

It might be more natural to follow this recommendation.

Original comment by zeta....@gmail.com on 26 May 2010 at 10:19

GoogleCodeExporter commented 9 years ago
Implementors working in this area need to be aware of a long-standing bug in
libxml2 which affects validation of many minority languages:

https://bugzilla.gnome.org/show_bug.cgi?id=606592

Original comment by syea...@gmail.com on 26 May 2010 at 10:32

GoogleCodeExporter commented 9 years ago
"recommended best practice is to use a controlled vocabulary such as RFC 
4646" means that people *may* use *some* controlled vocabulary, *for example*
RFC 4646 or one of its variations, or anything else they believe to be a
controlled vocabulary (if one catalog always used the same numbers to identify
languages, that would also be a controlled vocabulary). This does not help at
all: either you define something more specific, or clients must use heuristics
to guess the language(s). Furthermore, it is not clear whether multiple
languages are allowed in one element or whether you must repeat the dc:language
element. I bet you can do both, and that is how it will end up: everyone uses
their own little variation.

I propose to require that the value of dc:language conform to RFC 3066, which
can be checked against a regular grammar:

* LanguageID ::= Langcode ('-' Subcode)*
* Langcode ::= ISO639Code |  IanaCode |  UserCode
* ISO639Code ::= ([a-z] | [A-Z]) ([a-z] | [A-Z])
* IanaCode ::= ('i' | 'I') '-' ([a-z] | [A-Z])+
* UserCode ::= ('x' | 'X') '-' ([a-z] | [A-Z])+
* Subcode ::= ([a-z] | [A-Z])+
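
The grammar above can be transcribed into a regular expression fairly directly. The following is only a minimal sketch in Python; the function name `is_language_id` and the constant names are assumptions made for illustration, not part of any spec.

```python
import re

# Direct transcription of the productions listed above.
ISO639_CODE = r"[A-Za-z]{2}"        # ISO639Code: exactly two letters
IANA_CODE = r"[iI]-[A-Za-z]+"       # IanaCode: 'i-' followed by letters
USER_CODE = r"[xX]-[A-Za-z]+"       # UserCode: 'x-' followed by letters
SUBCODE = r"[A-Za-z]+"              # Subcode: one or more letters
LANGCODE = rf"(?:{ISO639_CODE}|{IANA_CODE}|{USER_CODE})"
LANGUAGE_ID = re.compile(rf"{LANGCODE}(?:-{SUBCODE})*")


def is_language_id(value: str) -> bool:
    """Return True if the whole string matches the LanguageID production."""
    return LANGUAGE_ID.fullmatch(value) is not None
```

With this pattern, `de` and `de-AT` match, while `DEU`, `ger`, `Deutsch` and the comma-separated list from the first comment do not.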

Original comment by siehea...@googlemail.com on 27 May 2010 at 7:28

GoogleCodeExporter commented 9 years ago
Sounds reasonable. We could offer similar advice for atom:category/dc:subject
(recommend using BISAC, LOC etc.), although it won't be mandatory in that case.

Original comment by hadrien....@gmail.com on 27 May 2010 at 7:31

GoogleCodeExporter commented 9 years ago
I agree with you on having a required controlled vocabulary.

As for the RFC number, why are you advising RFC 3066? The IETF website seems to
say that RFC 4646 obsoletes RFC 3066 (look at the header here:
http://www.ietf.org/rfc/rfc4646.txt).

You seem to be much more knowledgeable than me on the subject, so is there any
issue with going with RFC 4646?

Original comment by zeta....@gmail.com on 27 May 2010 at 7:48

GoogleCodeExporter commented 9 years ago
sieheauch's regexp appears to reproduce the libxml2 bug in excluding
three-letter language codes.

Several of the ePubs from http://www.nzetc.org/ contain fragments in languages
which have only three-letter codes (mainly Pasifika languages). These are
languages that people actually speak and people actually care about.
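
To illustrate the problem, here is a small Python sketch comparing a two-letter-only pattern (as in the grammar quoted earlier in this thread) with the tag syntax from the RFC 3066 ABNF, which allows a primary subtag of one to eight letters. The function names are hypothetical, and `haw` (the ISO 639-2 code for Hawaiian) is just an example picked for illustration, not taken from the NZETC files.

```python
import re

# Two-letter primary code only, as in the grammar quoted earlier in this thread.
TWO_LETTER = re.compile(r"[A-Za-z]{2}(?:-[A-Za-z]+)*")

# RFC 3066 ABNF: Language-Tag   = Primary-subtag *( "-" Subtag )
#                Primary-subtag = 1*8ALPHA
#                Subtag         = 1*8(ALPHA / DIGIT)
RFC3066_TAG = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*")


def accepted_by_two_letter_grammar(value: str) -> bool:
    """True if the value fits the two-letter-only pattern."""
    return TWO_LETTER.fullmatch(value) is not None


def is_rfc3066_tag(value: str) -> bool:
    """Syntax check only; it does not consult the ISO 639 or IANA registries."""
    return RFC3066_TAG.fullmatch(value) is not None
```

Here `haw` is rejected by the first pattern but accepted by the second. Note that a purely syntactic check still accepts values like `Deutsch`, so rejecting those would need a lookup against the actual code lists.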

Original comment by syea...@gmail.com on 27 May 2010 at 9:24

GoogleCodeExporter commented 9 years ago
What you really need is a way to advertise WHICH controlled vocabulary you are
using. For language, the openpub standard could insist upon RFC 4646, and that
would likely satisfy nearly everyone. For subject vocabularies, however, there
is unlikely to be one universal vocabulary meeting all needs, so it is really
going to be necessary to provide a way to advertise exactly which vocabulary
you are using in a machine-understandable way. This doesn't hurt for language
either, although OpenPub could "strongly recommend" one particular vocabulary
such as RFC 4646.

Unfortunately, dc:language and dc:subject don't give you an obvious, easy way
to do this; they might need to be 'extended', perhaps with an attribute
'vocabulary' that must contain a URI identifying a vocabulary. To guard against
the danger that people will choose different URIs for the same vocabulary, the
spec could provide recommended URIs for certain common vocabularies like
RFC 4646.
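
As a rough sketch of what a client could do with such an attribute, the Python below reads each dc:language value together with its declared vocabulary. Everything specific here is an assumption: the `vocabulary` attribute is only the idea floated above, the namespace URI bound to `dc` is a guess at what a catalog would use, and the RFC 4646 URL is merely borrowed from earlier in this thread as a stand-in identifier.

```python
import xml.etree.ElementTree as ET

# Assumed namespace for the dc prefix; a real catalog may bind it differently.
DC_NS = "http://purl.org/dc/terms/"
# Hypothetical identifier for the RFC 4646 vocabulary; no spec defines this.
RFC4646_URI = "http://www.ietf.org/rfc/rfc4646.txt"

SAMPLE = f"""<entry xmlns:dc="{DC_NS}">
  <dc:language vocabulary="{RFC4646_URI}">de</dc:language>
  <dc:language>Deutsch</dc:language>
</entry>"""


def languages_by_vocabulary(xml_text: str) -> list:
    """Return (vocabulary URI or None, value) for each dc:language element."""
    root = ET.fromstring(xml_text)
    return [
        (element.get("vocabulary"), (element.text or "").strip())
        for element in root.iter(f"{{{DC_NS}}}language")
    ]
```

A client could then validate values strictly when it recognises the vocabulary URI and fall back to heuristics only when the attribute is missing.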

I'm not sure exactly what it takes to legally extend a dc:* element with an
attribute; this has always confused me.

Original comment by rochk...@jhu.edu on 27 May 2010 at 10:59

GoogleCodeExporter commented 9 years ago
For atom:category we can rely on the scheme attribute. For dc:language...
xsi:type?

Original comment by hadrien....@gmail.com on 27 May 2010 at 11:16

GoogleCodeExporter commented 9 years ago

Original comment by hadrien....@gmail.com on 15 Jul 2010 at 4:26