agiorguk / gemini

Resources relating to the UK Gemini metadata profile

Current GEMINI encoding of dataset language and metadata language is not valid in INSPIRE #68

Open Sgaff opened 3 years ago

Sgaff commented 3 years ago

Hi,

The current guidance on the GEMINI pages for metadata language and for dataset language states that the codelist string that users should quote for the ISO language codes is

http://www.loc.gov/standards/iso639-2/php/code_list.php

However, if you attempt to run a full XML file through the INSPIRE validator with this encoding in it, it fails on the language element. After some playing around, and looking in inspire-tg-metadata-iso19139-2.0.1.pdf, I identified the problem INSPIRE had as being the presence of the /php/code_list.php part of the string.

I re-built my XML so the language portion was as follows

English

and the INSPIRE validator accepted this without issue. I propose that we change our guidance on the website to reflect this subtle difference, and wonder whether this means that we need to change our Schematron checks?

Cheers
Sean

nmtoken commented 3 years ago

Schematron only checks there is a 3 letter language code in the codeListValue; no check of codelist URI.
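
For illustration, the behaviour described above (a three-letter code in codeListValue, with the codeList URI ignored) could be mimicked roughly as follows. This is a minimal Python sketch, not the GEMINI Schematron rule itself: the function name `has_three_letter_code`, the sample XML fragment, and the lowercase-only pattern are all assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

# Namespace of the ISO 19139 gmd schema.
GMD = "http://www.isotc211.org/2005/gmd"

# Hypothetical sample fragment; the codeList URI is the one quoted in the GEMINI guidance.
SAMPLE = f"""
<gmd:language xmlns:gmd="{GMD}">
  <gmd:LanguageCode
      codeList="http://www.loc.gov/standards/iso639-2/php/code_list.php"
      codeListValue="eng">English</gmd:LanguageCode>
</gmd:language>
"""


def has_three_letter_code(xml_text: str) -> bool:
    """Return True if gmd:LanguageCode carries a three-letter codeListValue.

    The codeList attribute is deliberately not inspected, mirroring the
    behaviour described in the comment above (assumed, not copied from
    the real Schematron).
    """
    root = ET.fromstring(xml_text.strip())
    code = root.find(f"{{{GMD}}}LanguageCode")
    if code is None:
        return False
    value = code.get("codeListValue", "")
    # Assumption: codes are lowercase, as in the ISO 639-2 list.
    return re.fullmatch(r"[a-z]{3}", value) is not None


print(has_three_letter_code(SAMPLE))
```

Under this sketch, changing the codeList URI makes no difference to the result, which is why the proposed change would only affect the website guidance.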

Sgaff commented 3 years ago

That's good then from the point of view of the change, as it would purely be edits in the website.

nmtoken commented 3 years ago

If you go to page http://www.loc.gov/standards/iso639-2/ you can see that there is a link to ISO 639-2 Code List from it, so http://www.loc.gov/standards/iso639-2/ is not a link to the code list, and if the INSPIRE validator expects it, then that must surely be an error.

Valid URLs to the code list are:

A link to the code eng in the code list is:

https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?code_ID=130

Basically, I think that the INSPIRE validator is wrong for rejecting URLs with php in them.

Sgaff commented 3 years ago

Can we feed this back to INSPIRE then so they can do a corrigendum? As the https://www.loc.gov/standards/iso639-2/ is the example encoding in the TG and that would need to be changed as well.

Sean

Sgaff commented 3 years ago

I'll also take the view, based on James' comments, that the GEMINI interpretation is correct and will leave it this way for the imminent MEDIN release.

PeterParslow commented 3 years ago

@Sgaff : could you raise it as a new issue against the INSPIRE TG, at https://github.com/INSPIRE-MIF/technical-guidelines/issues? Or if it's more an issue with their validator than their text, then raise it at: https://github.com/INSPIRE-MIF/helpdesk-validator

And then close it here.

PeterParslow commented 2 years ago

Related issue / pull request at INSPIRE MIF: https://github.com/inspire-eu-validation/metadata/pull/175.

Note: this is the validator making the current implementation more tolerant, but not taking into account James' view here that they should be targeting something that returns a value.

archaeogeek commented 2 months ago

@PeterParslow to check whether the suggestion above (to link to the actual value in the codelist) is still valid for inspire, and potentially raise it as an issue with them

PeterParslow commented 2 months ago

Sean is (unsurprisingly) correct that http://www.loc.gov/standards/iso639-2/ is not

Metadata language GEMINI Guidance 1, "It is recommended to select a value from a controlled vocabulary, for example that provided by ISO 639-2 which uses three-letter primary tags with optional subtags.", is rather softer than INSPIRE TG Requirement C.5: metadata/2.0/req/common/metadata-language-code

Similarly, Dataset language (INSPIRE "Resource language") GEMINI Guidance 1. "A code should be selected from ISO 639-2, which uses three-letter primary tags with optional subtags – see http://www.loc.gov/standards/iso639-2/php/code_list.php" is softer than TG Requirement 1.6: metadata/2.0/req/datasets-and-series/resource-language

(wording below is from INSPIRE TG Requirement C.5; 1.6 is almost identical) "The language of the provided metadata content shall be given. It shall be encoded using gmd:MD_Metadata/gmd:language/gmd:LanguageCode element. The attribute codeListValue shall contain one of the three-letter language codes of the ISO 639-2/B code list. The attribute codeList shall be either http://www.loc.gov/standards/iso639-2/ or http://id.loc.gov/vocabulary/iso639-2.

Only the code values for the languages of the Community[19] shall be used.

The multiplicity of this element is 1."
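
To make the two attribute constraints in that requirement concrete, here is a rough Python sketch of a check against them. It is not the INSPIRE validator's logic: the helper names `language_fragment` and `language_code_problems` are hypothetical, the lowercase pattern is an assumption, and the sketch does not test membership of the ISO 639-2/B list or the "languages of the Community" restriction.

```python
import re
import xml.etree.ElementTree as ET

# Namespace of the ISO 19139 gmd schema.
GMD = "http://www.isotc211.org/2005/gmd"

# The two codeList values named in the quoted requirement.
ALLOWED_CODELISTS = {
    "http://www.loc.gov/standards/iso639-2/",
    "http://id.loc.gov/vocabulary/iso639-2",
}


def language_fragment(code_list: str, value: str, label: str = "English") -> str:
    """Build a gmd:language fragment for testing (hypothetical helper)."""
    return (
        f'<gmd:language xmlns:gmd="{GMD}">'
        f'<gmd:LanguageCode codeList="{code_list}" codeListValue="{value}">'
        f"{label}</gmd:LanguageCode></gmd:language>"
    )


def language_code_problems(xml_text: str) -> list:
    """Return a list of problems found in a gmd:language fragment."""
    root = ET.fromstring(xml_text.strip())
    code = root.find(f"{{{GMD}}}LanguageCode")
    if code is None:
        return ["gmd:LanguageCode element is missing"]
    problems = []
    if code.get("codeList") not in ALLOWED_CODELISTS:
        problems.append("codeList is not one of the two permitted URIs")
    # Assumption: three lowercase letters; list membership is not checked.
    if not re.fullmatch(r"[a-z]{3}", code.get("codeListValue", "")):
        problems.append("codeListValue is not a three-letter code")
    return problems


# The GEMINI-recommended URI fails this sketch; the shorter URI passes.
print(language_code_problems(language_fragment(
    "http://www.loc.gov/standards/iso639-2/php/code_list.php", "eng")))
print(language_code_problems(language_fragment(
    "http://www.loc.gov/standards/iso639-2/", "eng")))
```

The exact-match comparison on codeList reflects a strict reading of the requirement text; a more tolerant validator would presumably relax that comparison.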

Historically, this was because GEMINI allowed itself to be used for records that were not INSPIRE compliant, for example in Welsh - not one of the "languages of the Community" but represented by an ISO 639 code (sadly, two!).

We could harden our Guidance 1 to require ISO 639-2 three-letter codes. This would technically be a breaking change, but may not impact any instances (would need to check).

James is correct (Oct 6, 2021) that http://www.loc.gov/standards/iso639-2/ is not a link to the code list; it's a link to a kind of landing page about the code list. https://www.loc.gov/standards/iso639-2/php/code_list.php links to the ISO 639-2 code list. But INSPIRE requires http://www.loc.gov/standards/iso639-2/ or http://id.loc.gov/vocabulary/iso639-2. That second one does actually redirect to the ISO 639-2 code list, albeit a rather different visual representation of it!

My opinion: direct links to individual codes (e.g. https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?code_ID=130) would be useful, but should not be the value of the codeList attribute, which after all is a URL for the list, not the individual value. I don't see anywhere in an ISO 19139 code list to put a direct link to the individual value. Nor would that be valid for INSPIRE.

What would be valid for INSPIRE, and would at least give a link that goes more directly to the code list, would be to require/recommend http://id.loc.gov/vocabulary/iso639-2 - but I don't know if anyone has been using that so far!

Personal thought: no further GEMINI change; we've got the INSPIRE validator "softened" / corrected