SAA-SDT / EAD3

https://www.loc.gov/ead/index.html
Creative Commons Zero v1.0 Universal

Schematron bug, related to ISO 15511 regular expression pattern #549

Open fordmadox opened 2 years ago

fordmadox commented 2 years ago

While testing the new schematron file for EAC 2.0, I noticed that the regex borrowed from the EAD3 schematron has a small bug. For example, the following value is valid according to the EAD3 schematron:

US-oclc-12345678901

However, that is a fake 19-character code, which should NOT be valid. That same 19-character code is, correctly, not valid in EAD2002 nor in EAC 1.0.

I am going to recreate that pattern for EAC 2.0 by following, essentially, the EAD2002 model, which does validate the country code, when present. Since we are validating the country code elsewhere, it seems like we should do that here, as well, rather than just using a two-character match pattern for that. Anyhow, here's the current EAD3 regex:

(^([A-Z]{2})|([a-zA-Z]{1})|([a-zA-Z]{3,4}))(-[a-zA-Z0-9:/-]{1,11})$

Whereas that should probably be (though NOT tested):

^(([A-Z]{2})|([a-zA-Z]{1})|([a-zA-Z]{3,4}))(-[a-zA-Z0-9:/-]{1,11})$

To decide:

Should we:

  1. update the regex as is, so that over-long codes such as the 19-character example above can no longer validate (the max length is 16 characters)?
  2. update the regex to ensure that a country code, when present, is also valid (as was done with EAD2002, and will be done in the new approach)?
  3. ignore this bug altogether (outside of documenting it) since it likely does not impact anyone at all?

Another example: right now, the following is also valid in EAD3:

XX-1

Whereas that same fake code is correctly not valid in EAD2002 (though it is in EAC-CPF 1.0, which switched to a pure regex validation).

fordmadox commented 2 years ago

Just to follow up, I tested reversing the first two characters of the current regex, and that does indeed fix the issue.
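
For anyone who wants to reproduce that check outside of Schematron, here is a minimal Python sketch. It assumes that Python's re.search behaves like the XPath matches() function for these two patterns (both succeed if any substring matches); the names CURRENT, PATCHED and is_valid are only illustrative and are not part of the schematron files.

```python
import re

# Both patterns are copied from this issue. In CURRENT the "^" sits inside
# the first group, so it only anchors the [A-Z]{2} alternative; the other
# two alternatives may start matching anywhere in the string.
CURRENT = r"(^([A-Z]{2})|([a-zA-Z]{1})|([a-zA-Z]{3,4}))(-[a-zA-Z0-9:/-]{1,11})$"
PATCHED = r"^(([A-Z]{2})|([a-zA-Z]{1})|([a-zA-Z]{3,4}))(-[a-zA-Z0-9:/-]{1,11})$"


def is_valid(pattern: str, code: str) -> bool:
    """Return True if the pattern matches anywhere in the code."""
    return re.search(pattern, code) is not None


if __name__ == "__main__":
    for code in ("US-oclc-12345678901", "US-oclc-1234", "XX-1"):
        print(code, is_valid(CURRENT, code), is_valid(PATCHED, code))
```

With the current pattern, the 19-character example passes because the unanchored [a-zA-Z]{3,4} alternative matches "oclc" in the middle of the string and the remaining "-12345678901" satisfies the suffix. Once the "^" is moved in front of the group, the whole value has to fit the 16-character shape, so the example is rejected. "XX-1" passes either way, since neither pattern checks the country code.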

fordmadox commented 2 years ago

See https://github.com/SAA-SDT/eas-schematrons/commit/afe49d184b3d25d041557eab109eb5f47e9d9f37 for the patch.

I'm still planning to update this in the new Schematron to use the country codes, however.

kerstarno commented 2 years ago

Hi @fordmadox,

I agree that we should restrict a repository/maintenance agency code that is declared to be ISO 15511 compliant to a maximum of 16 characters. However, XX actually is a valid country code, as it falls within the ranges that can be user-assigned. What it stands for might differ from one context to another, but against ISO 3166-1 it is valid.

"User-assigned codes - If users need code elements to represent country names not included in ISO 3166-1, the series of letters AA, QM to QZ, XA to XZ, and ZZ, and the series AAA to AAZ, QMA to QZZ, XAA to XZZ, and ZZA to ZZZ respectively, and the series of numbers 900 to 999 are available." (https://www.iso.org/glossary-for-iso-3166.html)

fordmadox commented 2 years ago

@kerstarno Regarding "XX" and ISO 3166, or any codes reserved for private use (e.g. 'qab' in ISO 639-2), I wonder if we should still flag those as invalid.

Given that there is no agreement about what those codes can represent, shouldn't we expect a user to record their usage within the control section, and also set the code list to "otherCountryEncoding"?

That country code (not to mention the numeric equivalents, and the 3-character user-assigned options) was never valid in EAD2002 nor EAD3... though any 2-character A-Z code could be used in the agency code heading in EAD3, which would make this type of error especially odd:

        <maintenanceagency countrycode="XX">
            <agencycode>XX-1</agencycode>
        </maintenanceagency>

Where the maintenanceagency element is invalid in EAD3, but the agencycode element is valid!

Quite the mixed message, there 😄

Also, it looks like the regular expression test in EAC 1.0 for country codes was limited to any 2-character or 4-character A-Z code.

Given all that, I do prefer EAD3's approach to the country code validation (not the ISIL one, though, due to the discrepancy highlighted above).

kerstarno commented 2 years ago

@fordmadox - I see your point about it not being clear what "XX" (or any other of these user-assigned codes) stands for specifically, but they are part of ISO 3166, so "otherCountryEncoding" would not necessarily be correct, I'd say.

Also, with the officially assigned codes we only check whether they are part of the ISO standard; we don't necessarily relate them to the appropriate country names, right? I mean, for validation, we don't really care whether "XX" stands for "Country A" or "Country B", do we?

Maybe there's a possibility to let these codes validate, but to flag them as user-assigned? Same as we discussed with regard to deprecated codes?
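
As a rough illustration of that idea (a Python sketch, not Schematron), the user-assigned alpha-2 ranges from the glossary quoted above can be generated and checked separately from the officially assigned codes. OFFICIAL_SAMPLE is a deliberately tiny, hypothetical stand-in for the full ISO 3166-1 list, and the three result labels are my own naming rather than anything defined in EAD3 or EAC.

```python
from string import ascii_uppercase

# ISO 3166-1 alpha-2 user-assigned codes, following the glossary quote:
# AA, QM to QZ, XA to XZ, and ZZ.
USER_ASSIGNED = (
    {"AA", "ZZ"}
    | {"Q" + c for c in ascii_uppercase if "M" <= c <= "Z"}
    | {"X" + c for c in ascii_uppercase}
)

# Hypothetical and incomplete: a real check would need the full list of
# officially assigned codes, so "invalid" results here prove nothing.
OFFICIAL_SAMPLE = {"DE", "FR", "GB", "US"}


def classify_country_code(code: str) -> str:
    """Classify a code as valid, valid-with-warning (user-assigned), or invalid."""
    if code in OFFICIAL_SAMPLE:
        return "valid"
    if code in USER_ASSIGNED:
        return "valid-with-warning"
    return "invalid"


if __name__ == "__main__":
    for code in ("US", "XX", "QQ"):
        print(code, classify_country_code(code))
```

With something like this, "XX" would pass validation but carry a flag, which is the same kind of treatment we discussed for deprecated codes.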