biocodellc / biocode-fims-commons

Biocode Field Information Management System

validURI characters #36

Open rodney757 opened 6 years ago

rodney757 commented 6 years ago

After doing some research on N2T and various browser tests, here are the characters that should be allowed in URIs. (I think we should constrain identifiers to an "allowed" set rather than looking for "excluded" characters.)

A-Z a-z 0-9 -_.:=+

Everything else is unpredictable in an identifier, and either conflicts with ARK or DOI rules or would interfere with REST-based URL parsing systems. We should also examine this set to see how it behaves with the current set of identifiers in the system.
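As a rough sketch of the allow-list approach described above (checking that every character is in the permitted set instead of searching for forbidden ones), something like the following could work. The names `ALLOWED_RE` and `is_valid_identifier` are hypothetical and not part of the existing codebase.

```python
import re

# Allow-list from the comment above: A-Z a-z 0-9 - _ . : = +
# ALLOWED_RE and is_valid_identifier are illustrative names only.
ALLOWED_RE = re.compile(r"[A-Za-z0-9\-_.:=+]+")


def is_valid_identifier(identifier: str) -> bool:
    """Return True only if the identifier is non-empty and every
    character belongs to the allow-list."""
    return ALLOWED_RE.fullmatch(identifier) is not None
```

Running a check like this over the identifiers already in the system would be one way to see how many fall outside the set.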

rodney757 commented 6 years ago

The following set, in addition to a-z, A-Z, and 0-9, works properly with N2T forwarding and does not interfere with ARK reserved meanings: ()_:=+

It turns out that - (dash) gets stripped by N2T (I didn't notice this before) and that () (parentheses) are actually OK.

These two characters also work in conjunction with N2T but will have mangled interpretation if ever turned into ARK identifiers: /.
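To make the revised rules concrete, here is a minimal sketch that sorts the characters of an identifier into "rejected" (outside the safe set) and "ARK-risky" (`/` and `.`, which forward through N2T but would be misread in an ARK). The names `SAFE`, `ARK_RISKY`, and `classify_identifier` are assumptions made for illustration.

```python
import string

# Safe set per the comment above: letters, digits, and ( ) _ : = +
# Note that the dash is deliberately absent, since N2T strips it.
SAFE = set(string.ascii_letters + string.digits + "()_:=+")

# Forward correctly through N2T, but mangled if treated as an ARK.
ARK_RISKY = set("/.")


def classify_identifier(identifier):
    """Return (rejected, risky): sorted lists of the distinct characters
    that are outside both sets, and of those in the ARK-risky set."""
    rejected = sorted({c for c in identifier
                       if c not in SAFE and c not in ARK_RISKY})
    risky = sorted({c for c in identifier if c in ARK_RISKY})
    return rejected, risky
```

For example, `classify_identifier("a-b/c.d")` reports the dash as rejected and both `.` and `/` as risky, while `classify_identifier("abc(1)_x")` returns two empty lists.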

As far as fixing previously accepted identifiers goes, the behaviour across N2T and any downstream REST services is pretty dicey and unpredictable when it comes to inserting encodings. Based on my tests using curl, it's far safer to use ONLY approved characters.

rodney757 commented 6 years ago

Here is an explanation from the EZID team about dashes. In short, if we want to use dashes in suffixes through the N2T resolver we can't expect them to come out unscathed.

_Yes, it's more or less intentional in the sense that ARKs are defined so that hyphens are "identity inert" (by analogy with phone numbers, 1-800-555-1212 should not be considered distinct from 18005551212).

I said "more or less" because I think normalization should be applied at end points (eg, your receiving resolvers) rather than imposed by intermediary resolvers (like n2t). The real reason it's happening in this case is that n2t works by first looking up the identifier verbatim, and failing to find it, it will then normalize according to the id type (eg, ARK if it begins with "ark:") and look it up again.

Failing that second lookup, n2t applies a number of tricks to figure out what to do with the id, but it applies them to the normalized identifier and to its normalized parts. Hyphens are never touched in query strings, but suffix parts to the left of a query string get normalized according to ARK scheme rules before the suffix passthrough trick is applied.

So, unfortunately, to get ark ids working with the hyphen in them, either your end resolvers would have to handle the hyphen-less forms (which I think is the best long term strategy) or you'd have to register each individual ark-with-hyphen in n2t._
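If we go with the first option (end resolvers handling the hyphen-less forms), the lookup key could be derived roughly as follows. This is a simplified sketch that models only the hyphen behaviour described in the quote: hyphens to the left of a query string are dropped and the query string is left alone. It is not a full implementation of ARK normalization, and `strip_suffix_hyphens` is a hypothetical helper name.

```python
def strip_suffix_hyphens(identifier: str) -> str:
    """Drop hyphens from the part of the identifier before any '?',
    leaving the query string (if present) untouched."""
    path, sep, query = identifier.partition("?")
    return path.replace("-", "") + sep + query


# Hypothetical usage: compare both sides in hyphen-less form, so that
# an identifier stored with hyphens still matches what N2T forwards.
# The NAAN 12345 and the suffix below are made-up values.
stored = "ark:/12345/ab-cd-1"
incoming = "ark:/12345/abcd1"
matches = strip_suffix_hyphens(stored) == strip_suffix_hyphens(incoming)
```

Normalizing both the stored and the incoming identifier means existing records would not need to be rewritten, only the comparison.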