TEIC / TEI

The Text Encoding Initiative Guidelines
https://www.tei-c.org

listPrefixDef and prefixDef – problems with wording of prose and usage #2001

Open dariok opened 4 years ago

dariok commented 4 years ago

Reading @lb42 ’s question about prefixDef on TEI-L, I remembered a lengthy discussion about this and found that “prefixDef” is actually somewhat of a misnomer which, in that discussion, led to quite a bit of confusion. The other party in the discussion took the term “prefixDef” to mean that it designated a namespace prefix.

Of course, we are not talking about a prefix but about a locally defined, private URI scheme. The GL, though, are less specific, which created the confusion: section 16.2.3 initially uses the term “private URI scheme” or simply “scheme”, but the short definitions of the two elements introduce the term “prefixing scheme” – possibly trying to reflect prefixDef – and the prose section mixes in “prefix” quite frequently.

The definitions and prose should not include the word “prefix” when referring to the scheme, as RFC 3986 does not use the term “prefix” in a well-defined manner. Hence, we should stick to what is well-defined in the RFC, which is the term “scheme”. It should be made explicit that in the GL’s example of "psn:fred", “psn” is a URI scheme and “fred” a URI path.

While I don’t suggest we rename prefixDef/listPrefixDef (maybe keep that for P6 😉), we should adjust the wording of their short definition to use “private URI scheme” instead of “prefixing scheme”.

--

Another problem comes up when looking at an expansion of the example I gave on TEI-L, which was of the form

gnd:1234567

which may have 2 expansions, one for a (project- or repository-)local resolver and one for the defining authority’s endpoint (in this example, Gemeinsame Normdatei by the German national library, so the expanded URL would be https://d-nb.de/gnd/1234567).
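To make the two expansions concrete, here is a minimal sketch in Python of how a processor might apply two definitions that share the same identifier. The dictionaries stand in for prefixDef elements with their ident, matchPattern and replacementPattern attributes; the function name `expand`, the digit-only match pattern and the local target `persons.xml#gnd_$1` are assumptions made up for illustration, not anything taken from the GL or an existing tool.

```python
import re

# Hypothetical stand-ins for two <prefixDef> elements that share ident="gnd".
# The first replacement pattern mirrors the URL mentioned in the text; the
# second (a local persons.xml file) is invented for illustration.
PREFIX_DEFS = [
    {"ident": "gnd", "matchPattern": r"(\d+)",
     "replacementPattern": "https://d-nb.de/gnd/$1"},
    {"ident": "gnd", "matchPattern": r"(\d+)",
     "replacementPattern": "persons.xml#gnd_$1"},
]


def expand(pointer, prefix_defs=PREFIX_DEFS):
    """Return every expansion of `pointer` offered by matching definitions."""
    ident, colon, rest = pointer.partition(":")
    if not colon:
        return []  # no colon, so this is not a prefixed pointer
    expansions = []
    for definition in prefix_defs:
        if definition["ident"] != ident:
            continue
        match = re.fullmatch(definition["matchPattern"], rest)
        if match is None:
            continue
        # Substitute $1, $2, ... with the corresponding captured groups.
        expanded = re.sub(
            r"\$(\d)",
            lambda ref: match.group(int(ref.group(1))),
            definition["replacementPattern"],
        )
        expansions.append(expanded)
    return expansions


print(expand("gnd:1234567"))
```

Which of the returned expansions a processor should prefer is a policy decision that this sketch deliberately leaves open.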

Now this is easy to resolve but has one logic problem: “gnd”, strictly speaking, is not a scheme but actually an authority for which there is a dedicated place in a URI:

per://gnd/1234567

per being the (private) scheme denoting the type of entity; gnd being the authority; 1234567 being the path, here the identifier within the authority.

This follows RFC 3986 even more closely than a simple “gnd:1234567” would, as “gnd” is indeed the authority defining the identifier given in the path component. But it has a downside, too, which is parseability into the 2 different expansions (one being a local expansion that returns information based on the string “gnd:1234567”, the other being the authority’s URL).

Expansion 1 is easy to achieve from both short and long versions, but the second, generic expansion will not be possible: The prefixDef mechanism treats everything after the (first) colon as being the same thing, while RFC 3986 actually sees 2 different components here. I don’t think we urgently need to change anything here but we should mention this in the GL.
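The difference between the two views can be shown with a short Python sketch. It uses the regular expression given in RFC 3986, Appendix B to split a URI into components, and compares that with a plain split at the first colon, which is how the text above describes the prefix mechanism. The function names are invented for this illustration.

```python
import re

# Regular expression from RFC 3986, Appendix B.
URI_PARTS = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def rfc3986_split(uri):
    """Split a URI reference into (scheme, authority, path)."""
    m = URI_PARTS.match(uri)  # always matches: every part is optional
    return m.group(2), m.group(4), m.group(5)


def first_colon_split(uri):
    """Split the way a prefix-based lookup does: prefix, then the rest."""
    prefix, _, rest = uri.partition(":")
    return prefix, rest


for uri in ("psn:fred", "gnd:1234567", "per://gnd/1234567"):
    print(uri, rfc3986_split(uri), first_colon_split(uri))
```

For the short forms both views agree on a scheme and a path with no authority. For `per://gnd/1234567` the RFC split yields the authority `gnd` and the path `/1234567` separately, while the first-colon split only sees the single string `//gnd/1234567`.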

--

At this point, however, we come back to @lb42 ’s question of providing some form of central registry or place of dealing with private URIs. RFC 3986 (page 19f.) actually allows for URI schemes to require a specific way of resolving (the registered name making up the host part of) the authority component of the URI. So there actually is a basis for providing a lookup though that might include some work to enable the prefixDef mechanism to distinguish between the authority and path components and actually hint at resolution.
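As a rough idea of what an authority-aware lookup could look like, here is a small Python sketch. The registry, its keying by (scheme, authority) pair and the `resolve` function are purely hypothetical; nothing like this exists in the prefixDef mechanism today. Only the d-nb.de URL is taken from the example above.

```python
from urllib.parse import urlsplit

# Hypothetical registry: (private scheme, authority) -> URL template.
REGISTRY = {
    ("per", "gnd"): "https://d-nb.de/gnd/{id}",
}


def resolve(uri, registry=REGISTRY):
    """Resolve a private URI of the form scheme://authority/identifier."""
    parts = urlsplit(uri)
    template = registry.get((parts.scheme, parts.netloc))
    if template is None:
        return None  # unknown scheme/authority pair
    # The path carries the identifier within the authority.
    return template.format(id=parts.path.lstrip("/"))


print(resolve("per://gnd/1234567"))
```

Because the authority is a separate key, the same private scheme could point at several authorities without any change to the pointer syntax.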

--

Sorry for this long one but I'm sure we'll have a fruitful discussion possibly leading to a few improvements in the GL.

martindholmes commented 4 years ago

Hi Dario,

Mea culpa there, since I was largely responsible for these elements. But the RFC for URI schemes explicitly uses the term "prefix" when talking about private URI schemes:

https://tools.ietf.org/html/rfc7595

We could rename the elements pusDef and listPusDef, but those are pretty nasty-sounding names. :-)

Cheers, Martin

Humanities Computing and Media Centre University of Victoria mholmes@uvic.ca

jamescummings commented 4 years ago

It doesn't seem like renaming elements is needed (certainly not those names!) but it could maybe be clarified slightly in the prose.

martindholmes commented 4 years ago

On the issue of a central registry, the RFC specifies that you should use your domain name, reversed, as the prefix; so we would have:

ca.uvic.mapoflondon.prs

as the prefix for "person" in the Map of London project. That completely undermines the value of specifying a short prefix to simplify mapping within your project, so I think it's pointless. The idea is that we specify our prefixDefs inside our projects, and we resolve them in any version of a document intended for use outside that context.
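For comparison, a tiny Python sketch of how such a reversed-domain prefix is built. The domain `mapoflondon.uvic.ca` is inferred here by reversing the prefix quoted above, and the function name is invented for this illustration.

```python
def reversed_domain_prefix(domain, local_name):
    """Build a scheme name from a reversed domain plus a local label."""
    labels = domain.lower().split(".")
    return ".".join(reversed(labels)) + "." + local_name


# Domain inferred from the prefix quoted in the comment above.
print(reversed_domain_prefix("mapoflondon.uvic.ca", "prs"))
```

A pointer then reads `ca.uvic.mapoflondon.prs:` followed by the identifier rather than `prs:` followed by the identifier, which is the loss of brevity described above.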

I would like to be able to provide, though, a simple pointer from somewhere in the header of a TEI document to a central file in which all manner of such things could be found. It would make truly mechanical processing feasible. Something like:

<teiHeader metadataLocation="./../../centralMetadata.xml">

And then all your prefixDefs and other such things go in there. Any processor would know to look there for anything that it can't find in the current file.
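A processor following that idea might behave roughly like the Python sketch below: look for prefixDef elements in the current document first, and only if none are found follow the pointer in the header. Note that `metadataLocation` is the attribute proposed in this comment, not an existing TEI attribute; the sample documents, the `FILES` dictionary standing in for the file system, and the replacement pattern are all invented for illustration.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Invented central metadata file holding the shared definitions.
CENTRAL = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <encodingDesc>
      <listPrefixDef>
        <prefixDef ident="psn" matchPattern="(.+)"
                   replacementPattern="persons.xml#$1"/>
      </listPrefixDef>
    </encodingDesc>
  </teiHeader>
</TEI>"""

# Invented document that only points at the central file.
DOCUMENT = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader metadataLocation="./../../centralMetadata.xml"/>
</TEI>"""

# Stand-in for the file system, so that the sketch is self-contained.
FILES = {"./../../centralMetadata.xml": CENTRAL}


def prefix_defs(xml_text, load=FILES.get):
    """Return local prefixDef elements, else those from the central file."""
    root = ET.fromstring(xml_text)
    found = root.findall(f".//{TEI}prefixDef")
    if found:
        return found  # local definitions take precedence
    header = root.find(f"{TEI}teiHeader")
    location = header.get("metadataLocation") if header is not None else None
    central = load(location) if location else None
    if central is None:
        return []
    return ET.fromstring(central).findall(f".//{TEI}prefixDef")


print([d.get("ident") for d in prefix_defs(DOCUMENT)])
```

The fallback is only consulted when the current file has no definitions of its own, so documents that carry their own listPrefixDef are unaffected.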

lb42 commented 4 years ago

If your project is organized as a TEI corpus, with a teiCorpus element wrapping all its constituent components, the logical place for listPrefixDef (and much else besides) is the outermost teiHeader, aka "the corpus header". I realise that not everyone is comfortable with this hierarchic structure, but that's how the TEI was originally designed, and for those who are comfortable with the notion, it works well. On the naming question, Dario makes a good argument for not thinking of the prefix as anything other than a (hm) prefix, with no particularly webby semantics. It's unfortunate that the word is also appropriated for namespace labelling, but that's life: like causes beget like effects. We need a way of specifying a shortcut label both for namespaces and for URLs.

martindholmes commented 4 years ago

@lb42 I'm talking about projects with many thousands of TEI files; it's not really practical to put the entirety of such a project inside a single teiCorpus element.

When I think a little more about the other issue, I wonder if it's really meaningful to say that these are private URI schemes in the true sense of the word; they make use of the prefixing syntax to create things which are valid URIs, and can therefore be used in pointing elements, but they're really not that different from magic keys.

lb42 commented 4 years ago

@martindholmes, I agree with the second point, but less with the first one.

martindholmes commented 4 years ago

@lb42 Would you suggest maintaining a single teiCorpus file with 8,000 XIncludes, which has to be expanded and processed as a whole in order to resolve the prefixDefs? That seems pretty inconvenient.

lb42 commented 4 years ago

It depends what you want to do in your processing, doesn't it? If you want to make sure that the xml:id values are unique across the whole corpus, you will need to process the whole thing. If you want to produce a display version of some subset of it, you won't. Expanding prefixDefs is probably more important for the second than the first task. So I would imagine a kind of environment in which you generate a driver file that xincludes the corpus header and the files you want to process on each occasion. Is it really convenient to require each file to point off to some separate universe from which the global metadata can be retrieved?

martindholmes commented 4 years ago

@lb42 You wouldn't require such a thing; you would only use it if you need it.

ebeshero commented 4 years ago

Council VF2F: Work on clarifying the existing language and review again.