TEIC / TEI

The Text Encoding Initiative Guidelines
https://www.tei-c.org

problems with content model and example of unicodeName #1804

Closed duncdrum closed 4 years ago

duncdrum commented 6 years ago

(unicode property name) contains the name of a registered Unicode normative or informative property.

This, together with the relevant section in the Guidelines, suggests that the element's content ought to be drawn from a closed list of values. Yet the content model of unicodeName is:

<content>
 <textNode/>
 <!--s:assert> some schematron rules here</s:assert-->
</content>

The upshot is that currently

<unicodeName>Stardate</unicodeName>
<value>24345.1</value>

validates fine, even though Stardate is not a Unicode property name.

Similarly, the example from the specs does not reference a valid character property name:

<unicodeName>character-decomposition-mapping</unicodeName>

is merely the section heading under which the property is explained in the Unicode Standard, not the property name itself.

According to the resources published by the Unicode Consortium, our example is not a specified property name or alias. The defined property name is:

<unicodeName>Decomposition_Mapping</unicodeName>

If we want folks to use only registered Unicode property names inside unicodeName, I suggest leveraging these resources to actually enforce constraints in our schema.
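To make this concrete, here is a minimal Schematron sketch. It assumes the UCD property aliases have first been converted into a small XML lookup file (the property-aliases.xml file name and its alias/@name structure are hypothetical), an xslt2 query binding, and the usual tei prefix:

<sch:rule context="tei:unicodeName">
 <!-- hypothetical lookup file derived from PropertyAliases.txt -->
 <sch:let name="registered"
   value="document('property-aliases.xml')//alias/@name"/>
 <sch:assert test="normalize-space(.) = $registered">
  The content of unicodeName should be a registered Unicode
  property name or alias, e.g. Decomposition_Mapping.
 </sch:assert>
</sch:rule>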

This would entail:

Alternatively, we could

Caveats:

martindholmes commented 6 years ago

If I understand this correctly, the validation process would have to check the Unicode version number, retrieve the set of properties allowed in that specific version, and validate the content against it.

I think there are significant implications for the scale of the processing and schema components here. In addition, there's potentially an issue with forward compatibility. As it stands now, I could create a schema from TEI 3.4 and, without ever rebuilding it against a more recent version of TEI, incorporate future Unicode properties from as-yet-unreleased versions of Unicode; that possibility would be lost if we fixed Unicode versions in every release of TEI.

I rather think this might be overkill for the TEI. It's rather like trying to police all the values of @xml:lang according to whatever is the current version of the IANA Language Subtag Registry; it's a moving target and it would inevitably get neglected. I would prefer to see example code (on the wiki, or perhaps even in the Guidelines as a remark) that shows how to add this sort of validation to your own ODD, something like the sketch below, or to your project diagnostics (as we do with the subtag registry in our Endings Diagnostics toolkit https://github.com/projectEndings/diagnostics/tree/master/utilities).
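For instance, a project-level ODD could attach something like the following constraintSpec to unicodeName (a sketch only: the three property names stand in for a full list, and the sch prefix is assumed to be bound to the Schematron namespace):

<constraintSpec ident="unicodeName-registered" scheme="schematron">
 <constraint>
  <sch:rule context="tei:unicodeName">
   <!-- illustrative subset; a real customisation would list all
        registered property names, or pull them from the UCD -->
   <sch:assert test="normalize-space(.) = ('Decomposition_Mapping',
     'General_Category', 'Script')">
    unicodeName should contain a registered Unicode property name.
   </sch:assert>
  </sch:rule>
 </constraint>
</constraintSpec>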

duncdrum commented 6 years ago

I think there are significant implications for the scale of the processing and schema components here.

Yup, working on it; I'd say 90% there.

As it stands now, I could create a schema from TEI 3.4 and, without ever rebuilding it against a more recent version of TEI, incorporate future Unicode properties from as-yet-unreleased versions of Unicode

True, but back on 3.4, your as-yet-unreleased Unicode properties should go into localName, not unicodeName, since they are not registered properties yet. If a new release of Unicode introduces a new property, which happens far more frequently than removals, you would be able to update your ODD to point to the new version if you so desire, either in your 3.4 customisation or as part of an upgrade to 3.5.

It's rather like trying to police all the values of @xml:lang according to whatever is the current version of the IANA Language Subtag Registry

I don't think the comparison with the IANA tags is a good one. Firstly, we do currently validate @xml:lang values, and I want to do the same for unicodeName. I'm not talking about matching and validating a charProp against the most recent character database entries, but about doing something similar to what happens for language tags, i.e. ensuring that en is actually a valid language subtag.
Browsers don't need to know what en refers to in order to display all elements carrying that tag. But text processors need to know what Stardate is to enable sorting or regex matching for my PUA entity refs.

Looking at the change history, I'd say the chances of property names being removed or changed are very low. Invalid TEI customisations would be the least of the worries on the list of things that would break in such a scenario. Property names can be marked as deprecated, which means that no new entries for that property will be added to the character database, but the name will still be part of the schema and will validate in subsequent TEI releases. We could add a single Schematron rule that issues a warning for this scenario.
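Something like this sketch (Hyphen and Grapheme_Link are real examples of property names Unicode has deprecated; prefixes as above):

<sch:rule context="tei:unicodeName">
 <!-- deprecated property names remain valid but trigger a warning -->
 <sch:report test="normalize-space(.) = ('Hyphen', 'Grapheme_Link')"
   role="warning">
  This Unicode property name is deprecated: it remains valid, but no
  new entries for it will be added to the character database.
 </sch:report>
</sch:rule>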

it's a moving target and it would inevitably get neglected

Adding the latest version of the files mentioned in the OP can be fully automated, so while it is a moving target, it's one we can keep up with easily.

The only property that keeps changing with every new Unicode release is the version number. As it stands now, @version on unicodeName is optional; I think we can manage that. If folks don't supply a version number, they'll just have to live with the fact that 3.5 now points to a newer version (again, a deprecated property name will not make their ODD invalid).

Lastly, without any constraints there really isn't any difference between unicodeName and localName. If tightening the reins, by however much, is not in the offing, I'd suggest dropping the element and changing the Guidelines to recommend <localName type="unicode"> instead.

duncdrum commented 6 years ago

So, as a question of procedure: the low-hanging fruit are the examples and Guidelines prose that use a Unicode property name that simply doesn't exist. The commit linked above fixes these without any functional changes. I can prepare a first PR with just these changes, provided it gets a timely merge into develop. Whatever the ultimate course of action regarding validation and constraints, this commit should form the basis of potential future edits. Alternatively, I keep this as a separate commit and pile on fixes and changes as we go along. @martinascholger @martindholmes which should it be?

martindholmes commented 5 years ago

@duncdrum I think the best approach initially is to create a list of instances in the Guidelines where <unicodeName> is used to tag something which is not a Unicode Name. Those are straight bugs, and should be fixed. If you have an automated process for discovering those, that would be really handy to have, and we might run it as part of the TEI test suites.

Secondly, I'd like to see an example of the Schematron that would do validation of the element content, to see how complicated it ends up being.

duncdrum commented 5 years ago

@martindholmes re bugs: I can create a PR to remove the bugs, but while at it I would strongly suggest also adopting Unicode's snake_case naming convention in all instances of <unicodeName>, as a better convention-over-configuration solution.

The test sort of depends on how open Council is to utilizing the Unicode Character Database's schema for validation. Always using the latest release is straightforward, but it does introduce the possibility that a TEI file fails to validate before a Unicode update and suddenly validates fine after one, without a new TEI release. (The other way around is not possible, because Unicode doesn't remove properties when it discontinues them.)

Alternatively, we can simply import the RNG into the Guidelines repo, which would avoid changes in validation results but requires additional maintenance work for each release.

martindholmes commented 5 years ago

@duncdrum I would really like to get the bugs fixed initially, since that's uncontroversial; the other stuff will require some detailed discussion by Council, and will certainly take a while.

ebeshero commented 5 years ago

Council discussed this face-to-face: we're okay with dependence on an external Unicode schema, if we can get it working!

duncdrum commented 5 years ago

So something else has come up: mapping. The recommendations in the Guidelines boil down to the following example:

 <mapping type="standard"> &#x4EBA; </mapping>
 <mapping type="PUA"> &#xE000; </mapping>

Now those character references are all fine and dandy, as long as you don't process TEI as part of a larger interchange pipeline, with perhaps multiple TEI files from different sources. XSLT and XQuery will expand character references that are within the scope of the document's character encoding (i.e. UTF-8). This means that when you, e.g., query your collection of TEI documents to see whether any two documents assign competing characters or glyphs to the same PUA codepoint, you see something like this:

[Screenshot (2019-07-12): query results in which each PUA mapping renders as tofu]

with tofu for each PUA. It is, in my view, not very practical to suppose that every XML editing environment has access to the special fonts used to render each document in its original context; and even that wouldn't solve the problem of multiple fonts and assignments across different documents. It's certainly not very human-readable.
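To make the expansion point concrete: even a bare identity transform (a minimal sketch, nothing project-specific) will serialize the &#xE000; from the example above as the literal PUA character in its UTF-8 output, because the parser has already expanded the reference before XSLT ever sees the tree:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="2.0">
 <xsl:output method="xml" encoding="UTF-8"/>
 <!-- copy every node and attribute unchanged -->
 <xsl:template match="@* | node()">
  <xsl:copy>
   <xsl:apply-templates select="@* | node()"/>
  </xsl:copy>
 </xsl:template>
</xsl:stylesheet>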

My suggestion is to change the Guidelines examples and to put the numerical codepoint into an attribute on mapping. This way, XML tools have easy access to the codepoint in question via built-in functions.

<mapping type="standard" cp="20154"> &#x4EBA; </mapping>
<mapping type="PUA" cp="57344"> &#xE000; </mapping>

My main concern is scale. If I have more than one document, each with hundreds of PUA characters, the tofu overkill makes the mapping in the original examples less useful than it could be. Technically speaking no information is lost, and we are all doing spec-compliant things, but I think this could be made more robust. Looking for alternative takes on this, though.
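One side benefit of the attribute: XML tools can cross-check it against the element content with built-in functions, e.g. this Schematron sketch (assuming the proposed @cp and a single character inside mapping):

<sch:rule context="tei:mapping[@cp]">
 <!-- string-to-codepoints (XPath 2.0) yields the decimal codepoint -->
 <sch:assert test="number(@cp) =
   string-to-codepoints(normalize-space(.))[1]">
  @cp should match the codepoint of the character inside mapping.
 </sch:assert>
</sch:rule>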

martindholmes commented 5 years ago

I may be misunderstanding, but if the intention is to output something that looks like the character entity reference, rather than something that is a character entity reference, wouldn't you just escape it? &amp;#x4EBA;

duncdrum commented 5 years ago

Well, I still consider putting an actual character entity reference into mapping to be good practice, for all the reasons and scenarios outlined in the Guidelines, i.e. in cases where other encodings are in play. But any UTF-8 XSLT or XQuery transform will expand them, so having a place for the numerical codepoint would be rather handy. Your example is basically the result of using CDATA for the content, which could be another mapping option in egXml.

martindholmes commented 5 years ago

You're right that there's a basic use/mention problem here. I like your solution with the @cp attribute, but we might want to consider whether the attribute name could simply be "codepoint" for more clarity. Would any other elements carry this attribute?

duncdrum commented 5 years ago

I'm OK with @codepoint. As far as charDecl elements go, mapping is the immediate and only fit for the new attribute, as far as I can tell. I could imagine uses on g directly, but for now I would prefer to be conservative and discuss those if anybody expresses a use case for them. I'll have to play with it a bit more with character clusters (which is the other documented usage in the Guidelines for character entities inside mapping). The new attribute would certainly make Phase 3 of the #1805 conversion script more robust, so that's another plus from my side. Without adding a special attribute, we should at least add another mapping to all examples that use character entities. So I see three options:

1

<mapping type="standard"> &#x4EBA; </mapping>
<mapping type="codepoint">20154</mapping>

2

<mapping type="standard" codepoint="20154"> &#x4EBA; </mapping>

3

 <mapping type="standard"> &#x4EBA; </mapping>
 <mapping type="cdata"><![CDATA[&#x4EBA;]]></mapping>

So far I like 2 the best.

martindholmes commented 5 years ago

This will take some discussion, I think. If we assume that the use/mention problem will be fundamental to the enterprise of documenting stuff like this (and bearing in mind similar issues like standardized variation sequences), we should make sure the solution is solid and extensible. I think I like #2 best, but other people should definitely weigh in here. #2 seems cleanest in the sense that it separates the entity in the text node from the numerical value in the attribute, and ensures that there's no possibility that the latter is processed, while leaving the option for the former to be rendered correctly if the rendering system is up to it.

duncdrum commented 5 years ago

Yes. So, to facilitate the discussion, suppose I have 10-100 PUA characters mapped according to the current Guidelines in my document. Now my boss wants me to change <persName> into <name type="person">, and I use XSLT/XQuery to do so. The result will be, 10-100 times over:

<mapping type="PUA"></mapping>

The documentation prose for ideographic variants has already been changed in #1901, so I don't think that is affected by this. Other variation sequences aren't in the Guidelines yet; maybe they should be?

martindholmes commented 5 years ago

The question really is: is there any meaningful difference between a character entity reference and a character? If not, then it makes no difference which appears, and there's nothing wrong with your tofu (except that you would have to go slightly out of your way if you happened to want to know the codepoint of a tofu char). If there is a fundamental difference, then XML processors surely shouldn't convert them without permission. But your @codepoint solves the problem, I think. And we could make it mandatory.
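In ODD terms that could look roughly like this (a sketch; teidata.count, a non-negative integer, is just one plausible datatype choice):

<attDef ident="codepoint" usage="req">
 <desc>supplies the decimal codepoint of the character documented by
   the parent mapping</desc>
 <datatype><dataRef key="teidata.count"/></datatype>
</attDef>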

duncdrum commented 5 years ago

Funnily enough, it depends who you ask. XPath should not expand references; XSLT and XQuery should. So to the former they are not identical, but they are to the latter two. Go figure.

Because XQuery expands predefined entity references and character references and XPath does not, […]

XQuery specs; XML specs

martindholmes commented 5 years ago

I must admit I find this: https://www.w3.org/TR/2008/REC-xml-20081126/#entproc utterly incomprehensible. My vague understanding has always been that an XML processor (as opposed to XPath, XSLT or XQuery) would normally expand numeric entity references on parsing, and therefore that by the time they reached XPath/XSLT/XQuery they would be regular codepoints, but I guess that's just ignorant and simplistic.

duncdrum commented 5 years ago

Yup, experiencing a certain sense of naiveté seems to be par for the course.

duncdrum commented 5 years ago

re @codepoint: looks like I'm not alone in wishing it existed (not-my-files). This also demonstrates the problem of scale: this is 122 characters from just one document; now imagine having to deal with 10, 50, 100, … documents from different sources.