RNG schema generation uses semantically unsound method of removing dangling references from content models

cmsmcq commented 7 years ago

The generation of RelaxNG schemas from ODD documents removes from content models all references to undefined elements, without appropriate consideration of the context. The result is potentially a semantically inappropriate schema. It appears that the incorrect behavior appears for elements for which the ODD contains an elementSpec with mode="delete", elements omitted from the include attribute on a moduleRef, and elements listed in the exclude attribute on a moduleRef.

A lengthy discussion of the topic, intermingled with some other topics, can be found in issue 1589 on the TEI repository. I'm opening this issue because the error in the generation of RNG schemas (and thus in all other schemas which are generated from the RNG schemas) and the error in the text of P5 can be fixed independently.

The discussion of the other issue suggests that some confusions and doubts should be addressed. Those who find the description above sufficient need not read further.

Assumptions about the meaning of element suppression

The key assumption in this bug report is that when an ODD suppresses an element E, the schema defined by the ODD should accept as valid all documents valid against the original schema which do not contain any instances of E. Some participants in the discussion of issue 1589 have apparently had some trouble accepting this as the correct behavior; some have suggested privately that the current behavior of simply removing all references to E without regard for context has an equal claim to correctness. The following paragraphs attempt to address this suggestion.

Two conflicting views of the semantics of element suppression

First, consider the choice between the two views offered on the meaning of mode="delete".

One view defines the meaning in terms of the meaning of the schema, as a change to the language to be defined. It does not prescribe the specific changes to be made to the text of the schema; any set of changes which has the required semantic effect is a correct implementation of the semantics. There are several different ways to operationalize the change, including one which is very simple to implement.
The other view defines the meaning of mode="delete" operationally, in terms of changes to the text of the schema. What is its effect on the language defined by the schema? As the examples below will show, the change implemented by odd2relax.xsl are not guaranteed to restrict the language; some but not all of them will extend the language. I have not yet found a reliable characterization of the effect of the current behavior of the Relax NG generator in terms of the language defined, and none has thus far been offered.

I think an operator defined in terms of its semantics and easily implementable is likely to be more useful than an operator without a clear semantics, defined solely in terms of textual operations. So it seems to me that the meaning of element suppression in ODDs should be that of the first view identified above.

The motivating use case (the Mylonas principle)

Second, consider the basic use case. Facilities for suppressing elements of the TEI vocabulary were introduced in part at the insistence of projects who did not wish to make use of certain elements defined by TEI. When they were told "That's fine, you don't have to", they responded "That's not enough. It's not just that we don't mark personal names. Any occurrence of a person element in our texts is an error, and we want the schema to catch the error. That's part of what schemas are for." The goal, that is, was not to accept as valid any documents which would be invalid against the vanilla TEI grammar, but to enforce all of vanilla TEI's criteria of validity, while adding the criterion that certain elements should be absent. I think of this as the Mylonas Principle, since it was Elli Mylonas who successfully argued that the customization mechanisms should allow such restrictions of vanilla TEI.

That is, the motivating use case for including element suppression among the defined mechanisms for TEI customization favors the first, not the second, view of the required effect.

Functional compatibility of ODDs with P3 mechanisms

Third, consider the history of TEI customization mechanisms. In TEI P3, suppression of elements led to the omission of declarations for the elements concerned; no changes were undertaken to content models in which those elements were referred to. The resulting DTD is conforming both under the rules of ISO 8879 and those of XML, and the meaning of the resulting schema is consistent with the first of the two views above but not with the second. The meaning of DTDs with dangling references is parallel to the interpretation of context-free grammars containing undefined non-terminals.

The introduction of ODD documents (originally part of the publication system developed for TEI P2 and P3) as a replacement for the DTD-based customization mechanisms developed by the Metalanguage Committee for TEI P3 was (as I understand it) intended to make the customization mechanism schema-language neutral (or at least polyglot) and to replicate and improve on the mechanisms of P3. If a change in the meaning of the element-suppression operator was consciously intended, it will have been called out as a significant change, and discussed, and justified as an improvement. I haven't heard of any such discussion; I infer that no conscious change of semantics was intended.

Examples

Cases with the same results

Note that in some cases, the two views of the element-suppression operator produce the same results. Let us assume that both element E and element F are to be deleted.

A content model reading (a | b | E) will be rewritten (a | b).
A content model reading (a | E | F) will be written (a).
Similarly, (a, E?, b) will be rewritten (a, b).
(E?) and (E*) will be rewritten empty.

Cases with different results

In other cases, the two views will produce different results.

For ( (a, E, b) | p+ ), the first view will produce (p+), and the second ((a, b) | p+).
The first view preserves the invariant of the original model, that if a is used, then E and b are also used; the second violates the invariant.
For (E), the first view will produce notAllowed, the second empty.
The first view again preserves the invariant that every valid instance of the parent contains exactly one instance of E -- since our target schema allows no instances of E, there can be no valid instances of the parent. The second view discards the invariant and allows empty instances of the parent.
For (E | F), the first view will produce notAllowed, the second empty.
For (a | (E | F)), the first view produces (a | notAllowed) and then after simplification (a). The second view produces (a | empty) and then after simplification (a?) Note that (a | E | F) and (a | (E | F)) recognize the same language, but the second view produces different results for them. This suggests it may be challenging to find a coherent semantic description of its effect on the language being recognized.

[Examples copy-edited 27 February, 10:30 Mountain Time.]

hcayless commented 7 years ago

I have two concerns here:

The first may only be theoretical. Will we ever run into problems with cross-module dependencies? I.e. module A contains an element x whose content model makes a direct (not via a class) reference to an element y in module B. An ODD user creates an ODD that drops module B. What happens to the content model of element x? The current reference-dropping mechanism would treat this in the same way as if element y were deleted, but I think the proposed one wouldn't cope with it, because from the perspective of the compiled ODD, the element definition would actually be gone rather than simply disallowed. I'm not sure if this actually happens, and if it does, the solution might be to have a more formal way to handle dependencies—if module A depends on module B, then it should be an error to have your ODD include module A but drop module B.
This one's not theoretical. The proposed change risks breaking ODDs that are currently perfectly legitimate (even if @cmsmcq and @lb42 would argue they are incorrect). For example, if I take TEI All and drop the <teiHeader> from it, I can generate a Relax NG schema that lets me have

<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    ...
  </text>
</TEI>

If we make this change to the processor and I do nothing but attempt to regenerate a Relax NG schema with the updated processor, I will get a (from my perspective) broken schema that doesn't validate anything, because the content model of <TEI> is now unsatisfiable. So if we do this, I think we will have to either figure out how to politely transition to the new functionality (have some sort of deprecation period, and/or add the ability to switch between old and new behaviors), or properly warn our header-removing user that they will need to go modify the content model of <TEI> or create their own "TEI without the header" element in a different namespace.

If a change in the meaning of the element-suppression operator was consciously intended, it will have been called out as a significant change, and discussed, and justified as an improvement. I haven't heard of any such discussion; I infer that no conscious change of semantics was intended.

I'm not especially confident in this inference. But it happened well before my time. @lb42, can you shed any light on whether the current behavior is the result of a mistake or was a deliberate change in policy?

lb42 commented 7 years ago

Cross module dependency doesnt seem to me at all like a real issue. Modules are just macros. You can define an ODD with no reference to modules at all. I think we have two cases where an element in one module refers to one in another, but that shouldn't cause any problem. Depends on what your @source says.
If you want a schema which allows you to drop the header, make a version of TEI all in which <text> is one of the possible root elements. But do you have evidence of people wanting to make such a crazy modification who wouldn't be satisfied by that ?

My belief, on absolutely no evidence as I am currently engaged in teaching an ODD workshop en francais, is that the current behaviour is an unintended consequence of changes to the stylesheets. Finding out at what point in time the particular change concerned happened is a theoretically feasible but I would say not very useful exercise.

cmsmcq commented 7 years ago

The first may only be theoretical. Will we ever run into problems with cross-module dependencies? I.e. module A contains an element x whose content model makes a direct (not via a class) reference to an element y in module B.

The current code treats these in the same way as an explicit element spec with mode="delete": odd2odd.xsl emits an ODD in which elements (and classes, and data specifications, and ...) excluded by the user's odd in whatever way are not present. The odd2relax.xsl stylesheet generates define elements for the elements (and classes, and ...) present in its input, and then removes references to undefined patterns.

Other things being equal, I think the correct fix would be to continue to treat the various ways of excluding or not including an element as equivalent; this was not my first thought (as reflected in a comment on bug TEI 1589, but further examination of the source code and Lou's reassurance about potential interference with other idioms have led me to the belief that the simplest fix is simply to change the way odd2relax.xsl handles references to undefined patterns.

The proposed change risks breaking ODDs that are currently perfectly legitimate (even if @cmsmcq and @lb42 would argue they are incorrect).

From where I'm sitting, the change is not breaking a legitimate ODD but making the system produce the schema actually defined by the ODD instead of a different schema. Users who wish to change the content model of TEI to (tei_text) can continue to do so by redefining the content model.

As a matter of interest, are you aware of ODDs used in production systems of projects that eliminate teiHeader without redefining the content model of TEI or teiCorpus (or other mandatory or mandatory-in-context elements, without redefining the content model of the parent), or is the case a theoretical one (in the sense that since the vast majority of ODD files used in production system are not known to any single individual or group, there is no way of telling how many might be doing this, or what their maintainers thought the semantics were supposed to be)? Given what I read as the conflict between the documentation of the ODD elements in chapter 22 and the behavior of the Relax NG generator, it's not hard to imagine that some users will assume that the software is correct. Others (like me) will have assumed that the documentation is by definition correct and binding. Those projects who use DTDs will of course not have complained, since the DTD generation code does not have the same semantic behavior as the RNG generation code.

cmsmcq commented 7 years ago

I think we will have to either figure out how to politely transition to the new functionality (have some sort of deprecation period, and/or add the ability to switch between old and new behaviors), or properly warn our header-removing user that they will need to go modify the content model of <TEI> or create their own "TEI without the header" element in a different namespace.

Yes; it would be desirable to inform the user that the selection of elements in the ODD produces a schema with some elements which are unsatisfiable but not excluded. That will frequently be something the user wishes to change. It is (I think) easy enough to make the odd2relax stylesheet emit such a warning; how complicated it might be to make the Roma interface convey that warning to the user I do not know.

hcayless commented 7 years ago

From where I'm sitting, the change is not breaking a legitimate ODD but making the system produce the schema actually defined by the ODD instead of a different schema. Users who wish to change the content model of TEI to (tei_text) can continue to do so by redefining the content model.

I understand that's how you see it, but as you say, we may have users who assume the software is currently working correctly. So while we should fix it, we do need to take some care.

I don't offhand know if we have any users who are doing things that Lou would call crazy. We may not.

@lb42 remarks

Cross module dependency doesnt seem to me at all like a real issue. Modules are just macros. You can define an ODD with no reference to modules at all. I think we have two cases where an element in one module refers to one in another, but that shouldn't cause any problem. Depends on what your @source says.

I'm confused by this. Am I wrong that the only way you get definitions for elements in a module in your compiled ODD is to reference that module? If I compile tei_lite, I don't find an elementSpec for <app> in the compiled ODD. So I'd have thought not referencing modules was another way to "delete" elements. You indicate that this does happen, albeit (thankfully) rarely. Maybe we can keep the step where dangling references are removed, since after the processor is corrected the only instances of those will be due to unreferenced modules?

cmsmcq commented 7 years ago

So I'd have thought not referencing modules was another way to "delete" elements.

I think that's true, with one slightly pedantic correction. One can use an elementRef to refer to an element (normally to include it), or an elementSpec to (re)define it, without referring to the module containing it. So if one wants to exclude element blort, it's not enough to refrain from referring to the module in which blort is defined; one must also refrain from including an elementRef or elementSpec for blort.

I understand Lou's remark about modules being macros to be referring to the fact that including <moduleRef key="linking"/> in an ODD has the same effect as, and can be regarded a shorthand for, including something like the following in the ODD.

<elementRef key="ab"/>
<elementRef key="alt"/>
<elementRef key="altGrp"/>
...
<elementRef key="when"/>

And <moduleRef key="linking" include="seg"/> is extensionally equivalent (if I understand chapter 22 correctly) to <elementRef key="seg"/>.

All the mechanisms mentioned so far:

referring to the containing module with an include attribute, but not naming element E in the value of that attribute
referring to the containing module with an exclude attribute, and naming element E in the value of that attribute
providing an elementSpec with mode="delete"
referring neither to the containing module nor to the specific element E

have (as far as I have thus far seen) the same effect in the output of odd2odd.xsl: no elementSpec for E is included. If there is any way to distinguish the four mechanisms indicated (and any others that may yet occur to us, that have the same effect) in the output of odd2odd.xsl, I have not seen it yet. I believe that the current code is correct in treating all of these in the same way; I think it would be a mistake to treat omission of element E in one way for cases 1-3 of the list just given and a different way for case 4. (If I've misunderstood, and HC is not advocating such a different treatment, then I apologize for misunderstanding his point.)

I believe the cases in which content models in one module refer to elements in other modules are much more common than Lou indicates, but perhaps he is excluding references to classes with members from other modules. Striking examples of cross-module references Include the references to teiHeader (from module header) in the models of TEI (module textstructure) and teiCorpus (module core).

hcayless commented 7 years ago

I'm not advocating for anything. I'm just pointing out the existence of case 4 and wanting to make certain it is handled too.

lb42 commented 7 years ago

I am not sure how an ODD can usefully be considered "legitimate" if it is also "incorrect" but let that pass. I just wanted to confirm that (a) you can write an ODD containing only elementRefs and pull in a whole bunch of element specs if your @source attribute indicates a file that contains those specs. It wont be much use unless you include the module tei, however, since that's where the majority of class definitions hang out, and without them the content models of the elements you've referenced won't mean very much. (b) yes, I was not considering references to classes, since a reference to a class which is not populated is entirely benign in the construction of a schema.

lb42 commented 7 years ago

Backwards compatibility alert #12 & #35. We tell people that if they have an ODD which worked with TEI P5 release x.y, where x and y are some way back in the past, the ODD processor will continue to generate the same schemas as it did when that version was current if they specify x.y as the version of the source against which an ODD is to be compiled. The assumption is that the processor will treat old specs the same way as it always did, so all you need do is to give it the right mixture of old specs. If we now change the way the processor behaves, obviously that is no longer true. Or may no longer be true.

martindholmes commented 7 years ago

@lb42 That's why I've always believed that the decision to split the TEI and Stylesheets repos apart was a mistake. The only way to remedy it really is to find a way to bind each TEI release to a specific Stylesheets release. Or put them back together again.

lb42 commented 7 years ago

@martindholmes well, we can always work out which stylesheet release corresponded with which TEI release, I suppose. But that might be counterproductive : I might want to go on using old specs, but benefit from super efficiency updates in the stylesheets. (I speak purely hypothetically of course)

hcayless commented 7 years ago

This is the point I was making above. I think it's probably solvable by essentially having a deprecation period for the old behavior, but we can't just make the change unless we're confident it won't break any ODDs.

martindholmes commented 7 years ago

So we want an impossible situation: all Stylesheet releases work with all TEI releases, and vice versa. What happens if I want a bugfix from Stylesheets version 7.50, but I need processing for an element that was removed long before that version was released?

In a perfect world (or maybe in P6), this is what I'd like:

One repo, including specs, guidelines and processing
Major version releases from that repo, each one being a branch
Bugfix releases to each major version released along that branch, but with no new features or other changes.

That way, if you're using TEI version 3.5, then you can check out the latest release in that branch to get all the bugfixes for it and process your files successfully; when you want new features, you have to migrate to the appropriate later version, and see explicit error messages or warnings when your code is incompatible, so you can update it. If you need to stay on the old branch but there's a bug, it can get reported and fixed, with fixes ported to other branches if appropriate.

lb42 commented 7 years ago

Just a general comment on the current ODD processor: it really doesn't do error reporting. The code is (as they say) gnarly and occasionally redundant or even just plain inexplicable. Finding out why it behaves the way it does is often hard. So I tend to think that the benefits of any concerted effort to improve on its behaviour considerably outweigh the possible downside of some old behaviours disappearing.

cmsmcq commented 7 years ago

On Mar 8, 2017, at 9:53 AM, Lou notifications@github.com wrote:

Backwards compatibility alert #15 and #35. We tell people that if they have an ODD which worked with TEI P5 release x.y, where x and y are some way back in the past, the ODD processor will continue to generate the same schemas as it did when that version was current if they specify x.y as the version of the source against which an ODD is to be compiled. The assumption is that the processor will treat old specs the same way as it always did, so all you need do is to give it the right mixture of old specs. If we now change the way the processor behaves, obviously that is no longer true. Or may no longer be true.

Perhaps one way to address this would be as follows. Unless I see a better proposal here before I get around to actually making the changes, I will prepare the changes to the stylesheets thus:

1 Add a parameter (I’ll call it $sgd here, for 'subset guarantee on deletion’) to the appropriate stylesheets (and anything else that needs to know).

2 Default value of $sgd depends on the indicated source. If it’s the version current now, or an earlier one, then the default value is false; if it’s a later version, the default value is true.

3 If $sgd is false, the existing behavior is grandfathered in for Relax NG schemas; if it’s true, then deletion (= omission) of an element will have the subset semantics described in the bug reports, which is what the DTD processing now has.

4 I expect that for debugging purposes if no others I may add a parameter controlling whether unsatisfiable content model expressions should be simplified or not. Default value should be to simplify.

A consequence of this design is that the compatibility guarantee is preserved, but the subset-semantics behavior can be obtained immediately by those who don’t want the current behavior, without waiting for a new release.

Side question, for the compatibility lawyers in the crowd: How exactly should the stylesheet sniff out the version number?

One might consult the tei:schemaSpec/@source attribute, look for a value matching the pattern "http://www.tei-c.org/Vault/P5/" + x + "." + y + "." + z + "/xml/tei/odd/p" + p + "subset.xml", and take the values x, y, and z from there. When the source is a local document, or a resource which doesn't match that pattern, this approach won't work (so the stylesheet will need to have a default value for the case when it cannot identify a version number for the source).
Instead of consulting the tei:schemaSpec 'source' attribute in the ODD being processed, one might instead consult the TEI document it points at. Assuming that $ED = /tei:TEI /tei:teiHeader[1] /tei:fileDesc /tei:editionStmt in this document, one could check for

$ED/tei:ref[contains(@target,’readme’)]/string() (: then tokenize() the string on ‘.’ and compare Dewey-decimal style against (3, 1, 0) :)

or

$ED/tei:date/@when
(: if le ‘2016-12-15’ then default to false(), else to true() :)

Are there other better approaches?

[Edited 8 March 2017 at 2 pm Mountain Time to try to eliminate potential confusion about which document's TEI header is being examined looking for a TEI version number.]

sydb commented 7 years ago

I think, @cmsmcq , you have hit it exactly right, except for one tweak: the @source of <schemaSpec> can also be (and one might say should be) of the form "tei:3.1.0" or "tei:current".

lb42 commented 7 years ago

One might even say that (and this is the most likely case) if the @source is not specified, then the default source assumed has to be tei:current -- as of the date the ODD was last revised. There's a table somewhere mapping TEI Release numbers to dates, for keen shedders of velocipedes.

TEIC / Stylesheets