TEIC / TEI

The Text Encoding Initiative Guidelines
https://www.tei-c.org
Other
279 stars 88 forks source link

Multiple `<attDef>`s with the same `@ident` in the same `<attList>` should be invalid #2371

Open martindholmes opened 1 year ago

martindholmes commented 1 year ago

In the ATOP work, we came across an in-vivo ODD file which has this content:

<attList>
  <attDef ident="level" mode="change">
    <valList mode="add" type="closed">
      <valItem ident="j"/>
      <valItem ident="m"/>
    </valList>
  </attDef>
  <attDef ident="type" mode="change">
    <valList mode="add" type="closed">
      <valItem ident="italic"/>
    </valList>
  </attDef>
  <attDef ident="xml:id" mode="delete"/>
  <attDef ident="n" mode="delete"/>
  <attDef ident="facs" mode="delete"/>
  <attDef ident="key" mode="delete"/>
  <attDef ident="type" mode="change">
    <valList mode="add" type="closed">
      <valItem ident="italic"/>
    </valList>
  </attDef>
</attList>

The repetition of the @type attDef is clearly an error, but it's not invalid against tei_odds or tei_all. I propose introducing a Schematron rule to prevent this. In cases where multiple definitions are not identical, then a significant burden would be placed on a) maintainers of the ODD specification to specify exactly how they bear on each other and how they are expected to be combined during processing, and b) maintainers of processing tools. Forcing ODD writers to consolidate their modifications to a single attribute into a single <attDef> seems far preferable. A simple Schematron rule would constrain this.

lb42 commented 1 year ago

I am not sure I agree with this, though it's an obvious optimisation that any implementer might choose to make, and I have no quarrel with your doing so in ATOP. But for the record, the reason it's like that is that I at least thought you might well want to take several bites at the definition of an attribute in an ODD chaining context. For example, you might define a base set of @type values in your "mother" ODD, and then add different sets of extra values in ODDs which are chained to it (or do I mean from it). Yes, of course you risk shooting yourself in the foot by making contradictory assertions for the same attribute, but that happens elsewhere too. If you're clever enough to chain your ODDs you're probably clever enough to notice such errors.

sydb commented 1 year ago

@lb42 — Your argument seems quite reasonable to support a …/elementSpec[@ident eq 'duck']/attList/attDef[@ident eq 'quack'][@mode eq 'change'] in both the ODDs for customization A and for customization B, which might well be chained in one direction or the other. But (I think) @martindholmes is arguing against having 2 cases of that XPath in the same <schemaSpec>.

sydb commented 1 year ago

I should add that I am not 100% convinced @martindholmes is right (that having 2 occurrences of that same XPath within the same <schemaSpec> is far more trouble than it is worth), only 90%.

That is because the only reason I can come up with on-the-spot for allowing this would be so that users could group all of their <gloss>es, all of their <desc>s, and all of their <valList>s (or whatever) for a set of attributes together; or maybe group all of the English glosses, descriptions, and values in one <attDef>, all the Italian ones in another, etc. While that is an obvious advantage, I think it is far outweighed by the difficulties imposed on the specification writers, those who write processors, and users that don’t do that and would have preferred an error message. (Because face it, the vast majority of these are going to be by mistake.)

But I think it worth asking if there are other use cases that I have not thought of that might make this an important capability to retain in PureODD.

martindholmes commented 1 year ago

What I'm actually suggesting is much more constrained than the discussion @sydb has opened up. I think it should be invalid to have multiple <attDefs> with the same @ident within the same <attList>. I think this is only likely to happen as a result of error (as in this case), and we could simply prevent it with Schematron. If it's not deemed erroneous, then the logic by which these things are to be combined must be defined in the Guidelines, and then we must include an implementation of that logic in one of the earlier stages of our ATOP processing.

On the question of whether a similar constraint should be applied to (e.g.) multiple <elementSpec>s which are children of a single <schemaSpec> having the same @ident, I'm open to being convinced of the value of such things, but the same rules apply: if we allow it, we must say what we mean by it and how all the potential ambiguities are to be resolved. And if there are no genuine instances of such things in the wild, then I would favour making it invalid until such time as someone has a use-case which is compelling enough to justify all the extra work it involves.

sydb commented 1 year ago

So, @martindholmes, you are suggesting we not allow 2 (or more) <attDef> siblings to have the same @ident, regardless of @mode, yes? This makes sense to me,[1] at least for the cases where the <attList> is a grouping, not an alternation (i.e., when there is not org="choice"). I have to think about the <attList org="choice"> case more.

Since <attList>s can nest, even without worrying about org="choice" the test in [2] is, although correct, insufficient to catch all cases.

So, thinking a bit about org="choice". Should the following not be legal?

<elementSpec ident="BG" mode="add" ns="http://www.example.edu/medical/chart/ns">
  <gloss>blood glucose</gloss>
  <classes>
    <memberOf key="att.global"/>
    <memberOf key="att.datable.w3c"/>
    <memberOf key="model.XMCdata"/>
  </classes>
  <content><empty/></content>
  <!--
      Here is the interesting bit — two mutually exclusive definitions
      of the same attribute:
  -->
  <attList org="choice">
    <attDef ident="value">
      <desc>30–499 mg/dL</desc>
      <datatype minOccurs="1" maxOccurs="1">
        <dataRef name="positiveInteger" restriction="([3-9]|[1-4][0-9])[0-9]"/>
      </datatype>
      <remarks>
        <p>A decimal point is not permitted.</p>
      </remarks>
    </attDef>
    <attDef ident="value">
      <desc>1.6–27.6 mmol/L</desc>
      <datatype minOccurs="1" maxOccurs="1">
        <dataRef name="decimal" restriction="(1\.[6-9])|(([2-9]|1[0-9]|2[0-6])\.[0-9])|(27\.[0-6])"/>
      </datatype>
      <remarks>
        <p>The decimal point is required.</p>
      </remarks>
    </attDef>
  </attList>
</elementSpec>

Sure, there may be better ways to do that, but I am not sure someone who has crafted this wants to be yelled at about having two <attDef>s with the same identifier. We would also want output RELAX NG that does the right thing, of course. And with the current stylesheets, we do get a correct schema (albeit with almost useless error messages when your document has an invalid @value.)

Notes [1] As I alluded to above, I think we can imagine cases where a user might want two (or more) definitions of the same attribute in the same list. But given that a) no one has ever naturally come up with such a case in the real world (at least, not that we know of) since P5 was introduced; b) many users (including me) make the inadvertent mistake of defining the same attribute twice in the same list, and c) allowing 2 (or more) puts a significant burden on the writers of ODD specs and processors, it seems like a good idea. [2] Here is a first crack at a rule for this. This one is probably insufficient due to the nested and choice attribute list concerns.

<sch:rule context="tei:attDef[@ident]">
  <sch:let name="ident" value="normalize-space(@ident)"/>
  <sch:report test="following-sibling::tei:attDef[ normalize-space(@ident) eq $ident ]">
    Only one definition of a given attribute allowed per attribute list.
    (In this case "<sch:value-of select="$ident"/>" is defined <sch:value-of select="count(../tei:attDef[ normalize-space(@ident) eq $ident ] )"/> times.)
  </sch:report>
</sch:rule>

[3] Here is a better crack at it. This one correctly handles nested attribute lists, except that it simply ignores any attribute defined inside a choice attribute list.

<sch:rule context="tei:attList[ not( ancestor::tei:attList ) ]">
  <!-- generate a sequence of my <attDef> descendants (that are not alternates): -->
  <sch:let name="defs" value="descendant::tei:attDef[ not( parent::tei:attList[ @org eq 'choice'] ) ]"/>
  <!-- get a sequence of the @idents of those <attDef>s: -->
  <sch:let name="idents" value="$defs/normalize-space(@ident)"/>
  <!-- get a sequence of any that occur 2+ times: -->
  <sch:let name="dups" value="for $n in $idents return $idents[ . eq $n ][2]"/>
  <!-- remove any duplicates in the list of duplicates (-: -->
  <sch:let name="distinct_dups" value="distinct-values( $dups )"/>
  <!-- if there any values in list of distinct duplicates, warn user about them: -->
  <sch:assert test="count($distinct_dups) eq 0">
    Within this attribute list the following attributes have been defined multiple times: <sch:value-of select="$distinct_dups"/>.
  </sch:assert>
</sch:rule>

[4] Both [2] and even [3] are a bit too permissive. That is, they will incorrectly fail to flag some unusual cases as errors; but I do not think either would falsely flag anything that is OK as an error.

martindholmes commented 1 year ago

@sydb I don't understand how a schema processor or validator could distinguish between two definitions of the same attribute, especially when their definitions are incompatible. How would that work in practice? What's the use-case?

sydb commented 1 year ago

Sorry, @martindholmes, perhaps because it is way past my bedtime, I don’t get the question. Are you asking about having two definitions of an attribute with the same name in alternation with one another (in which case I think the use case given above is pretty compelling, no?) or defining two attributes with the same name that could be on the same element at the same time (which is simply not allowed, so not of any concern to us here)? Except for the case of alternation, I am not talking (much) about anything that would be processed by a schema processor or validator (unless you think of an ODD processor as a schema processor :-). I am wondering (aloud) about cases where multiple occurrences of <attDef> are used to generate a single definition of an attribute in the output schema. (And concluding that such cases are not in demand, and would cause more trouble than they are worth.)

martindholmes commented 1 year ago

Just making a note that in order to create this problem, you need to create attDefs with different idents but with the same altIdent. Although the @mode combinations and resulting processor actions for attDefs with the same ident are clearly defined, there is no guidance on how to handle colliding altIdents.

ebeshero commented 1 year ago

Council F2F Paderborn: We recognize potential issues with ODD chaining in processing a choice between two definitions of an <attDef> with the same @ident. So we think @sydb should first proceed by seeing if <altIdent> will work in this case, and then let's revisit this ticket.

ebeshero commented 1 year ago

Update: @sydb has tested and found the use of <altIdent> performs as expected: the attribute name is defined by the <altIdent>. So, we agree with the recommendation that we establish a rule that no two descendants of the same <attList> should share the same @ident. (Instead people should use <altIdent>).

sydb commented 1 year ago

So given Council’s approval of the restriction “within an <attList> no two <attDef>s should have the same @ident (regardless of @mode)” rule (which is not intended to affect chained <attLists>, but is intended to apply to attribute lists assembled from various specification groups), the following should do the job:

  <sch:pattern>
    <sch:rule context="tei:attList[ not( ancestor::tei:attList ) ]">
      <!-- generate a sequence of my <attDef> descendants -->
      <sch:let name="defs" value="descendant::tei:attDef"/>
      <!-- get a sequence of the @idents of those <attDef>s: -->
      <sch:let name="idents" value="$defs/normalize-space(@ident)"/>
      <!-- get a sequence of any that occur 2+ times: -->
      <sch:let name="dups" value="for $n in $idents return $idents[ . eq $n ][2]"/>
      <!-- remove any duplicates in the list of duplicates (-: -->
      <sch:let name="distinct_dups" value="distinct-values( $dups )"/>
      <!-- if there are any values in list of distinct duplicates, warn user about them: -->
      <sch:assert test="count($distinct_dups) eq 0">
        Within this attribute list the following attributes have been defined multiple times: <sch:value-of select="$distinct_dups"/>.
      </sch:assert>
    </sch:rule>

Except that it does not take into account <specGrpRef>s.

sydb commented 7 months ago

The ATOP group (@sydb, @martindholmes, & @HelenaSabel) discussed this in detail. We think that:

Thus making GREEN for adding this rule.

In discussing this we came across another combination that makes no sense; see #2533.

sydb commented 7 months ago

Just so we’re all on the same page, I claim the following should be invalid.

  <attList>
    <attDef ident="a"/>
    <attDef ident="b"/>
    <attList org="choice">
      <attDef ident="b"/>
      <attDef ident="c"/>
    </attList>
    <attDef ident="d"/>
  </attList>

Even though it is possible to write an XML document that conforms to this (by picking choice @c), it is also possible to look at this and think that it is OK for the element to have two @b attributes, which is not allowed in XML. Furthermore, I do not think it is possible to generate a valid RELAX NG grammar or DTD that represents this weirdness. (Probably not an XSD, either, but what do I know?)

I am going to proceed on this ticket under the assumption that everyone agrees with this position. So if you disagree, speak up!

sydb commented 7 months ago

Another post along the same lines — Is it OK to have two or more <attDef ident="whatever" mode="delete"/>? Your immediate reaction might be “sure, no harm in deleting something more than once, the result is the same”, which I think is true. But I think, and what little evidence we have supports, that these are typically errors where the two deletions are not all that close to each other. Thus it seems to me the risk of serious problem occurs not when a processor tries to delete something that has already been deleted, but when the ODD writer comes along a year later and wants to re-instate the attribute, and only removes one of the deletions. So I am writing the constraint as it was discussed above, and not carving out an exception for mode=delete.