TEIC / Stylesheets

TEI XSL Stylesheets
232 stars 124 forks source link

Fix regression: duplicate idents in `*Spec`s #681

Closed joeytakeda closed 3 months ago

joeytakeda commented 5 months ago

Attempting to resolve the problem of duplicate idents in specs (as mentioned in #678 and #645 and potentially others). At the time of my filing this PR, Test1 is failing (in particular, test21), so more investigation (and more testing) is required before this can be merged.

joeytakeda commented 5 months ago

Fixed a cardinality constraint, so all tests are now passing. Going to mark this as ready for review

lb42 commented 5 months ago

Will this also fix #680? Happy to test it if so!

joeytakeda commented 5 months ago

Will this also fix #680? Happy to test it if so!

@lb42 : please do! I haven’t checked to see if this helps with the issues when chaining (though I suspect it might)

ebeshero commented 3 months ago

This is passing Test 1 for me, and mostly Test2 (except I think something is borked about the " [exec] BUILD FAILED [exec] /stylesheet/common/teianttasks.xml:205: Problem reading /stylesheet/Test2/outputFiles/temp-for-ant.zip") @joeytakeda is that the same (known) problem you see?

@lb42 Have you had a chance to test this on a chained ODD? I'd like to see how this does on one of your chained scenarios and see how much this simple yet profound fix addresses those other tickets. We're preparing to refridge (or is that heat) this Friday!

lb42 commented 3 months ago

I attach a zip containing a compiled ODD and another ODD which uses it. This produces a schema OK, but has trouble running odd-xhtml conversion : both in oxygen and using teitohtml :-( problem.zip Message is System ID: /home/lou/.com.oxygenxml/extensions/v25.0/frameworks/tei/tei/xml/tei/stylesheet/common/functions.xsl Scenario: TEI ODD XHTML XML file: /home/lou/Public/Schemas/ODD/eltec-0.xml XSL file: /home/lou/.com.oxygenxml/extensions/v25.0/frameworks/tei/tei/xml/tei/stylesheet/odds/odd2odd.xsl Document type: TEI ODD Engine name: Saxon-HE 11.4 Severity: fatal Problem ID: XPTY0004 Description: A sequence of more than one item is not allowed as the first operand of 'gt' A sequence of more than one item is not allowed as the first operand of 'gt' (@versionDate="2005-12-24", @versionDate="2005-12-24", @versionDate="2005-12-24")

lb42 commented 3 months ago

I spoke too soon -- the RNG generated is also wrong.

Advice on how to test/ confirmation that "it works for me" both gratefully received!

lb42 commented 3 months ago

The script I run says teitoodd --localsource=/home/lou/Public/TEI/P5/p5subset.xml eltec.xml eltec-library.xml teitorelaxng --localsource=/home/lou/Public/TEI/P5/p5subset.xml eltec-0.xml ../eltec-0.rng teitohtml --odd --localsource=/home/lou/Public/TEI/P5/p5subset.xml eltec-0.xml ../Doc/eltec-0.html I see I didnt include the mother odd in the zip, so here it is again problem.zip

sydb commented 3 months ago

@lb42 — I get An include with href 'eltec-body.xml' failed, and no fallback element was found from any attempt to process eltec-0.xml. I think we can just comment out that one <xi:include> line, because nothing in the <body> should change this problem, right? (I.e., I do not see any <specGrpRef>s that would be unresolvable without the <body> element.)

lb42 commented 3 months ago

Ah yes, just ignore that include. It's prose waffle not relevant to the odd

sydb commented 3 months ago

So I tried the the same 3 cmds @lb42 describes, first using the Stylesheets dev branch, and then using this branch. (Note though, that because I did not specify --localsource, I am building against the version of p5subset in each Stylesheets repo; I checked, though, they are the same as each other, although probably not the same as the local one Lou used.)

I then compared each output of these cmds when run using the dev branch of Stylesheets to the corresponding output when run using this branch.

The HTML files were the same (except for timestamps and comments). The eltec-library.xml files seem vastly different from one another. At least in 1 or two cases, though, the difference is that the dev-generated version has duplicate code that the version generated from this branch does not. However, there are lots of other differences, too.

The RELAX NG files, on the other hand, were only somewhat different. But at least for most of the differences which I looked at in detail, it seems to me the version from this branch is the better one.

  1. att.canonical.attributes is incorrectly defined in the dev-generated version: it is defined as 2 copies of the pattern att.canonical.attribute.ref, which itself is never defined. It is “correctly” defined in the version generated against this branch, in that there is only 1 occurrence of the att.canonical.attribute.ref and it is defined, I think properly (i.e., has example from eltec.xml).
  2. <measure> is (incorrectly) a member of model.measureLike in the dev-generated version; model.measureLike is (correctly) empty in the version generated against this branch. (This is because <measure> is removed from model.measureLike in eltec.xml.)
  3. In the defintion of the <TEI> element:
    1. The gloss “TEI document” occurs three times in a row in the dev-generated version, and (correctly) occurs only once in the version generated against this branch.
    2. The description “contains a single … or <teiCorpus> element.” occurs three times in a row in the dev-generated version, and (correctly) occurs only once in the version generated against this branch.
    3. HOWEVER, the element names in that description have lost their pointy brackets in the version generated against this branch.
    4. The reference “4. Default Text Structure 15.1. Varieties of Composite Text” also occurs 3 times in the dev-generated version.
    5. But what is really bad is that the content model is repeated 3 times in the dev-generated version
  4. I did not look carefully, but it seems that this tripling problem occurs (in the dev-generated version) for att.global and <measure>, too.
  5. The RELAX NG generated against this branch is invalid due to duplicate copies of the definition of @calendar in various elements.
  6. The RELAX NG generated against dev is invalid for the same reason, but also because the definition of <measure> is FUBAR.

I am not entirely sure what all this means other than a) ODD chaining is quite fragile, and b) while I still think @joeytakeda’s change, here, is probably an improvement, it seems to me more testing is probably in order.

lb42 commented 3 months ago

Thanks for confirming that the recent changes made to oddtoodd have fubarred my nice eltec schema! Clearly Joey's changes improve the status quo, even if it's not clear whether they fix everything. Rolling back and insufficiently tested versions of odd2odd =might also be an option.

ebeshero commented 3 months ago

@lb42 @sydb In our Council meeting today, we discussed the complexities of testing to try to resolve the chained ODD processing problems, and thanks to @joeytakeda for stepping us through what he tried and precisely what the problems are. We're marking this as a release blocker. We're talking about testing previous Stylesheets versions in addition to this branch to compare what's going on and see when / how we introduced the problem. More soon, we hope!

lb42 commented 3 months ago

It seems pretty clear that whatever is going wrong with my ELTeC schema has nothing to do with ODD chaining as such. There are other examples of ODD chaining which don't exhibit this behaviour, and there is a stand alone version of the ELTeC schema (attached as a text file) which does

eltec-standalone.txt

ebeshero commented 3 months ago

@lb42 It turns out you're correct, that what was going wrong with ELTeC schema wasn't to do with ODD chaining at all. We investigated the problem in our Stylesheets meeting today and discovered an interesting problem: What generates the duplicate idents is specific this problem:

We tested this with other attributes, too, and we consistently generate the same error that we're now working on in this new ticket: #687

Basically you can address this by updating your reference to @calendar in your ODDs, but we're hoping to stop this kind of error from happening at all...

lb42 commented 3 months ago

Thanks for the comment @ebeshero, but I don't think that's my problem. The eltec-standalone.txt file which I posted on this ticket does not reference @calendar at all (the line you quote is commented out), but this file still generates errors when processed to HTML. Please try downloading my test file, renaming it to .xml, and open it with an Oxygen using the most recent framework release (4.8.44) and you will see the problem.(Interestingly, if I use the previous version of the oxygen framework, the problem goes away, which suggests strongly that this is something you guys introduced...

ebeshero commented 3 months ago

@lb42 No indeed, we found the problem in eltec.xml, where from line 122 we see the following:

     <classSpec type="atts" ident="att.datable" mode="change">
      <attList>
       <attDef ident="calendar" mode="delete"/>
       <attDef ident="period" mode="delete"/>
      </attList>
     </classSpec>

We found the Relax NG generated from this ODD as well as the HTML to be bearing multiple definitions of the very @calendar attribute it was calling to delete, and @period to be handled just fine.

You're right that this is based on something we changed. That change was simply moving @calendar out of att.datable and into a new class, att.calendarSystem. The processing of that should not have been generating duplicate stuff, but it does, and we think we've figured out where and how, and have a (possible/quite likely) repair for it. See #687.

joeytakeda commented 3 months ago

Hi @lb42 — just for completeness' sake, I ran teitorng and teitohtml on the eltec-standalone file you sent using the dev branch (which now includes the above change), which ran without errors.

I believe your issue here (as it was in #645 and #678) was duplicate elementSpecs with the same ident (in this case, it was the elementSpec for the TEI element, defined around lines 147, 319, and 888). This was a regression in the Stylesheets (caused by change to ensure uniqueness in element names across namespaces), but since none of our test files had duplicate elementSpecs, this was never caught.

At the very least, this will make it into the new release (which should occur within the next few weeks). As far as I understand it, though, the oXygen framework will only update with a new oXygen release (though @martindholmes will know more about that)

lb42 commented 3 months ago

Sorry I was insufficiently clear. "the problem" i am referring to is nothing to do with the @calendar related bug (thanks for fixing which). The file "eltec-standalone" doesn't use eltec.xml -- as the name suggests it is a standalone unchained version. But it has something wrong with it because it fails to generate HTML. The message generated is in XSL file: /home/lou/.com.oxygenxml/extensions/v25.0/frameworks/tei/tei/xml/tei/stylesheet/odds/odd2odd.xsl A sequence of more than one item is not allowed as the first operand of 'gt' (@versionDate="2005-12-24", @versionDate="2005-12-24", @versionDate="2005-12-24") Maybe I should raise this as a separate issue.

lb42 commented 3 months ago

Ah, I have now seen Joey's comment. Thanks! There is indeed a duplicate declaration for TEI in the standalone file (there are three in fact) and removing it removes the problem. BUT this is very worrying: the ODD system is intended to permit multiple declarations for the same object. A processor has to know how to reconcile/unify them sensibly...

martindholmes commented 3 months ago

@lb42 I don't believe there is a canonically sensible way to handle duplicate definitions, and the ODD spec doesn't provide such an algorithm, so I think ATOP will treat them as errors and argue that they should be invalid.

lb42 commented 3 months ago

@martindholmes It rather depends what you mean by duplicate definition. If i supply an elementspec containing just an exemplum for example is that a duplicate?

martindholmes commented 3 months ago

Yes. You can devise scenarios in which a human could effortlessly combine the instances into a coherent whole, but it's difficult to write an algorithm to determine when that pertains and when it doesn't. And if you can easily combine them, then why not do it in your ODD?

lb42 commented 3 months ago

@martindholmes Seems to me that if i say <elementSpec ident="foo" mode="change"> it is plausible to raise an error, or at least a warning, in the case where there ISN'T a duplicate element with ident foo. So it rather depends what you are classing as "duplicate". If there are two or more <elementSpec ident="foo" mode="change"> I can see why a lazy ODD processor might not wish to go through the tedious business of deciding whether the various changes are mutually compatible, but my understanding is that the ODD system requires such capabilities all over the place. Of course atoc is free to choose whether or not to provide them, but they are not inherently nonsensical. You might well when authoring your ODD regard the production of examples as a separate exercise from defining the content models etc. and so provide them in a separate specGroup. You might even want to have different sets of examples for different applications of the same ODD subset. So you might have lots of "duplicate" specs containing just an exemplum.

martindholmes commented 3 months ago

@lb42 As far as we know (based on all the ODDs we collected as part of our initial survey when starting ATOP), you're the only person who uses duplicate elementSpecs -- there was one other case but it was a genuine error, with the same elementSpec appearing twice. I think if you have a very clear idea of exactly how an ODD processor should resolve all the potential permutations arising from a multiplicity of elementSpecs with the same @ident and namespace, in the various orders in which they could appear, then I think it would be up to you to lay out the algorithm, and up to the Council to decide whether to adopt it as part of the ODD specification. But since most ODD writers never do this, and there is no formal algorithm in the specification, I think it makes more sense for ATOP to treat it as an error. You may think that's lazy, but we also have to consider the downstream effects of writing lots of code to resolve unpredictable combinations of specs without any guidance from the spec, and then condemn future maintainers to wrestle with it. @sydb and @HelenaSabel, any thoughts on this?

lb42 commented 3 months ago

I am happy to withdraw the provocative term "lazy"! I don't think you're really addressing my point though -- which is that all ODD writers, including ones not yet born, need clear guidance on how far they need to go in combining into a single spec all its constituent parts.

ebeshero commented 3 months ago

I have to second @martindholmes here after we've repeeatedly struggled with repairs and tests for odd2odd.xsl. Trying to figure out which conditions are being handled and when is a pretty complicated process in the current odd2odd, and trying to account for them in ATOP won't be easy.

Anyway, @lb42 's ELTeC files did prove immensely helpful for finding that serious problem with processing re-located attribute classes! But having to reconcile multiple elementSpecs sharing the same @ident in a single standalone ODD does seem an extraordinary expectation. I could see these arising in separate files in a chaining process, but shouldn't the standalone at least reconcile them into a single elementDef? And maybe this discussion of guidance for handling multiple constituent parts of odds, chained, and unchained, should be a new ticket.

martindholmes commented 3 months ago

@ebeshero The way I conceive of chaining is that it's a sequence of discrete steps, so no matter how long the chain, the processor should only ever have to resolve one unique spec in the customization against one unique spec in the preceding source ODD.