discrepancy of P5 version number vs definition of version attribute of TEI

sydb commented 4 years ago

The version number of the TEI Guidelines is (at least nowadays) stored in ./p5.xml at /TEI/teiHeader/fileDesc/editionStmt/edition/ref[2]. The value therein is copied from ./VERSION.

The current contents of that file (in the dev branch) is “4.1.0a”. Makes sense. The current release is 4.0.0, and we are anticipating the release of 4.1.0 in August or thereabouts by referring to the development branch copy as an alpha. IIRC, we plan to switch that to beta at the time when we go into “refrigeration” mode, and thus will use “4.1.0b” for a couple of weeks; and then the new release will be “4.1.0”. All well and good.

BUT the definition of the @version attribute of <TEI> is teidata.version, which in turn is defined as a token that matches "[\d]+(.[\d]+){0,2}", and gets that definition from the Unicode standard. Thus the ‘a’ or ‘b’ at the end of our version number (which I submit could just as well be an ‘α’ or ‘β’) for P5 cannot be used on the @version attribute of <TEI> (or <teiCorpus>).

This strikes me as bad on general principles. But in specific, it causes a problem for building tei_customization which wants to use the version number of the p5.xml that was used as its source as both the @version of its own <TEI> and the @version of a TEI document that conforms to it. (And, I daresay, all Exemplars should do that. But that’s a different ticket.)

Possible solutions:

Stop using ‘a’, ‘b’, ‘α’, or ‘β’ in our version numbers. I don’t like this for a variety of reasons, most obviously that it means the version number of an in-development version looks like a released version number.
Strip the ‘a’, ‘b’, ‘α’, or ‘β’ off before using the VERSION number on @version of tei_customizatoin. I don’t like this because that means tei_customization is lying about its own pedigree.
Change the definition of @version of <TEI>, or @version of att.versioned to allow for an optional ultimate ‘a’, ‘b’, ‘α’, or ‘β’. I like this, but it does involve some work. First, we would have to decide on whether or not to keep @version of <TEI> and <unicodeName> the same (allowing a trailing letter where none should be allowed in the case of <unicodeName>), or have <unicodeName> inherit then modify the attribute so it is further restricted. Second, we would no longer be able to rely soley on the Unicode definition of a version number.

lb42 commented 4 years ago

A fourth and perhaps less disruptive solution would be to leave the version unsuffixed, and use @status to indicate whether this is an alpha/beta/whatever version. This would just entail making TEI a member of the att.docStatus class.

duncdrum commented 4 years ago

wouldn't it be nice if @version and the TEI actually had a valid semantic version number, i.e. not 4.0.0a. So +1 on @lb42 suggestion to use @status for anything other than major.minor.patch.

P.S.: using @version this way would also solve our problem for <unicodeName>

sydb commented 4 years ago

I like that, @lb42 (and @duncdrum), but have to investigate whether it solves the problem I’m referring to.

sydb commented 4 years ago

@lb42: Hmmm… don’t think so. if <TEI> were a member of att.docStatus, then wouldn’t <TEI version="4.1.0" status="alpha"> mean that the TEI document was itself in the alpha stage of production, rather than the TEI document is expected to conform to the alpha version of P5 v. 4.1.0?

@duncdrum: I’m a bit confused.

While semantic version numbering is well defined and has some extension capabilities we don’t currently have (and might want), the only difference between what we have now and and using semantic version numbering would be “4.1.0-alpha” over “4.1.0a”. I don’t understand why that relates to use of @status. (Which, as I said above, I don’t think does the job.)
What problem do we have for versioning of <unicodeName>? (I thought part of the reason we picked the Unicode definition was just so it worked well with it; or is that reverse derivation or wishful thinking?)

duncdrum commented 4 years ago

If @status doesn’t do the trick. I’m in favor of changing the naming scheme to semantic version, regardless. So @version can be used more easily on other things in local costumizations, without adjusting the regex pattern.

As for Unicode versions, I m pretty sure those will always be valid sem-ver as well.

sydb commented 4 years ago

I believe I am correctly representing the entire Council’s point of view to say that

The value of @version (from att.versioned) should match the syntax provided by sem-ver. (See below.)
For now we plan to use “alpha” and “beta” as our pre-release indicators (which always follow a hyphen. Thus values like “4.1.0-alpha”.)
We may use the “+build” mechanism at some point, but not immediately
We are not using the precise semantics of sem-ver, but rather the semantics we already use for major-dot-minor-dot-patch (after all, we’re not producing software with an API).

below The regular expression, modified from the sem-ver spec itself, will be: 0|[1-9]\d*\.0|[1-9]\d*\.0|[1-9]\d*(-((0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)(\.(0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?(\+([0-9a-zA-Z-]+(\.[0-9a-zA-Z-]+)*))?.

duncdrum commented 4 years ago

There seems to be an error in the adjusted regex pattern, if I understand the intention correctly that @version should be able to contain valid sem-ver strings. Screenshot 2020-05-04 at 11 50 44

This seems to do the trick, YMMV:

(0|[1-9]\d*\.0|[1-9]\d*\.0|[1-9]\d*)\.?(-((0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)(\.(0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?(\+([0-9a-zA-Z-]+(\.[0-9a-zA-Z-]+)*))?

peterstadler commented 4 years ago

Had to rewrite it a little more to avoid validation errors (in oXygen) for p5odds-examples.rng:

(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)(-((0|[1-9]\d*|\d*[\-a-zA-Z][\-0-9a-zA-Z]*)(\.(0|[1-9]\d*|\d*[\-a-zA-Z][\-0-9a-zA-Z]*))*))?(\+([\-0-9a-zA-Z]+(\.[\-0-9a-zA-Z]+)*))?

peterstadler commented 4 years ago

I did work on this in branch issue-1993 but it's still failing. If you wanna have a look … any feedback is appreciated :)

duncdrum commented 4 years ago

@peterstadler from the looks of it you haven t updated the 4.0.0a version string, which is not a valid sem ver number. So Travis failing seems correct to me.

Also the French guidelines have 5.0 whereas the en has 5.0.0

sydb commented 4 years ago

I had not planned to do this until late tonight or tomorrow; will look at regexes then, but would not be surprised at failures until.

sydb commented 4 years ago

Oh dear. First thing I’m working on is the regular expression itself. For the regular expression posted above, I just started with the regex pointed at by the sem-ver spec, and removed the bits we don't need. (We don't need many of the parenthetical groupings, because we are not going to be doing replacement; and we don't want non-captured groups, because W3C doesn't support that.)

But (as @peterstadler points out), oXygen reports that regular expression is invalid. It is wrong, IMHO. This is just jing complaining of a character range (“charRange”, production 17, here used as part of a positive character group (“posCharGrp”, production 14)) which is only one character, a U+002D. I have always considered this a bug in jing (but have also been worried that I might be wrong, because rnv reports similar). If you read the spec, the productions imply that U+002D is allowed as a character used in a character range (in this case, a character range of only 1 character represented by itself). There is a constraint expressed in the prose, though (3rd bullet point):

The - character is a valid character range only at the beginning or end of a ·positive character group·

But in all cases in the regex I wrote, the U+002D is at the end of a positive character group. So a large part of me doesn’t care. After all, other W3C regex processors do not have a problem with this. While this attitude may get me off the hook morally, it does not solve our problem, because we use jing to validate our documents dozens of times a day.

So it seems to me we do have to fix this. That said, in the meantime @duncdrum thinks there is an error in the regular expression I posted, and posted an alternative; and @peterstadler (who I belatedly realize is actually trying to implement this ticket) has come up with a version that is valid per jing.

So I went about testing them. I used the set of versions used on the regex101 page that is pointed to by the sem-ver page (and I think @duncdrum used, too) as a test suite. I tried validating each of the 71 tests against each of the 3 regular expressions. I used Schematron because I could not use jing or rnv to validate RelaxNG.

All 3 of our regular expressions correctly say test cases 1–31 are valid. Then it gets weird.

all 3 claim (incorrectly?) that 34–6, 50, 52–9, 61, 63, 65, 67, and 69–71 are valid
@sydb’s and @duncdrum’s also claim (incorrectly?) that 32, 33, 40, 43, 44, 60, 62, 64, and 66 are valid

So … so far, of the 3 regular expressions we have, @peterstadler’s is the clear winner. But it still says 19 of the supposedly invalid test cases are valid.

I have to admit, when I created my version above, I took the regex101 expression at face value, and did not test it. (I kinda doubt the other gentlemen did, either.) But it looks like either the test suite or the expression has some problems.

I am starting to think I should re-write this regex myself from scratch by reading the sem-ver spec, and ignoring other examples of regular expressions. But that will take several hours. And someone (maybe me) should look very carefully at the test suite, and see if any of those it claims are invalid are actually valid per the spec.

Here are my test files: ticket_1993_test_01.zip

ebeshero commented 4 years ago

@sydb @peterstadler @duncdrum In a vain(?) effort to try to simplify the hideousness of these regular expressions, I tried my hand at it with Syd's handy test file. This isn't perfect, but it's a little shorter anyway:

(\d+[-.])+((\d*[-.+]*)?([A-z]+[.+-]?[A-z0-9]*))*

Should we be trying to simplify this versioning system anyway?

peterstadler commented 4 years ago

Despite the regex issue, I feel a little bit uneasy (now that I looked at it more closely) about changing teidata.version because it might break all sorts of users' encodings (e.g. <TEI version="2014"> will be marked invalid). We also have teidata.versionNumber (which by the way looks like a sloppy regex (as well) because it allows for e.g. "1.2.3.4" or "01.02" or even "0") which makes me think whether we should (only) have the two options teidata.semverNumber (which follows the semver specification) and teidata.looselySomethingLikeAVersionNumber (aka teidata.enumerated).

So, I think what I'd like to see is that we keep backwards compatibility for TEI/@version et al. but enforce a semantic versioning scheme for the Guidelines themself – and all other pieces of code and software within the TEIC realm.

duncdrum commented 4 years ago

I m less hesitant about backwards compatibility than @peterstadler. Things will break only if ppl specify a new @version in their TEI file after upgrading. Which seems fair, we want to encourage sem-ver. My guess is that many projects, already use sem-ver anyways.

I m more hesitant about creating out own regex though. If we continue down that road, we should add the regex tests to the tei testsuite, to make sure that we stick to the sem-ver specs. Good or bad.

sydb commented 4 years ago

You have a long way to go to convince me on this, @peterstadler. Besides @duncdrum’s valid (pardon the pun) argument, the fact that @version allowed "2014" is a corrigible error, and one that we should be embarrassed about.

I think there is a very strong argument to be made that the value of @version should be required to be the correct version number. (And, for that matter, that it be a required attribute, but that is a different ticket.) That is, instead of a regular expression that would allow the value "2.3.0" on a document that is valid against P5 version 4.3.0, the schema associated with version 4.3.0 should require a value of "4.3.0" (or perhaps "4.3.(0|[1-9]\d*)", since patch fixes should not affect schema validity, eh?).

That is, of course, what I’m trying to do with tei_customization. And it’s not as simple as that, because I (at least at the moment) think we should allow “-alpha”, “+MoEML-6.2”, and perhaps most importantly “+Stylesheets-7.53.1” appendices.

But that’s not quite right, because SemVer gives “4.3.1+Stylesheets-7.53.1” the same precedence as “4.3.1+Stylesheets-7.53.6”, “4.3.1+Stylesheets-7.58.0”, or “4.3.1+platypus”, and I think we would want the later Stylesheets version to have higher precedence.

peterstadler commented 4 years ago

I'll withdraw from my argument and join @sydb and @duncdrum :) Somehow I got confused by all the various @version attributes but we're only discussing @version on <TEI> and <teiCorpus> (and the deprecated <unicodeName>).

peterstadler commented 4 years ago

Just created the PR #1996 because after some more testing (and reverted changes) I think it's good to go.

Thanks @sydb for the test files and indeed the results are strange. Yet, I believe this to be some Schematron issue because the following RelaxNG schema does validate your test file as intended (and works the same way with @sydb's provided rnc):

<grammar ns="" xmlns="http://relaxng.org/ns/structure/1.0" datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <start>
    <element name="tests">
      <oneOrMore>
        <element name="test">
          <attribute name="n">
            <data type="integer"/>
          </attribute>
          <element name="sydb">
            <text/>
          </element>
          <element name="duncdrum">
            <text/>
          </element>
          <element name="peterstadler">
            <data type="token">
              <param name="pattern">(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)(-((0|[1-9]\d*|\d*[\-a-zA-Z][\-0-9a-zA-Z]*)(\.(0|[1-9]\d*|\d*[\-a-zA-Z][\-0-9a-zA-Z]*))*))?(\+([\-0-9a-zA-Z]+(\.[\-0-9a-zA-Z]+)*))?</param>
            </data>
          </element>
        </element>
      </oneOrMore>
    </element>
  </start>
</grammar>

(NB: I couldn't paste in your regexes because they were flagged as invalid so your tests will always return ok.)

So, since we're not using Schematron for validating the @version attribute values I think we can skip that particular side issue and move ahead with the working regex?

peterstadler commented 4 years ago

It seems I've been too quick with the PR #1996

To me it was clear that we'd follow the SemVer specification and that we'd want to update teidata.version accordingly. I can see at least two benefits:

in following the SemVer scheme, the TEI gains a predictable and interoperable version number
the TEI acknowledges that standard and facilitates its use in TEI documents

Of course, we'd need to flesh out what MAJOR, MINOR, and PATCH changes are in the TEI Guidelines world but IMHO that's a better exercise than inventing one's own version number scheme.

martinascholger commented 3 years ago

I asked on TEI-L at 2020-07-16 and on 2020-10-12 if removing the version attribute on <TEI> and <teiCorpus> would affect someone's work. There were few reactions, but all of them said that we should go ahead.

duncdrum commented 3 years ago

I asked on TEI-L at 2020-07-16 and on 2020-10-12 if removing the version attribute on <TEI> and <teiCorpus> would affect someone's work. There were few reactions, but all of them said that we should go ahead.

Literally remove @version from <TEI> ?? I must misunderstand, but that would definitely impact all my projects. Cant see why that would be a good idea either.

sydb commented 3 years ago

Yes, literally remove @version from <TEI> (and <teiCorpus>). It would be a good idea to avoid the problem this ticket is about, to rid ourselves of an embarrassing corrigible error, and to reduce the chance of confusion when an instance has mis-matching indications of the version against which it is supposed to be valid.

Keep in mind there are already two other mechanisms (xml-model processing instruction and the <schemaRef> element) for this.

martindholmes commented 3 years ago

@duncdrum Just out of interest, what do you use @version for?

duncdrum commented 3 years ago

To specify the version of the Guidelines, that a document is using. Usually before and after a contributions is particularly important.

HelenaSabel commented 6 months ago

VF2F meeting has discussed this ticket and the related PR #1996 and the decision is to change the datatype specification of teidata.version to allow for multiple variations (semver-like syntax will be one of those variations and the one associated with @version within <TEI> and <teiCorpus>).

A preliminary proposal for teidata.version datatypes:

teidata.version
- teidata.version.semVer (TEI/@version, teiCorpus/@version)
- teidata.version.calVer (e.g.: 20240315)
- teidata.version.UnicodeVer (att.gaijiProp)
- teidata.version.versionNumber (e.g. 4.7.0)

TEIC / TEI

discrepancy of P5 version number vs definition of version attribute of TEI #1993