Open sydb opened 4 years ago
A fourth and perhaps less disruptive solution would be to leave the version unsuffixed, and use @status to indicate whether this is an alpha/beta/whatever version. This would just entail making TEI a member of the att.docStatus class.
wouldn't it be nice if @version
and the TEI actually had a valid semantic version number, i.e. not 4.0.0a
. So +1 on @lb42 suggestion to use @status
for anything other than major.minor.patch
.
P.S.: using @version
this way would also solve our problem for <unicodeName>
I like that, @lb42 (and @duncdrum), but have to investigate whether it solves the problem I’m referring to.
@lb42:
Hmmm… don’t think so. if <TEI>
were a member of att.docStatus, then wouldn’t <TEI version="4.1.0" status="alpha">
mean that the TEI document was itself in the alpha stage of production, rather than the TEI document is expected to conform to the alpha version of P5 v. 4.1.0?
@duncdrum: I’m a bit confused.
@status
. (Which, as I said above, I don’t think does the job.)<unicodeName>
? (I thought part of the reason we picked the Unicode definition was just so it worked well with it; or is that reverse derivation or wishful thinking?)If @status
doesn’t do the trick. I’m in favor of changing the naming scheme to semantic version, regardless. So @version
can be used more easily on other things in local costumizations, without adjusting the regex pattern.
As for Unicode versions, I m pretty sure those will always be valid sem-ver as well.
I believe I am correctly representing the entire Council’s point of view to say that
@version
(from att.versioned) should match the syntax provided by sem-ver. (See below.)below
The regular expression, modified from the sem-ver spec itself, will be: 0|[1-9]\d*\.0|[1-9]\d*\.0|[1-9]\d*(-((0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)(\.(0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?(\+([0-9a-zA-Z-]+(\.[0-9a-zA-Z-]+)*))?
.
There seems to be an error in the adjusted regex pattern, if I understand the intention correctly that @version
should be able to contain valid sem-ver strings.
This seems to do the trick, YMMV:
(0|[1-9]\d*\.0|[1-9]\d*\.0|[1-9]\d*)\.?(-((0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)(\.(0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?(\+([0-9a-zA-Z-]+(\.[0-9a-zA-Z-]+)*))?
Had to rewrite it a little more to avoid validation errors (in oXygen) for p5odds-examples.rng
:
(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)(-((0|[1-9]\d*|\d*[\-a-zA-Z][\-0-9a-zA-Z]*)(\.(0|[1-9]\d*|\d*[\-a-zA-Z][\-0-9a-zA-Z]*))*))?(\+([\-0-9a-zA-Z]+(\.[\-0-9a-zA-Z]+)*))?
I did work on this in branch issue-1993
but it's still failing. If you wanna have a look … any feedback is appreciated :)
@peterstadler from the looks of it you haven t updated the 4.0.0a
version string, which is not a valid sem ver number. So Travis failing seems correct to me.
Also the French guidelines have 5.0 whereas the en has 5.0.0
I had not planned to do this until late tonight or tomorrow; will look at regexes then, but would not be surprised at failures until.
Oh dear. First thing I’m working on is the regular expression itself. For the regular expression posted above, I just started with the regex pointed at by the sem-ver spec, and removed the bits we don't need. (We don't need many of the parenthetical groupings, because we are not going to be doing replacement; and we don't want non-captured groups, because W3C doesn't support that.)
But (as @peterstadler points out), oXygen reports that regular expression is invalid. It is wrong, IMHO. This is just jing complaining of a character range (“charRange”, production 17, here used as part of a positive character group (“posCharGrp”, production 14)) which is only one character, a U+002D. I have always considered this a bug in jing (but have also been worried that I might be wrong, because rnv reports similar). If you read the spec, the productions imply that U+002D is allowed as a character used in a character range (in this case, a character range of only 1 character represented by itself). There is a constraint expressed in the prose, though (3rd bullet point):
The - character is a valid character range only at the beginning or end of a ·positive character group·
But in all cases in the regex I wrote, the U+002D is at the end of a positive character group. So a large part of me doesn’t care. After all, other W3C regex processors do not have a problem with this. While this attitude may get me off the hook morally, it does not solve our problem, because we use jing to validate our documents dozens of times a day.
So it seems to me we do have to fix this. That said, in the meantime @duncdrum thinks there is an error in the regular expression I posted, and posted an alternative; and @peterstadler (who I belatedly realize is actually trying to implement this ticket) has come up with a version that is valid per jing.
So I went about testing them. I used the set of versions used on the regex101 page that is pointed to by the sem-ver page (and I think @duncdrum used, too) as a test suite. I tried validating each of the 71 tests against each of the 3 regular expressions. I used Schematron because I could not use jing or rnv to validate RelaxNG.
All 3 of our regular expressions correctly say test cases 1–31 are valid. Then it gets weird.
So … so far, of the 3 regular expressions we have, @peterstadler’s is the clear winner. But it still says 19 of the supposedly invalid test cases are valid.
I have to admit, when I created my version above, I took the regex101 expression at face value, and did not test it. (I kinda doubt the other gentlemen did, either.) But it looks like either the test suite or the expression has some problems.
I am starting to think I should re-write this regex myself from scratch by reading the sem-ver spec, and ignoring other examples of regular expressions. But that will take several hours. And someone (maybe me) should look very carefully at the test suite, and see if any of those it claims are invalid are actually valid per the spec.
Here are my test files: ticket_1993_test_01.zip
@sydb @peterstadler @duncdrum In a vain(?) effort to try to simplify the hideousness of these regular expressions, I tried my hand at it with Syd's handy test file. This isn't perfect, but it's a little shorter anyway:
(\d+[-.])+((\d*[-.+]*)?([A-z]+[.+-]?[A-z0-9]*))*
Should we be trying to simplify this versioning system anyway?
Despite the regex issue, I feel a little bit uneasy (now that I looked at it more closely) about changing teidata.version
because it might break all sorts of users' encodings (e.g. <TEI version="2014">
will be marked invalid).
We also have teidata.versionNumber
(which by the way looks like a sloppy regex (as well) because it allows for e.g. "1.2.3.4" or "01.02" or even "0") which makes me think whether we should (only) have the two options teidata.semverNumber
(which follows the semver specification) and teidata.looselySomethingLikeAVersionNumber
(aka teidata.enumerated
).
So, I think what I'd like to see is that we keep backwards compatibility for TEI/@version
et al. but enforce a semantic versioning scheme for the Guidelines themself – and all other pieces of code and software within the TEIC realm.
I m less hesitant about backwards compatibility than @peterstadler. Things will break only if ppl specify a new @version
in their TEI file after upgrading. Which seems fair, we want to encourage sem-ver. My guess is that many projects, already use sem-ver anyways.
I m more hesitant about creating out own regex though. If we continue down that road, we should add the regex tests to the tei testsuite, to make sure that we stick to the sem-ver specs. Good or bad.
You have a long way to go to convince me on this, @peterstadler. Besides @duncdrum’s valid (pardon the pun) argument, the fact that @version
allowed "2014" is a corrigible error, and one that we should be embarrassed about.
I think there is a very strong argument to be made that the value of @version
should be required to be the correct version number. (And, for that matter, that it be a required attribute, but that is a different ticket.) That is, instead of a regular expression that would allow the value "2.3.0" on a document that is valid against P5 version 4.3.0, the schema associated with version 4.3.0 should require a value of "4.3.0" (or perhaps "4.3.(0|[1-9]\d*)", since patch fixes should not affect schema validity, eh?).
That is, of course, what I’m trying to do with tei_customization. And it’s not as simple as that, because I (at least at the moment) think we should allow “-alpha”, “+MoEML-6.2”, and perhaps most importantly “+Stylesheets-7.53.1” appendices.
But that’s not quite right, because SemVer gives “4.3.1+Stylesheets-7.53.1” the same precedence as “4.3.1+Stylesheets-7.53.6”, “4.3.1+Stylesheets-7.58.0”, or “4.3.1+platypus”, and I think we would want the later Stylesheets version to have higher precedence.
I'll withdraw from my argument and join @sydb and @duncdrum :)
Somehow I got confused by all the various @version
attributes but we're only discussing @version
on <TEI>
and <teiCorpus>
(and the deprecated <unicodeName>
).
Just created the PR #1996 because after some more testing (and reverted changes) I think it's good to go.
Thanks @sydb for the test files and indeed the results are strange. Yet, I believe this to be some Schematron issue because the following RelaxNG schema does validate your test file as intended (and works the same way with @sydb's provided rnc):
<grammar ns="" xmlns="http://relaxng.org/ns/structure/1.0" datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
<start>
<element name="tests">
<oneOrMore>
<element name="test">
<attribute name="n">
<data type="integer"/>
</attribute>
<element name="sydb">
<text/>
</element>
<element name="duncdrum">
<text/>
</element>
<element name="peterstadler">
<data type="token">
<param name="pattern">(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)(-((0|[1-9]\d*|\d*[\-a-zA-Z][\-0-9a-zA-Z]*)(\.(0|[1-9]\d*|\d*[\-a-zA-Z][\-0-9a-zA-Z]*))*))?(\+([\-0-9a-zA-Z]+(\.[\-0-9a-zA-Z]+)*))?</param>
</data>
</element>
</element>
</oneOrMore>
</element>
</start>
</grammar>
(NB: I couldn't paste in your regexes because they were flagged as invalid so your tests will always return ok.)
So, since we're not using Schematron for validating the @version
attribute values I think we can skip that particular side issue and move ahead with the working regex?
It seems I've been too quick with the PR #1996
To me it was clear that we'd follow the SemVer specification and that we'd want to update teidata.version
accordingly. I can see at least two benefits:
Of course, we'd need to flesh out what MAJOR, MINOR, and PATCH changes are in the TEI Guidelines world but IMHO that's a better exercise than inventing one's own version number scheme.
I asked on TEI-L at 2020-07-16 and on 2020-10-12 if removing the version attribute on <TEI>
and <teiCorpus>
would affect someone's work. There were few reactions, but all of them said that we should go ahead.
I asked on TEI-L at 2020-07-16 and on 2020-10-12 if removing the version attribute on
<TEI>
and<teiCorpus>
would affect someone's work. There were few reactions, but all of them said that we should go ahead.
Literally remove @version
from <TEI>
?? I must misunderstand, but that would definitely impact all my projects. Cant see why that would be a good idea either.
Yes, literally remove @version
from <TEI>
(and <teiCorpus>
). It would be a good idea to avoid the problem this ticket is about, to rid ourselves of an embarrassing corrigible error, and to reduce the chance of confusion when an instance has mis-matching indications of the version against which it is supposed to be valid.
Keep in mind there are already two other mechanisms (xml-model processing instruction and the <schemaRef>
element) for this.
@duncdrum Just out of interest, what do you use @version
for?
To specify the version of the Guidelines, that a document is using. Usually before and after a contributions is particularly important.
VF2F meeting has discussed this ticket and the related PR #1996 and the decision is to change the datatype specification of teidata.version
to allow for multiple variations (semver-like syntax will be one of those variations and the one associated with @version
within <TEI>
and <teiCorpus>
).
A preliminary proposal for teidata.version
datatypes:
teidata.version
TEI/@version, teiCorpus/@version
)att.gaijiProp
)
The version number of the TEI Guidelines is (at least nowadays) stored in ./p5.xml at /TEI/teiHeader/fileDesc/editionStmt/edition/ref[2]. The value therein is copied from ./VERSION.
The current contents of that file (in the dev branch) is “4.1.0a”. Makes sense. The current release is 4.0.0, and we are anticipating the release of 4.1.0 in August or thereabouts by referring to the development branch copy as an alpha. IIRC, we plan to switch that to beta at the time when we go into “refrigeration” mode, and thus will use “4.1.0b” for a couple of weeks; and then the new release will be “4.1.0”. All well and good.
BUT the definition of the
@version
attribute of<TEI>
is teidata.version, which in turn is defined as a token that matches "[\d]+(.[\d]+){0,2}", and gets that definition from the Unicode standard. Thus the ‘a’ or ‘b’ at the end of our version number (which I submit could just as well be an ‘α’ or ‘β’) for P5 cannot be used on the@version
attribute of<TEI>
(or<teiCorpus>
).This strikes me as bad on general principles. But in specific, it causes a problem for building tei_customization which wants to use the version number of the p5.xml that was used as its source as both the
@version
of its own<TEI>
and the@version
of a TEI document that conforms to it. (And, I daresay, all Exemplars should do that. But that’s a different ticket.)Possible solutions:
@version
of tei_customizatoin. I don’t like this because that means tei_customization is lying about its own pedigree.@version
of<TEI>
, or@version
of att.versioned to allow for an optional ultimate ‘a’, ‘b’, ‘α’, or ‘β’. I like this, but it does involve some work. First, we would have to decide on whether or not to keep@version
of<TEI>
and<unicodeName>
the same (allowing a trailing letter where none should be allowed in the case of<unicodeName>
), or have<unicodeName>
inherit then modify the attribute so it is further restricted. Second, we would no longer be able to rely soley on the Unicode definition of a version number.