Closed mccaskey closed 2 years ago
I think the Guidelines are correct (and should be made more clear) and I think the addition of model.resourceLike to the content model as a whole here is wrong. If one is going to use teiCorpus and then have some alternating teiHeader and text elements in it, it is far better to wrap those in a TEI element, which is hardly any problem.
I believe this is an error that crept in when we were making changes to the membership of model.resourceLike and added text to it in release 3.0.0. Compare: http://www.tei-c.org/Vault/P5/3.0.0/doc/tei-p5-doc/en/html/ref-model.resourceLike.html and http://www.tei-c.org/Vault/P5/2.9.0/doc/tei-p5-doc/en/html/ref-model.resourceLike.html
While its use inside TEI makes sense, its use in the content model of teiCorpus doesn't. (IMHO).
I think I agree w/ @jamescummings, here. I think adding model.resourceLike+ to the before-TEI portion of the content model (which was done in 2.4.0) made sense, but
a) The content model should have been changed to teiHeader, model.resourceLike*, ( TEI | teiCorpus )+; that is, the <TEI> (or nested <teiCorpus>) bit should always be required, not optional as we have now, and maybe
b) a Schematron rule should have said “no <text> child of <teiCorpus>”.
However, @jamescummings, note that you can’t have alternating <teiHeader> and <text> elements inside a <teiCorpus>, even with this current bad content model.
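For concreteness, the content model in (a) might be written in present-day TEI ODD roughly as below. This is only a sketch: the thread gives the model in shorthand, and the <content> wrapper, the occurrence attributes, and the namespace declaration are assumptions about how it would be serialised, not text taken from the teiCorpus spec.

```xml
<!-- Sketch only: teiHeader, model.resourceLike*, ( TEI | teiCorpus )+ -->
<content xmlns="http://www.tei-c.org/ns/1.0">
  <sequence>
    <!-- exactly one corpus-level header -->
    <elementRef key="teiHeader"/>
    <!-- zero or more corpus-level resources (the "before-TEI portion") -->
    <classRef key="model.resourceLike" minOccurs="0" maxOccurs="unbounded"/>
    <!-- at least one member document or nested corpus, always required -->
    <alternate minOccurs="1" maxOccurs="unbounded">
      <elementRef key="TEI"/>
      <elementRef key="teiCorpus"/>
    </alternate>
  </sequence>
</content>
```

Note that this only makes the trailing <TEI> or <teiCorpus> mandatory; whether <text> is still admitted before it depends on the membership of model.resourceLike, which is why (b) is listed separately.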
I admittedly never(?) used teiCorpus but always thought of it as a pure wrapper for TEI or teiCorpus elements – and this is how I understand the remark: "Must contain one TEI header for the corpus, and a series of TEI, one for each text."
Yet, in 2.4.0 it was decided to "allow a <teiCorpus> to have <facsimile> and <sourceDoc> children, like <TEI> (per FR 456)".
What do we do with this? I think it was agreed to add model.resourceLike to the content model of <teiCorpus>; hence we should alter the remark to reflect this. Second, when model.resourceLike was added to <teiCorpus>, <text> was not a member of this class, and I think (reading the original proposal) the idea was never to allow a <teiCorpus> with no following <TEI> element. So, I completely agree with @sydb's proposed content model :)
F2F subgroup discussion:
Two main issues in teiCorpus content model:
Currently, structures like the following (a teiCorpus with no nested TEI or teiCorpus children) are perfectly valid:
<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    ...
  </teiHeader>
  <text>
    ...
  </text>
</teiCorpus>
Subgroup suggests changing the content model to exclude text from teiCorpus and prescribe one or more TEI | teiCorpus children
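If the exclusion of text were enforced by a constraint rather than by changing class membership, the Schematron rule floated earlier in this thread could look something like the sketch below. The message wording and the choice of a standalone schema are assumptions made for illustration; in the Guidelines source such a rule would presumably be embedded in the element's spec rather than shipped separately.

```xml
<!-- Sketch only: report any <text> that is a direct child of <teiCorpus> -->
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <sch:ns prefix="tei" uri="http://www.tei-c.org/ns/1.0"/>
  <sch:pattern>
    <sch:rule context="tei:teiCorpus">
      <!-- fires when the test is true, i.e. when a tei:text child exists -->
      <sch:report test="tei:text">
        A text element should not be a direct child of teiCorpus;
        wrap it in a TEI element with its own teiHeader.
      </sch:report>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

Applied to the header-plus-text structure shown above, this rule would report the <text> child; a <text> nested inside a <TEI> element would not be reported, because the rule only tests direct children of <teiCorpus>.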
I think most of the work has been done, except the proposed exclusion of <text> as a direct child of <teiCorpus> (see comment above).
To finish off this ticket I'd argue for neither altering the content model nor adding a Schematron constraint, but rather updating the description of <teiCorpus> to state: "contains the whole of a TEI encoded corpus, comprising a single corpus header and one or more TEI elements, optionally including other corpus related resources as direct children."
Council subgroup thinks, although we have addressed the OP complaint, we should (at least for now) go with @peterstadler’s suggestion to not exclude <text> (nor <sourceDoc>) from the content of <teiCorpus>, but should provide a health warning about it. Something like
“For historical reasons, <text> and <sourceDoc> are permitted as valid children of <teiCorpus> but this use is discouraged.”
We are not entirely sure where this warning should go, though. And we note that the <remarks> of <teiCorpus> needs to be updated, too (in addition to the <desc>).
@sydb we should say why something is discouraged, rather than just leaving a mysterious statement.
F2F subgroup discussed this and suggests removing the note in teiCorpus spec and replacing it with a "positive encouragement" to use teiCorpus in a way that's aligned with its original concept, perhaps along these lines:
teiCorpus is envisioned as an element to represent composite texts, for which their systematic collection, standardized preparation, and common markup make it useful to treat the entire corpus of individual texts as a unit. Therefore in practical encoding teiCorpus aggregates a shared collection header with a number of TEI resources (represented as nested TEI or teiCorpus elements). While, for historical reasons, text and sourceDoc are permitted as valid children of teiCorpus, their use in this position is strongly advised against without a specific justification.
Btw: both Guidelines and teiCorpus reference page specifically state
teiCorpus contains the whole of a TEI encoded corpus, comprising a single corpus header and one or more TEI elements, each containing a single text header and a text.
which is no longer true with changes to the teiCorpus data model in past years
VF2F suggestion: alter the ref page description just to:
… contains the whole of a TEI encoded corpus,
usually comprising a single corpus header
and one or more <gi>TEI</gi> elements,
each of which contains a text and a header for that text.
The proposed wording above needs some stylistic/linguistic attention. What is "practical encoding", for example? Do the Guidelines ever recommend impractical encoding? And why say "aggregate" when "combine" will do the job? But chiefly it's not "for historical reasons" that teiCorpus currently permits the madness of direct text or sourceDoc children; it's because the content model has been carelessly modified and poorly implemented, for reasons which continue to elude me.
Heh-heh. While I agree with you, @lb42, that the phrase “practical encoding” should probably be changed, seems to me the Guidelines often recommend impractical encoding 😀
And yes, “combine” is probably better than “aggregate”.
But while I agree completely that <text> and <sourceDoc> snuck into the content of <teiCorpus> due to careless class and content model manipulation, I think explicitly saying that in the Guidelines themselves is probably not an ideal way to build confidence in the user community. And these changes did occur years ago,[1] so “historical reasons” is not wrong, even though “historical and embarrassingly silly reasons” might be more precise.
[1] Content of <teiCorpus> changed to include model.resourceLike in 2013-06 by @sebastianrahtz; <text> changed to be a member of model.resourceLike 2016-02, merged by @hcayless (not sure who made change).
So, I've been digging into this, and I'm not sure anymore what the fuss is about. Sebastian proposed, back in 2013, that there were circumstances where you might want a <facsimile> at the corpus level rather than at the document level. His use case was a map of a graveyard, while the corpus contained <TEI> elements for each recorded grave inscription. <sourceDoc> and <fsDecl> were added by extension, without much discussion.
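A skeletal picture of that use case, as I read it, is sketched below. The file name, the comments, and the elided content are placeholders invented for illustration, and the headers are left empty of real metadata, so this shows only the shape of the document and is not a schema-valid instance.

```xml
<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <!-- corpus-level metadata, elided -->
  </teiHeader>
  <!-- one map covering the whole graveyard; the URL is a placeholder -->
  <facsimile>
    <graphic url="graveyard-map.png"/>
  </facsimile>
  <!-- one TEI element per recorded inscription -->
  <TEI>
    <teiHeader>
      <!-- metadata for this inscription, elided -->
    </teiHeader>
    <text>
      <body>
        <p>...</p>
      </body>
    </text>
  </TEI>
  <!-- further TEI elements, one per inscription -->
</teiCorpus>
```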
Two thoughts follow from this: 1) <sourceDoc> and <text> really operate at the same "level." So if the former can appear in <teiCorpus>, why not the latter? 2) I don't see what would be wrong with having a prose discussion of the corpus at the corpus level, and I would put that thing in a <text>. I am already finding it useful to have a <TEI> document that can contain both <text> and <TEI> elements.
Maybe the best thing to do is not to be squeamish about it but to explain what it would mean to have <text> as a direct child of <teiCorpus>. To be quite honest, I'm having a far easier time imagining how to explain that than I would <sourceDoc> in a similar position.
The fuss is, as Syd says, because while there was quite a consensus in favour of allowing a teiCorpus to contain facsimile, and arguably sourceDoc, the addition of <text> to model.resourceLike has had the side effect of allowing it also to contain a <text>. This is problematic because it means there are now two places for me to put header information for this <text> (either somewhere in the teiCorpus/teiHeader, or in its own teiHeader). Unless of course I do the wrong thing and just don't supply any header information for it at all. Maybe @hcayless could elaborate on what sort of "prose discussion of the corpus" is envisaged and why this wouldn't be in the teiCorpus header? If it's something like "the corpus manual", surely to goodness that would be a classic TEI document with its own teiHeader.
What I'm imagining is a <text> that is a property of the TEI Corpus, but not a member document. To take Sebastian's original use case, I could imagine writing an essay about the history of the project, the graveyard in question, etc. I will happily grant you that the obvious way to include such a thing at the time he made the request would have been as its own TEI document, but since that's no longer a requirement, it seems perfectly natural to me just to put it in a <text> directly inside <teiCorpus>.
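To make the arrangement being proposed concrete, a corpus-level essay might be laid out as in the sketch below. The div type, the heading, and the elided content are invented for illustration, and the headers are left empty, so this shows the shape only; whether it validates depends on which release's content model is in force.

```xml
<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <!-- corpus-level metadata, which would also have to describe the essay -->
  </teiHeader>
  <!-- prose about the corpus, not a member document of it -->
  <text>
    <body>
      <div type="essay">
        <head>History of the project</head>
        <p>...</p>
      </div>
    </body>
  </text>
  <!-- member documents, each with a header of its own -->
  <TEI>
    <teiHeader>
      <!-- metadata for one inscription, elided -->
    </teiHeader>
    <text>
      <body>
        <p>...</p>
      </body>
    </text>
  </TEI>
</teiCorpus>
```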
I am glad you're granting the point that the obvious way to include a free standing textual document in a teiCorpus is as a <TEI> element, not as a <text>! The fact that somewhat careless modification has now made it possible to do it another way isn't an argument for that being a good way to do it though. Your distinction between something which is "a property of the TEI corpus" and something which is a "member document" is interesting though: your "essay about the project" is surely a bit of metadata (hmm there's even a TEI header element for it) rather than a property of the data aggregate isn't it? I say aggregate because although this discussion specifies <teiCorpus>, it applies equally to your newfangled recursive <TEI> element.
I assume you're talking about <projectDesc>? That would be fine, as far as it goes, but what if I wanted to include a proper article, or something even longer? What if it needed chapters? Let me try to be precise, and see if it gets us anywhere:
If I have an element /teiCorpus/teiHeader, that teiHeader is presumably about the corpus, not so much about any one of its members. A teiCorpus/TEI/teiHeader is about the TEI document, which is a part of the TEI corpus. I think we'd probably all agree that /TEI : /TEI/teiHeader :: /teiCorpus : /teiCorpus/teiHeader.
What then is the relationship between /TEI and /TEI/text? We might not like to say that the text is about the TEI document. Probably better to say it provides the substance of the TEI document. SourceDoc and facsimile do the same sort of thing in their own ways.
But do we have to say that the relationship /TEI : /TEI/text :: /teiCorpus : /teiCorpus/text? I think not...I would say that if you have a /teiCorpus/text, then that text is not the substance of the teiCorpus (that would be its TEI children), but is, like the teiHeader, about the teiCorpus.
As things are currently presented in the GL, there is no difference in semantics between a teiCorpus containing two TEI elements and a TEI element containing two TEI elements. There is no explanation of why you might (or might not) decide to use a TEI rather than a teiCorpus as the outermost element of your aggregate. There is no discussion of the different intentions we may assume for <text> elements according to the various contexts in which they may now pop up. This is not good.
I entirely agree with you that there is a difference between the relationship "is-about" and the relationship "is-part-of" : that's precisely why we have a teiHeader and a lot more beside. The distinction is blurry though. Digital editions in particular love to aggregate together all sorts of bits and pieces -- translations, images, commentary ancient and modern, maps, teaching notes ... and yes, metadata too. But that isn't really the issue. For me, one of the most important things the TEI did was to insist that however you chose to define your "text" , it should be formally and structurally distinct from your "text-description", and BOTH SHOULD BE THERE.
Go back to your example of the "proper article" you want to include as a component in your "text" (or corpus). Is it unreasonable of me to wish you'd make it a really proper article, with its own TEI Header, so I can check out its revision history, encoding description, and all the rest?
@lb42 It's not unreasonable, but I might make the argument that (e.g.) a contextualizing essay is not a member of the corpus, but an adjunct to it. To return again to the corpus of grave inscriptions, if I was doing that project, I might feel weird about including things that weren't grave inscriptions as members of the corpus. With text as a child of teiCorpus, I have the option of distinguishing prose associated with the corpus from documents that are members of the corpus.
My argument is simply that it's not obvious to me that /teiCorpus/text is a bad thing to be avoided and never spoken of. There are valid reasons why you might want to use it and we should explain those.
Nothing to stop you having a teiCorpus containing one nested teiCorpus (contains only grave inscriptions) alongside another one (contains contextual documents).
To make clearer why I think teiCorpus/text is a bad idea, maybe I should rephrase it as "text without any sibling teiHeader". That's what I really don't like.
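The nested arrangement suggested above might look roughly like the sketch below, in which every <text> keeps a sibling <teiHeader>. The comments and elided content are placeholders invented for illustration, and the headers are left empty of real metadata, so this is a picture of the structure and not a schema-valid instance.

```xml
<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <!-- metadata for the project as a whole, elided -->
  </teiHeader>
  <!-- sub-corpus 1: grave inscriptions only -->
  <teiCorpus>
    <teiHeader>
      <!-- metadata shared by the inscriptions, elided -->
    </teiHeader>
    <TEI>
      <teiHeader>
        <!-- metadata for one inscription, elided -->
      </teiHeader>
      <text>
        <body>
          <p>...</p>
        </body>
      </text>
    </TEI>
  </teiCorpus>
  <!-- sub-corpus 2: contextual documents, such as a project essay -->
  <teiCorpus>
    <teiHeader>
      <!-- metadata shared by the contextual documents, elided -->
    </teiHeader>
    <TEI>
      <teiHeader>
        <!-- the essay's own revision history, responsibility statements, etc. -->
      </teiHeader>
      <text>
        <body>
          <p>...</p>
        </body>
      </text>
    </TEI>
  </teiCorpus>
</teiCorpus>
```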
It's clear that you don't like it :-). But that ship has sailed. Really, it sailed when sourceDoc was allowed into teiCorpus. I see sourceDoc and text as moral equivalents--they both represent the body of an edition. Arguably facsimile does too, for that matter.
I see our choices here as
teiCorpus/text violates the spirit, even if not the letter, of the law.
My preferences would be 3, 1, 2.
The requirement to have a teiHeader along with a text element is a pretty fundamental one. You say that the ship has sailed, but it seems to be going into dangerous and uncharted waters.
My (pretty strong) order of preference is 1, 2, 3. And I would be quite happy to ostracize <sourceDoc> (and maybe even <facsimile>) out of the realm of not having a sibling <teiHeader>, too.
That is, although I can see the lure of @hcayless’s musing that there might be an occasional use for a text resource that is not described by its own <teiHeader>, it is the Sirens’ lure.
(I disagree with @lb42, though: these waters are dangerous, but not uncharted. 😀 Most every other text representation system on the planet has made the mistake of not linking a representation’s metadata tightly enough with the representation itself. This is something TEI got right, and should try to stick to. Just because I am vaguely in favor of independent headers does not mean I am at all in favor of independent <text>s (or <sourceDoc>s, and maybe <facsimile>s).)
I’m a little puzzled as to why you all think these are unmoored text elements. Surely their sibling teiHeader applies to them as usual—it’s just doing other stuff too, no?
In the use case originally proposed, we have a bunch of transcribed inscriptions, and a map. At the corpus level, it's clear that (say) generic statements about transcription policy belong in the teiCorpus/teiHeader, while individual titles, or even responsibility statements or revision descriptions should be supplied at the teiCorpus/TEI/teiHeader level. Sebastian's argument was that the map (because it related to the whole collection) should have its metadata at the teiCorpus/teiHeader level. I think I said at the time that this logic implied that if you had individual detail maps at the item level as well then these would need to go in the appropriate teiCorpus/TEI unless you wanted to commit yourself to extensive use of the decls mechanism (which no-one but me has ever understood). Anyway, Hugh's argument, if I understand correctly, is that a <text> containing an article about the whole shebang stands in the same relation to the rest as the map, and that therefore its metadata should go into the teiCorpus/teiHeader, just as the metadata for the teiCorpus/facsimile containing the map does. This has a certain plausibility, but I think it evaporates as soon as you start thinking about what would happen in practice. Let's say the corpus contains the version of the article we wrote for the grant. Now the project is finished, we're going to revise the article quite a lot, and maybe tweak some of the responsibility statements, take out some specious claims and add others etc. etc. We might wish to keep the old version as part of the history of the project, we might wish to wipe it from the face of the earth. Either way we have a lot of metadata mangling to do, all of which relates solely to this wretched text and none of which has to do with the components of the corpus. At the same time, we might decide to remove or update some of the inscriptions. Now we are going to have a mixture of change elements in the revisionDesc, some relating to the corpus composition, some relating to its documentation. And just to make life more exciting, let's suppose one of our project collaborators is of Syd's persuasion (and mine) and has therefore been storing the documentation for which they are responsible as a separate TEI element inside the teiCorpus. There is a word for this and it's not a pretty one.
I have sympathies for @hcayless's option 1 "Try to put the toothpaste back in the tube, and deprecate/remove text and sourceDoc from teiCorpus", as well. While I see the point in the graveyard example (for allowing <facsimile> as a direct child of <teiCorpus>) I'd be even more rigid (now) and say that every resource should be accompanied by its own teiHeader. Allowing <facsimile> to be free floating/pluggable is just laziness by text-oriented people …
Yet, it seems we are not reaching a consensus here before the upcoming release so I'll remove the ticket from the milestone and degrade it to "needs discussion".
Just expressing a small worry here that delaying discussion of this means increased difficulty for us further down the road. If text as a direct child of teiCorpus is both legal and has reasonable use cases, then people are likely to do it and then become annoyed at having it taken away. If we think we're going to deprecate these things, it should happen sooner rather than later.
For my part, I can't really follow the logic behind the arguments for tighter restrictions. Lou's point about increased isolation of components being a good way to manage increased complexity is a sensible one, but I don't think that applies to every possible situation.
With the introduction of model.describedResource the original issue is resolved: the guidelines prose and data model regarding teiCorpus are now consistent.
All the guidelines, all the examples, and the note in the spec for the element say that a <teiCorpus> element contains (“must contain” in the note) <TEI> elements. But the data model allows <teiCorpus> to contain no <TEI> elements and a <text> element instead.