TEIC / TEI

The Text Encoding Initiative Guidelines
https://www.tei-c.org

teiCorpus: make guidelines and data model consistent #1823

Closed mccaskey closed 2 years ago

mccaskey commented 6 years ago

All the guidelines, all the examples, and the note in the spec for the element say that a <teiCorpus> element contains (“must contain” in the note) <TEI> elements. But the data model allows <teiCorpus> to contain no <TEI> elements and a <text> element instead.

jamescummings commented 6 years ago

I think the Guidelines are correct (and should be made more clear) and I think the addition of model.resourceLike to the content model as a whole here is wrong. If one is going to use teiCorpus to then have some alternating teiHeader and text elements in it, it is far better to wrap those in a TEI element and hardly any problem.

I believe this is an error introduced when we were making changes to the membership of model.resourceLike and added text to it in release 3.0.0. Compare: http://www.tei-c.org/Vault/P5/3.0.0/doc/tei-p5-doc/en/html/ref-model.resourceLike.html and http://www.tei-c.org/Vault/P5/2.9.0/doc/tei-p5-doc/en/html/ref-model.resourceLike.html

While its use inside TEI makes sense, its use in the content model of teiCorpus doesn't. (IMHO).

sydb commented 6 years ago

I think I agree w/ @jamescummings, here. I think adding model.resourceLike+ to the before-TEI portion of the content model (which was done in 2.4.0) made sense, but a) The content model should have been changed to teiHeader, model.resourceLike*, ( TEI | teiCorpus )+ — that is, the <TEI> (or nested <teiCorpus>) bit should always be required, not optional as we have now, and maybe b) a Schematron rule should have said “no <text> child of <teiCorpus>”.
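For readers who want to see what that proposal might look like in an ODD customization, here is a minimal sketch. It is not the change that was eventually made to the Guidelines: the `constraintSpec` identifier and the message text are invented for illustration, and the Schematron test assumes the `tei` prefix is bound to the TEI namespace by whatever processes the ODD.

```xml
<elementSpec xmlns="http://www.tei-c.org/ns/1.0"
             xmlns:sch="http://purl.oclc.org/dsdl/schematron"
             ident="teiCorpus" mode="change">
  <!-- (a) teiHeader, model.resourceLike*, ( TEI | teiCorpus )+ -->
  <content>
    <sequence>
      <elementRef key="teiHeader"/>
      <classRef key="model.resourceLike" minOccurs="0" maxOccurs="unbounded"/>
      <alternate minOccurs="1" maxOccurs="unbounded">
        <elementRef key="TEI"/>
        <elementRef key="teiCorpus"/>
      </alternate>
    </sequence>
  </content>
  <!-- (b) hypothetical rule: no text child of teiCorpus -->
  <constraintSpec ident="no-text-child-of-teiCorpus" scheme="schematron">
    <constraint>
      <sch:rule context="tei:teiCorpus">
        <sch:report test="tei:text">A text element should not be a direct child of teiCorpus; wrap it in a TEI element with its own teiHeader.</sch:report>
      </sch:rule>
    </constraint>
  </constraintSpec>
</elementSpec>
```

The content model alone makes at least one TEI or nested teiCorpus child mandatory; the Schematron rule is what would keep text out while still letting other members of model.resourceLike in.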

However, @jamescummings, note that you can’t have alternating <teiHeader> and <text> elements inside a <teiCorpus>, even with this current bad content model.

peterstadler commented 5 years ago

I admittedly never(?) used teiCorpus but always thought of it as a pure wrapper for TEI or teiCorpus elements – and this is how I understand the remark: "Must contain one TEI header for the corpus, and a series of TEI, one for each text."

Yet, in 2.4.0 it was decided to "allow a <teiCorpus> to have <facsimile> and <sourceDoc> children, like <TEI> (per FR 456)".

What do we do with this? I think it was agreed on adding model.resourceLike to the content model of <teiCorpus> --> hence we should alter the remark to reflect this. Second, when model.resourceLike was added to <teiCorpus>, <text> was not a member of this class and I think (reading the original proposal) it was not the idea to have no following <TEI> element. So, I completely agree with @sydb's proposed content model :)

tuurma commented 5 years ago

F2F subgroup discussion:

Two main issues in teiCorpus content model:

Currently, structures like the following (a teiCorpus with no nested TEI or teiCorpus children) are perfectly valid:

<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    ...
  </teiHeader>
  <text>
    ...
  </text>
</teiCorpus>

Subgroup suggests changing the content model to exclude text from teiCorpus and prescribe one or more TEI | teiCorpus children
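By way of contrast, a corpus shaped the way the subgroup suggests might look roughly like the sketch below. The comments stand in for real content, and the number of TEI children is arbitrary.

```xml
<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <!-- metadata describing the corpus as a whole -->
  </teiHeader>
  <TEI>
    <teiHeader>
      <!-- metadata for the first text -->
    </teiHeader>
    <text>
      <!-- the first text -->
    </text>
  </TEI>
  <TEI>
    <teiHeader>
      <!-- metadata for the second text -->
    </teiHeader>
    <text>
      <!-- the second text -->
    </text>
  </TEI>
</teiCorpus>
```

Here every text has a sibling teiHeader of its own, and the corpus header carries only what is common to the collection.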

peterstadler commented 4 years ago

I think most of the work has been done, except the proposed exclusion of <text> as a direct child of <teiCorpus> (see comment above). To finish off this ticket I'd argue for not altering the content model nor adding a schematron constraint but rather update the description of <teiCorpus> to state: "contains the whole of a TEI encoded corpus, comprising a single corpus header and one or more TEI elements, optionally including other corpus related resources as direct children."

sydb commented 3 years ago

Council subgroup thinks, although we have addressed the OP complaint, we should (at least for now) go with @peterstadler’s suggestion to not exclude <text> (nor <sourceDoc>) from the content of <teiCorpus>, but should provide a health warning about it. Something like

“For historical reasons, <text> and <sourceDoc> are permitted as valid children of <teiCorpus> but this use is discouraged.”

We are not entirely sure where this warning should go, though. And we note that the <remarks> of <teiCorpus> needs to be updated, too (in addition to the <desc>).

npcole commented 3 years ago

@sydb we should say why something is discouraged, rather than just leaving a mysterious statement.

tuurma commented 3 years ago

F2F subgroup discussed this and suggests removing the note in teiCorpus spec and replacing it with a "positive encouragement" to use teiCorpus in a way that's aligned with its original concept, perhaps along these lines:

teiCorpus is envisioned as an element to represent composite texts, for which their systematic collection, standardized preparation, and common markup make it useful to treat the entire corpus of individual texts as a unit. Therefore in practical encoding teiCorpus aggregates a shared collection header with a number of TEI resources (represented as nested TEI or teiCorpus elements). While, for historical reasons, text and sourceDoc are permitted as valid children of teiCorpus, their use in this position is strongly advised against without a specific justification

tuurma commented 3 years ago

Btw: both Guidelines and teiCorpus reference page specifically state

teiCorpus contains the whole of a TEI encoded corpus, comprising a single corpus header and one or more TEI elements, each containing a single text header and a text.

which is no longer true with changes to the teiCorpus data model in past years

ebeshero commented 3 years ago

VF2F suggestion: alter the ref page description just to:

… contains the whole of a TEI encoded corpus, 
usually comprising a single corpus header 
and one or more <gi>TEI</gi> elements, 
each of which contains a text and a header for that text.

lb42 commented 3 years ago

The proposed wording above needs some stylistic/linguistic attention. What is "practical encoding", for example? do the Guidelines ever recommend impractical encoding? And why say "aggregate" when "combine" will do the job? But chiefly it's not "for historical reasons" that teiCorpus currently permits the madness of direct text or sourceDoc children; it's because the content model has been carelessly modified and poorly implemented, for reasons which continue to elude me.

sydb commented 3 years ago

Heh-heh. While I agree with you, @lb42, that the phrase “practical encoding” should probably be changed, seems to me the Guidelines often recommend impractical encoding 😀

And yes, “combine” is probably better than “aggregate”.

But while I agree completely that <text> and <sourceDoc> snuck into the content of <teiCorpus> due to careless class and content model manipulation, I think explicitly saying that in the Guidelines themselves is probably not an ideal way to build confidence in the user community. And these changes did occur years ago,[1] so “historical reasons” is not wrong, even though “historical and embarrassingly silly reasons” might be more precise.


[1] Content of <teiCorpus> changed to include model.resourceLike in 2013-06 by @sebastianrahtz; <text> changed to be a member of model.resourceLike 2016-02, merged by @hcayless (not sure who made change).

hcayless commented 3 years ago

So, I've been digging into this, and I'm not sure anymore what the fuss is about. Sebastian proposed, back in 2013, that there were circumstances where you might want a <facsimile> at the corpus level rather than at the document level. His use case was a map of a graveyard, while the corpus contained <TEI> elements for each recorded grave inscription. <sourceDoc> and <fsDecl> were added by extension, without much discussion.
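A rough sketch of that graveyard use case, assuming the facsimile sits between the corpus header and the member documents as the thread describes. The file name `graveyard-map.png` is a made-up placeholder, and the comments stand in for real content.

```xml
<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <!-- metadata for the whole corpus, including the map -->
  </teiHeader>
  <facsimile>
    <!-- one map of the graveyard, shared by every inscription -->
    <graphic url="graveyard-map.png"/>
  </facsimile>
  <TEI>
    <teiHeader>
      <!-- metadata for one grave inscription -->
    </teiHeader>
    <text>
      <!-- transcription of that inscription -->
    </text>
  </TEI>
  <!-- further TEI elements, one per recorded inscription -->
</teiCorpus>
```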

Two thoughts follow from this: 1) <sourceDoc> and <text> really operate at the same "level." So if the former can appear in <teiCorpus>, why not the latter? 2) I don't see what would be wrong with having a prose discussion of the corpus at the corpus level, and I would put that thing in a <text>. I am already finding it useful to have a <TEI> document that can contain both <text> and <TEI> elements.

Maybe the best thing to do is not to be squeamish about it but to explain what it would mean to have <text> as a direct child of <teiCorpus>. To be quite honest, I'm having a far easier time imagining how to explain that than I would <sourceDoc> in a similar position.

lb42 commented 3 years ago

The fuss is, as Syd says, because while there was quite a consensus in favour of allowing a teiCorpus to contain facsimile, and arguably sourceDoc, the addition of <text> to model.resourceLike has had the side effect of allowing it also to contain a <text>. This is problematic because it means there are now two places for me to put header information for this <text> (either somewhere in the teiCorpus/teiHeader, or in its own teiHeader). Unless of course I do the wrong thing and just don't supply any header information for it at all. Maybe @hcayless could elaborate on what sort of "prose discussion of the corpus" is envisaged and why this wouldn't be in the teiCorpus header? If it's something like "the corpus manual", surely to goodness that would be a classic TEI document with its own teiHeader.

hcayless commented 3 years ago

What I'm imagining is a <text> that is a property of the TEI Corpus, but not a member document. To take Sebastian's original use case, I could imagine writing an essay about the history of the project, the graveyard in question, etc. I will happily grant you that the obvious way to include such a thing at the time he made the request would have been as its own TEI document, but since that's no longer a requirement, it seems perfectly natural to me just to put it in a <text> directly inside <teiCorpus>.

lb42 commented 3 years ago

I am glad you're granting the point that the obvious way to include a free standing textual document in a teiCorpus is as a <TEI> element, not as a <text> ! The fact that somewhat careless modification has now made it possible to do it another way isn't an argument for that being a good way to do it though. Your distinction between something which is "a property of the TEI corpus" and something which is a "member document" is interesting though: your "essay about the project" is surely a bit of metadata (hmm there's even a TEI header element for it) rather than a property of the data aggregate isn't it? I say aggregate because although this discussion specifies <teiCorpus>, it applies equally to your newfangled recursive <TEI> element.

hcayless commented 3 years ago

I assume you're talking about <projectDesc>? That would be fine, as far as it goes, but what if I wanted to include a proper article, or something even longer? What if it needed chapters? Let me try to be precise, and see if it gets us anywhere:

If I have an element /teiCorpus/teiHeader, that teiHeader is presumably about the corpus, not so much about any one of its members. A teiCorpus/TEI/teiHeader is about the TEI document, which is a part of the TEI corpus. I think we'd probably all agree that /TEI : /TEI/teiHeader :: /teiCorpus : /teiCorpus/teiHeader.

What then is the relationship between /TEI and /TEI/text? We might not like to say that the text is about the TEI document. Probably better to say it provides the substance of the TEI document. SourceDoc and facsimile do the same sort of thing in their own ways.

But do we have to say that the relationship /TEI : /TEI/text :: /teiCorpus : /teiCorpus/text? I think not...I would say that if you have a /teiCorpus/text, then that text is not the substance of the teiCorpus (that would be its TEI children), but is, like the teiHeader, about the teiCorpus.
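To make the reading above concrete, here is a sketch of a corpus in which /teiCorpus/text is about the corpus while the TEI children are its substance. It assumes the content model discussed earlier in this thread, where text is allowed before the TEI children; the essay and inscriptions are placeholders.

```xml
<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <!-- about the corpus -->
  </teiHeader>
  <text>
    <!-- on this reading, also about the corpus:
         for instance an essay on the project and the graveyard -->
  </text>
  <TEI>
    <!-- substance of the corpus: a member document -->
    <teiHeader>
      <!-- about this member document -->
    </teiHeader>
    <text>
      <!-- substance of this member document -->
    </text>
  </TEI>
</teiCorpus>
```

Note that the same element name, text, plays two different roles here depending on its parent, which is the point under dispute.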

lb42 commented 3 years ago

As things are currently presented in the GL, there is no difference in semantics between a teiCorpus containing two TEI elements and a TEI element containing two TEI elements. There is no explanation of why you might (or might not) decide to use a TEI rather than a teiCorpus as the outermost element of your aggregate. There is no discussion of the different intentions we may assume for <text> elements according to the various contexts in which they may now pop up. This is not good.

I entirely agree with you that there is a difference between the relationship "is-about" and the relationship "is-part-of" : that's precisely why we have a teiHeader and a lot more beside. The distinction is blurry though. Digital editions in particular love to aggregate together all sorts of bits and pieces -- translations, images, commentary ancient and modern, maps, teaching notes ... and yes, metadata too. But that isn't really the issue. For me, one of the most important things the TEI did was to insist that however you chose to define your "text" , it should be formally and structurally distinct from your "text-description", and BOTH SHOULD BE THERE.

Go back to your example of the "proper article" you want to include as a component in your "text" (or corpus). Is it unreasonable of me to wish you'd make it a really proper article, with its own TEI Header, so I can check out its revision history, encoding description, and all the rest?

hcayless commented 3 years ago

@lb42 It's not unreasonable, but I might make the argument that (e.g.) a contextualizing essay is not a member of the corpus, but an adjunct to it. To return again to the corpus of grave inscriptions, if I was doing that project, I might feel weird about including things that weren't grave inscriptions as members of the corpus. With text as a child of teiCorpus, I have the option of distinguishing prose associated with the corpus from documents that are members of the corpus.

My argument is simply that it's not obvious to me that /teiCorpus/text is a bad thing to be avoided and never spoken of. There are valid reasons why you might want to use it and we should explain those.

lb42 commented 3 years ago

Nothing to stop you having a teiCorpus containing one nested teiCorpus (contains only grave inscriptions) alongside another one (contains contextual documents).
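One possible shape for that alternative is sketched below, with comments standing in for real content. Each nested teiCorpus carries its own header, and the essay becomes an ordinary TEI document.

```xml
<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <!-- metadata for the project as a whole -->
  </teiHeader>
  <teiCorpus>
    <teiHeader>
      <!-- metadata for the inscriptions sub-corpus -->
    </teiHeader>
    <TEI>
      <teiHeader>
        <!-- metadata for one inscription -->
      </teiHeader>
      <text>
        <!-- transcription -->
      </text>
    </TEI>
  </teiCorpus>
  <teiCorpus>
    <teiHeader>
      <!-- metadata for the contextual documents sub-corpus -->
    </teiHeader>
    <TEI>
      <teiHeader>
        <!-- metadata for the contextualizing essay -->
      </teiHeader>
      <text>
        <!-- the essay itself -->
      </text>
    </TEI>
  </teiCorpus>
</teiCorpus>
```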

To make clearer why I think teiCorpus/text is a bad idea, maybe I should rephrase it as "text without any sibling teiHeader". That's what I really don't like.

hcayless commented 3 years ago

It's clear that you don't like it :-). But that ship has sailed. Really, it sailed when sourceDoc was allowed into teiCorpus. I see sourceDoc and text as moral equivalents--they both represent the body of an edition. Arguably facsimile does too, for that matter.

I see our choices here as

  1. Try to put the toothpaste back in the tube, and deprecate/remove text and sourceDoc from teiCorpus.
  2. Write documentation arguing that teiCorpus/text violates the spirit, even if not the letter, of the law.
  3. Accept that we have an alternative way to provide text alongside corpus members and document it.

My preferences would be 3, 1, 2.

lb42 commented 3 years ago

The requirement to have a teiHeader along with a text element is a pretty fundamental one. You say that the ship has sailed, but it seems to be going into dangerous and uncharted waters.

sydb commented 3 years ago

My (pretty strong) order of preference is 1, 2, 3. And I would be quite happy to ostracize <sourceDoc> (and maybe even <facsimile>) out of the realm of not having a sibling <teiHeader>, too. That is, although I can see the lure of @hcayless’s musing that there might be an occasional use for a text resource that is not described by its own <teiHeader>, it is the Sirens’ lure. (I disagree with @lb42, though: these waters are dangerous, but not uncharted. 😀 Most every other text representation system on the planet has made the mistake of not linking a representation’s metadata tightly enough with it itself. This is something TEI got right, and should try to stick to. Just because I am vaguely in favor of independent headers does not mean I am at all in favor of independent <text>s (or <sourceDoc>s, and maybe <facsimile>s).)

hcayless commented 3 years ago

I’m a little puzzled as to why you all think these are unmoored text elements. Surely their sibling teiHeader applies to them as usual—it’s just doing other stuff too, no?

lb42 commented 3 years ago

In the use case originally proposed, we have a bunch of transcribed inscriptions, and a map. At the corpus level, it's clear that (say) generic statements about transcription policy belong in the teiCorpus/teiHeader, while individual titles, or even responsibility statements or revision descriptions should be supplied at the teiCorpus/TEI/teiHeader level. Sebastian's argument was that the map (because it related to the whole collection) should have its metadata at the teiCorpus/teiHeader level. I think I said at the time that this logic implied that if you had individual detail maps at the item level as well then these would need to go in the appropriate teiCorpus/TEI unless you wanted to commit yourself to extensive use of the decls mechanism (which no-one but me has ever understood). Anyway, Hugh's argument, if I understand correctly, is that a <text> containing an article about the whole shebang stands in the same relation to the rest as the map, and that therefore its metadata should go into the teiCorpus/teiHeader, just as the metadata for the teiCorpus/facsimile containing the map does. This has a certain plausibility, but I think it evaporates as soon as you start thinking about what would happen in practice. Let's say the corpus contains the version of the article we wrote for the grant. Now the project is finished, we're going to revise the article quite a lot, and maybe tweak some of the responsibility statements, take out some specious claims and add others etc. etc. We might wish to keep the old version as part of the history of the project, we might wish to wipe it from the face of the earth. Either way we have a lot of metadata mangling to do, all of which relates solely to this wretched text and none of which has to do with the components of the corpus. At the same time, we might decide to remove or update some of the inscriptions.
Now we are going to have a mixture of change elements in the revisionDesc, some relating to the corpus composition, some relating to its documentation. And just to make life more exciting, let's suppose one of our project collaborators is of Syd's persuasion (and mine) and has therefore been storing the documentation for which they are responsible as a separate TEI element inside the teiCorpus. There is a word for this and it's not a pretty one.
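The mixture described here might end up looking something like the following corpus-level header. This is an invented illustration: the change descriptions are hypothetical, and the rest of the header is omitted.

```xml
<teiHeader xmlns="http://www.tei-c.org/ns/1.0">
  <!-- fileDesc and other header components omitted -->
  <revisionDesc>
    <!-- about the composition of the corpus -->
    <change>Removed two inscriptions from the corpus.</change>
    <!-- about the essay held in teiCorpus/text -->
    <change>Revised the introductory essay after the end of the project.</change>
    <change>Updated the responsibility statement for the essay.</change>
    <!-- about the composition of the corpus again -->
    <change>Added newly transcribed inscriptions to the corpus.</change>
  </revisionDesc>
</teiHeader>
```

Nothing in the markup itself says which change applies to the corpus and which to its documentation; only the prose does.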

peterstadler commented 3 years ago

I have sympathies for @hcayless's option 1 "Try to put the toothpaste back in the tube, and deprecate/remove text and sourceDoc from teiCorpus", as well. While I see the point in the graveyard example (for allowing <facsimile> as a direct child of <teiCorpus>) I'd be even more rigid (now) and say that every resource should be accompanied by its own teiHeader. Allowing <facsimile> to be free floating/pluggable is just laziness by text-oriented people …

Yet, it seems we are not reaching a consensus here before the upcoming release so I'll remove the ticket from the milestone and degrade it to "needs discussion".

hcayless commented 3 years ago

Just expressing a small worry here that delaying discussion of this means increased difficulty for us further down the road. If text as a direct child of teiCorpus is both legal and has reasonable use cases, then people are likely to do it and then become annoyed at having it taken away. If we think we're going to deprecate these things, it should happen sooner rather than later.

For my part, I can't really follow the logic behind the arguments for tighter restrictions. Lou's point about increased isolation of components being a good way to manage increased complexity is a sensible one, but I don't think that applies to every possible situation.

peterstadler commented 2 years ago

With the introduction of model.describedResource the original issue is resolved: the Guidelines prose and data model regarding teiCorpus are now consistent.