Open martindholmes opened 5 years ago
This would mean changing the actual names of the template files on disk:
tei\templates\TEI P5\Corpus.xml tei\templates\TEI P5\Corpus.properties
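A minimal sketch of what that rename could look like, assuming the two files sit side by side in one template directory and that the new base name would be "Linguistic Corpus" (the label proposed in this ticket). The function name `rename_template` is hypothetical, and whether anything else in the framework refers to the old file names is not checked here.

```python
import tempfile
from pathlib import Path


def rename_template(template_dir, old_stem="Corpus", new_stem="Linguistic Corpus"):
    """Rename <old_stem>.xml and <old_stem>.properties in template_dir.

    Returns the new file names, in the order they were renamed.
    The default stems are assumptions taken from this thread.
    """
    renamed = []
    for ext in (".xml", ".properties"):
        src = Path(template_dir) / f"{old_stem}{ext}"
        if src.exists():
            dst = src.with_name(f"{new_stem}{ext}")
            src.rename(dst)
            renamed.append(dst.name)
    return renamed


# Demonstration in a throwaway directory, so no real framework files are touched.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("Corpus.xml", "Corpus.properties"):
        (Path(tmp) / name).write_text("placeholder")
    result = rename_template(tmp)
    remaining = sorted(p.name for p in Path(tmp).iterdir())
```

Running it against a copy of the templates directory first would show whether the template picker derives its label from the file name alone.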
What is meant by "the full content model" here? The content model you would get if you used tei_all instead? But in that case, what's the point of defining this ODD at all? Obviously, renaming it is not a bad idea, though if you actually read the ODD text it's pretty explicit!
Better yet, it would mean getting rid of such a useless “customization” altogether. (And to be clear, I don’t really mean getting rid of the Corpus template that @raducoravu mentions above, but rather the ridiculous customization upon which it is based.) I call these customizations, which TEI-C makes publicly available and IMHO mostly should not, “template” customizations. See paragraphs 26, 27, and 28 of this article for a list of them. That is to say, in response to @lb42’s question, which he posted while I was writing this:
> what's the point of defining this ODD [tei_corpus.odd] at all?
There is none, it should not be defined. At least not in the P5/Exemplars/ directory.
I think there's value in providing a decent starting-point for people working on specific types of text. There are two advantages: first, those who cannot and will never customize anything are at least not consigned to working with tei_all; and second, those who can and will customize have a decent example/starting point that they can work from.
Those who "never will customize" may also perhaps learn why this is a suboptimal strategy by being forced to confront the limitations of the existing exemplars!
It would be interesting to know how many linguistic corpora have been based on this customization. I've always started with TEI Bare myself, but I fully agree with Martin that 'thematic' customizations are useful, and further, they can indeed constitute selling points for some communities.
In the general spirit of how/where this thread is moving, I would say that, especially given the recent and ongoing developments in ISO-TEI standards, it would be good to offer a (nearly) out-of-the-box modern corpus-linguistic customization as a starting point for linguists.
Would it be acceptable to (a) label this customization as "under reconstruction", (b) not allow it to pop up as the first choice the way it recently did for Marjorie, and (c) initiate work on its thorough update/upgrade? I'm in a pretty good position to offer insights from the point of view of ISO, in a process that includes the CLARIN community (so that it's not one man's proposal but rather a community baby). And I believe that I should be able to allocate time for that in my schedule (I'd need to verify that, of course, but chances are good).
I don't think we should label anything that's part of the TEI release or the Oxygen plugin as "under construction". The first thing to do, I think, is to re-label this as Linguistic Corpus, and then anyone qualified can work on making it better. The second question for me is whether any kind of parallel "Non-linguistic corpus" schema should be available, or whether we just assume people will then choose tei_all.
@martindholmes:
> I think there's value in providing a decent starting-point for people working on specific types of text.
Agreed. But this (P5/Exemplars/tei_corpus.odd and oxygen-tei/frameworks/tei/templates/TEI P5/Corpus.*) is not even remotely close to a decent starting point. It is better characterized as indecent.
@lb42: A good idea, in theory, but what in fact happens is people use the template ODDs in Exemplars/ as end-point schemas. Which is worse, I think.
@bansp:
> it would be good to offer a (nearly) out-of-the-box modern corpus-linguistic customization as a starting point for linguists.
Agreed. If you can find the time, I’d probably prefer almost anything you come up with. (But @martindholmes is right, it’s not OK to label an Exemplar/ as under construction, especially one that isn’t. I think we should just get rid of the current tei_corpus.odd for now, and do without until you or someone else comes up with something better. But I’ve thought we should get rid of these “template” exemplars for years, and it hasn’t happened yet. So I’m not going to hold my breath.)
Good point about "under construction" in a production deliverable.
I expect this table to be edited in-place, here in this comment. (This probably is not the best place for this information, but I’ve already done it, so …)
| name | category | comments | @n | title | author | # nodes | # elements (TEI/all) |
|---|---|---|---|---|---|---|---|
| isofs.odd | demo | just starting point of `<fs>`, `<fLib>`, or `<fvLib>` | isofs | ISO Feature Structures freestanding schema | TEI-C | 76 | 27 / 28 |
| tei_all.odd | necessary | has useful prose description of its limitations | | TEI with maximal setup | SR | 112 | 42 / 42 |
| tei_allPlus.odd | demo | all + SVG + MathML; refers to defunct declarefs module | | TEI with maximal setup, plus external additions | SR | 150 | 51 / 55 |
| tei_bare.odd | sample demo | exemplifies extreme thinning and use of `<specGrp>`s | | TEI Absolutely Bare | LR | 388 | 145 / 145 |
| tei_corpus.odd | template | | | TEI for Linguistic Corpora | SR | 71 | 25 / 25 |
| tei_customization.odd | sample generated | | | TEI ODD Customization for writing TEI ODD Customizations | SB | 20840 | 7369 / 7583 |
| tei_dictionaries.odd | template | | | TEI with minimal setup for dictionaries | | 85 | 30 / 30 |
| tei_docs.odd | template | | | TEI for documentation | LB | 74 | 25 / 25 |
| tei_drama.odd | template | mildly useful: demonstrates attribute deletion (poorly) | | TEI with Drama | SB SR | 146 | 52 / 52 |
| tei_enrich.odd | sample | | enrich | TEI P5 schema for ENRICH | [SR,LB,JC] | 6053 | 1362 / 2108 |
| tei_its.odd | demo | | testminimal | TEI with ITS setup | SR | 124 | 41 / 45 |
| tei_jtei.odd | sample | important | | jTEI input customization | RVB MH | 5597 | 1322 / 1975 |
| tei_lite_fr.odd | sample | | | Encoder pour échanger : une introduction à la TEI | LB MSM | 5531 | 1226 / 2044 |
| tei_lite.odd | sample | | | Encoding for Interchange: an introduction to the TEI | | 6044 | 1283 / 2061 |
| tei_math.odd | template | required for tei_allPlus | | TEI with MathML | SR | 113 | 41 / 41 |
| tei_minimal.odd | template | nothing but “the ten required elements”; no attribute deletion | | TEI Minimal | JC | 149 | 52 / 52 |
| tei_ms.odd | template | | testms | TEI for Manuscript Description | SR | 99 | 35 / 35 |
| tei_odds.odd | template | | | TEI for authoring ODD | SR | 99 | 37 / 37 |
| tei_simplePrint.odd | sample | | | An Introduction to TEI simplePrint | | 10138 | 2554 / 3592 |
| tei_speech.odd | template | | | TEI for Speech Representation | LR | 110 | 38 / 38 |
| tei_svg.odd | template | required for tei_allPlus | tei_svg | TEI with SVG | SR | 120 | 37 / 39 |
| tei_tite.odd | sample | | | TEI Tite: A recommendation for off-site text encoding | PT | 2530 | 810 / 854 |
| tei_xinclude.odd | demo | | tei_xinclude | TEI with XInclude (experimental) | SR | 208 | 75 / 75 |
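For anyone who wants to re-derive the last two columns, here is a rough sketch of one way to count them. It rests on assumptions not stated in this thread: that "# nodes" means every element, text, and comment node under the root (attributes excluded), and that "TEI/all" means elements in the TEI namespace versus elements in any namespace. The function `count_odd` and the inline `SAMPLE` are illustrative only and are not how the figures above were produced.

```python
from xml.dom import minidom

TEI_NS = "http://www.tei-c.org/ns/1.0"


def count_odd(xml_text):
    """Return (nodes, tei_elements, all_elements) for an XML string.

    Assumption: 'nodes' counts the root element and all of its descendant
    nodes (elements, text, comments, ...), but not attributes.
    """
    doc = minidom.parseString(xml_text)
    nodes = tei_elements = all_elements = 0
    stack = [doc.documentElement]
    while stack:
        node = stack.pop()
        nodes += 1
        if node.nodeType == node.ELEMENT_NODE:
            all_elements += 1
            if node.namespaceURI == TEI_NS:
                tei_elements += 1
        stack.extend(node.childNodes)
    return nodes, tei_elements, all_elements


# Tiny stand-in document (not a real ODD): three TEI elements, one SVG
# element, one text node, and no whitespace between tags.
SAMPLE = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0">'
    "<teiHeader><title>t</title></teiHeader>"
    '<svg xmlns="http://www.w3.org/2000/svg"/>'
    "</TEI>"
)

counts = count_odd(SAMPLE)
```

On a real exemplar the whitespace between tags is counted as text nodes, so the first figure depends on how the file is indented.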
Council VF2F: We should proceed by:
I would like to work on an exemplar that would build a text corpus template from scratch, based strictly on the ISO TC37 SC4 WG6 standards family (MAF, SynAF, LAF; possibly Speech but not as a direct goal), and that would be conditioned on the potential future updates to those standards, which treat the Birnbaum doctrine as a practical guideline rather than a necessary constraint. But, given versioning, and with a clear statement on the conformance target, I think it should be acceptable.
I'd work on that in the LingSIG fork, so that the work space would be open both to ISO experts and to practitioners from individual corpus projects, and to the Council, and to SIG members.
Would that be an acceptable plan?
I'm seeking a "stamp of approval" by the Council, that may be expressed by an assignment of an issue. So that I could cite that as justification for the time spent on this, and as basis for encouraging colleagues from DIN/ISO to potentially contribute (so that they would know that they are not asked to contribute to a pet project).
I don’t understand the conditionality — when those standards get updated, how do we know, and what do we do with the exemplar? — But otherwise it seems like an excellent idea to me.
I may have compressed too much into the message. Perhaps there is no conditionality to speak of, but rather something obvious -- the exemplar would be naturally tied to concrete versions of the relevant standards, and as long as I were in a position to, I would make sure to keep them in sync, else I would hope that someone representing TC37 SC4 WG6 would, simply because such an exemplar would be a handy reference implementation for some of the work done by the committee/WG. If a need arose, the previous versions could surely be stored somewhere accessible, e.g. in the LingSIG github, etc. And I'm happy that you like the idea, Syd! :-)
Council, while looking at another ticket, realized that if you do File / New in Oxygen and choose Corpus from the TEI P5 options, you get a template that does not support the full content model of teiCorpus. This is "correct" because that Exemplar is intended only for linguistic corpora; however, that is not clear from the generic "Corpus" label. That label should be changed to "Linguistic Corpus".