TEIC / TEI

The Text Encoding Initiative Guidelines
https://www.tei-c.org

Corpus Exemplar should more explicitly state what it's for #1916

Open martindholmes opened 5 years ago

martindholmes commented 5 years ago

Council looking at another ticket realized that if in Oxygen you do File / New and choose Corpus from the TEI P5 options, you get something which doesn't support the full content model of teiCorpus. This is "correct" because that Exemplar is intended only for linguistic corpora; however, that's not clear from the generic "Corpus" label. That label should be changed to "Linguistic Corpus".

raducoravu commented 5 years ago

This would mean changing the actual names of the template files on disk:

tei\templates\TEI P5\Corpus.xml
tei\templates\TEI P5\Corpus.properties

lb42 commented 4 years ago

What is meant by "the full content model" here? The content model you would get if you used TEI all instead? But in that case, what's the point of defining this ODD at all? Obviously, renaming it is not a bad idea, though if you actually read the ODD text it's pretty explicit!

sydb commented 4 years ago

Better yet, it would mean getting rid of such a useless “customization” altogether. (And to be clear, I don’t really mean getting rid of the Corpus template that @raducoravu mentions above, but rather the ridiculous customization upon which it is based.) I call these customizations that TEI-C makes publicly available, and IMHO should mostly not, “template” customizations. See this article paragraphs 26, 27, & 28 for a list of them. That is to say, in response to @lb42’s question which he posted while I was writing this:

> what's the point of defining this ODD [tei_corpus.odd] at all?

There is none, it should not be defined. At least not in the P5/Exemplars/ directory.

martindholmes commented 4 years ago

I think there's value in providing a decent starting-point for people working on specific types of text. There are two advantages: first, those who cannot and will never customize anything are at least not consigned to working with tei_all; and second, those who can and will customize have a decent example/starting point that they can work from.

lb42 commented 4 years ago

Those who "never will customize" may also perhaps learn why this is a suboptimal strategy by being forced to confront the limitations of the existing exemplars!

bansp commented 4 years ago

It would be interesting to know how many linguistic corpora have been based on this customization. I've always started with TEI Bare myself, but I fully agree with Martin that 'thematic' customizations are useful, and further, they can indeed constitute selling points for some communities.

In the general spirit of how/where this thread is moving, I would say that, especially given the recent and ongoing developments in ISO-TEI standards, it would be good to offer a (nearly) out-of-the-box modern corpus-linguistic customization as a starting point for linguists.

Would it be acceptable to (a) label this customization as "under reconstruction", (b) not allow it to pop up as the first choice the way it recently did for Marjorie and (c) initiate work on its thorough update/upgrade? I'm in a pretty good position to suggest insights from the point of view of ISO, in a process that includes the CLARIN community (so that it's not one man's proposal but rather a community baby). And I believe that I should be able to allocate time for that in my schedule (I'd need to verify that, of course, but chances are good).

martindholmes commented 4 years ago

I don't think we should label anything that's part of the TEI release or the Oxygen plugin as "under construction". The first thing to do is I think to re-label this as Linguistic Corpus, and then anyone qualified can work on making it better. The second question for me is whether any kind of parallel "Non-linguistic corpus" schema should be available, or whether we just assume people will then choose tei_all.

sydb commented 4 years ago

@martindholmes:

> I think there's value in providing a decent starting-point for people working on specific types of text.

Agreed. But this (P5/Exemplars/tei_corpus.odd and oxygen-tei/frameworks/tei/templates/TEI P5/Corpus.*) is not even remotely close to a decent starting point. It is better characterized as indecent.

@lb42: A good idea, in theory, but what in fact happens is people use the template ODDs in Exemplars/ as end-point schemas. Which is worse, I think.

@bansp:

> it would be good to offer a (nearly) out-of-the-box modern corpus-linguistic customization as a starting point for linguists.

Agreed. If you can find the time, I’d probably prefer almost anything you come up with. (But @martindholmes is right, it’s not OK to label an Exemplar/ as under construction, especially one that isn’t. I think we should just get rid of the current tei_corpus.odd for now, and do without until you or someone else comes up with something better. But I’ve thought we should get rid of these “template” exemplars for years, and it hasn’t happened yet. So I’m not going to hold my breath.)

bansp commented 4 years ago

Good point about "under construction" in a production deliverable.

sydb commented 4 years ago

I expect this table to be edited in-place, here in this comment. (This probably is not the best place for this information, but I’ve already done it, so …)

| name | category | comments | @n | title | author | # nodes | # elements (TEI/all) |
|---|---|---|---|---|---|---|---|
| isofs.odd | demo | just starting point of <fs>, <fLib>, or <fvLib> | isofs | ISO Feature Structures freestanding schema | TEI-C | 76 | 27 / 28 |
| tei_all.odd | necessary | has useful prose description of its limitations | | TEI with maximal setup | SR | 112 | 42 / 42 |
| tei_allPlus.odd | demo | all + SVG + MathML; refers to defunct declarefs module | | TEI with maximal setup, plus external additions | SR | 150 | 51 / 55 |
| tei_bare.odd | sample demo | exemplifies extreme thinning and use of <specGrp>s | | TEI Absolutely Bare | LR | 388 | 145 / 145 |
| tei_corpus.odd | template | | | TEI for Linguistic Corpora | SR | 71 | 25 / 25 |
| tei_customization.odd | sample | generated | | TEI ODD Customization for writing TEI ODD Customizations | SB | 20840 | 7369 / 7583 |
| tei_dictionaries.odd | template | | | TEI with minimal setup for dictionaries | | 85 | 30 / 30 |
| tei_docs.odd | template | | | TEI for documentation | LB | 74 | 25 / 25 |
| tei_drama.odd | template | mildly useful: demonstrates attribute deletion (poorly) | | TEI with Drama | SB SR | 146 | 52 / 52 |
| tei_enrich.odd | sample | | enrich | TEI P5 schema for ENRICH | [SR,LB,JC] | 6053 | 1362 / 2108 |
| tei_its.odd | demo | | testminimal | TEI with ITS setup | SR | 124 | 41 / 45 |
| tei_jtei.odd | sample | important | | jTEI input customization | RVB MH | 5597 | 1322 / 1975 |
| tei_lite_fr.odd | sample | | | Encoder pour échanger : une introduction à la TEI | LB MSM | 5531 | 1226 / 2044 |
| tei_lite.odd | sample | | | Encoding for Interchange: an introduction to the TEI | | 6044 | 1283 / 2061 |
| tei_math.odd | template | required for tei_allPlus | | TEI with MathML | SR | 113 | 41 / 41 |
| tei_minimal.odd | template | nothing but “the ten required elements”; no attribute deletion | | TEI Minimal | JC | 149 | 52 / 52 |
| tei_ms.odd | template | | testms | TEI for Manuscript Description | SR | 99 | 35 / 35 |
| tei_odds.odd | template | | | TEI for authoring ODD | SR | 99 | 37 / 37 |
| tei_simplePrint.odd | sample | | | An Introduction to TEI simplePrint | | 10138 | 2554 / 3592 |
| tei_speech.odd | template | | | TEI for Speech Representation | LR | 110 | 38 / 38 |
| tei_svg.odd | template | required for tei_allPlus | tei_svg | TEI with SVG | SR | 120 | 37 / 39 |
| tei_tite.odd | sample | | | TEI Tite: A recommendation for off-site text encoding | PT | 2530 | 810 / 854 |
| tei_xinclude.odd | demo | | tei_xinclude | TEI with XInclude (experimental) | SR | 208 | 75 / 75 |

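
For anyone wanting to produce counts like those in the last two columns, here is a minimal Python sketch. It is not necessarily how the figures above were obtained. It assumes that "# nodes" means every DOM node below the document node (elements, text, comments, processing instructions, but not attributes) and that "TEI/all" means elements in the TEI namespace versus elements in any namespace; neither definition is stated in this thread. The function name `count_odd` and the sample document are made up for illustration.

```python
from xml.dom import minidom, Node

TEI_NS = "http://www.tei-c.org/ns/1.0"


def count_odd(xml_text):
    """Return (all_nodes, tei_elements, all_elements) for an XML string.

    Assumption: attributes and the document node itself are not counted.
    """
    doc = minidom.parseString(xml_text)
    nodes = tei = elements = 0
    stack = list(doc.childNodes)  # walk the tree without recursion
    while stack:
        node = stack.pop()
        nodes += 1
        if node.nodeType == Node.ELEMENT_NODE:
            elements += 1
            if node.namespaceURI == TEI_NS:
                tei += 1
        stack.extend(node.childNodes)
    return nodes, tei, elements


# Hypothetical sample: five TEI elements, one SVG element, one comment, one text node.
sample = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0">'
    '<teiHeader/><!-- note -->'
    '<text><body><p>Hi</p>'
    '<svg xmlns="http://www.w3.org/2000/svg"/>'
    '</body></text></TEI>'
)
print(count_odd(sample))  # (8, 5, 6)
```

On a real file the text would be read from disk first, e.g. `count_odd(open("tei_corpus.odd", encoding="utf-8").read())`; whitespace-only text nodes are counted as nodes under the assumption above.
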
ebeshero commented 3 years ago

See also https://github.com/TEIC/TEI/issues/1572

ebeshero commented 3 years ago

Council VF2F: We should proceed by:

bansp commented 2 years ago

I would like to work on an exemplar that would build a text corpus template from scratch, based strictly on the ISO TC37 SC4 WG6 standards family (MAF, SynAF, LAF; possibly Speech but not as a direct goal), and that would be conditioned on the potential future updates to those standards, which treat the Birnbaum doctrine as a practical guideline rather than a necessary constraint. But, given versioning, and with a clear statement on the conformance target, I think it should be acceptable.

I'd work on that in the LingSIG fork, so that the work space would be open both to ISO experts and to practitioners from individual corpus projects, and to the Council, and to SIG members.

Would that be an acceptable plan?

I'm seeking a "stamp of approval" from the Council, which could be expressed by the assignment of an issue, so that I could cite it as justification for the time spent on this and as a basis for encouraging colleagues from DIN/ISO to contribute (so that they would know they are not being asked to contribute to a pet project).

sydb commented 2 years ago

I don’t understand the conditionality — when those standards get updated, how do we know, and what do we do with the exemplar? — But otherwise it seems like an excellent idea to me.

bansp commented 2 years ago

I may have compressed too much into the message. Perhaps there is no conditionality to speak of, but rather something obvious -- the exemplar would be naturally tied to concrete versions of the relevant standards, and as long as I were in a position to, I would make sure to keep them in sync, else I would hope that someone representing TC37 SC4 WG6 would, simply because such an exemplar would be a handy reference implementation for some of the work done by the committee/WG. If a need arose, the previous versions could surely be stored somewhere accessible, e.g. in the LingSIG github, etc. And I'm happy that you like the idea, Syd! :-)