Corpus Exemplar should more explicitly state what it's for

martindholmes commented 5 years ago

Council, looking at another ticket, realized that if in Oxygen you do File / New and choose Corpus from the TEI P5 options, you get something which doesn't support the full content model of teiCorpus. This is "correct" because that Exemplar is intended only for linguistic corpora; however, that's not clear from the generic "Corpus" label. That label should be changed to "Linguistic Corpus".

raducoravu commented 5 years ago

This would mean changing the actual names of the template files on disk:

tei\templates\TEI P5\Corpus.xml
tei\templates\TEI P5\Corpus.properties
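A rough sketch of what that rename could look like, in case it helps scope the change. The helper name `rename_template` and the new stem "Linguistic Corpus" are assumptions for illustration only; nothing here reflects how the Oxygen framework is actually built or packaged.

```python
from pathlib import Path


def rename_template(template_dir: Path, old_stem: str, new_stem: str) -> list[Path]:
    """Rename every file called <old_stem>.<something> in template_dir.

    Hypothetical helper: it keeps whatever follows the stem (".xml",
    ".properties") and only swaps the stem itself.
    """
    renamed = []
    # sorted() materializes the listing first, so renaming while looping is safe.
    for path in sorted(template_dir.iterdir()):
        if path.is_file() and path.name.startswith(old_stem + "."):
            target = path.with_name(new_stem + path.name[len(old_stem):])
            if target.exists():
                # Refuse to overwrite an existing template.
                raise FileExistsError(target)
            path.rename(target)
            renamed.append(target)
    return renamed
```

Calling `rename_template(Path("tei/templates/TEI P5"), "Corpus", "Linguistic Corpus")` would touch only `Corpus.xml` and `Corpus.properties`. Anything else that refers to the old file names would still need checking by hand.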

lb42 commented 4 years ago

What is meant by "the full content model" here? The content model you would get if you used TEI all instead? But in that case, what's the point of defining this ODD at all? Obviously, renaming it is not a bad idea, though if you actually read the ODD text it's pretty explicit!

sydb commented 4 years ago

Better yet, it would mean getting rid of such a useless “customization” altogether. (And to be clear, I don’t really mean getting rid of the Corpus template that @raducoravu mentions above, but rather the ridiculous customization upon which it is based.) I call these customizations that TEI-C makes publicly available, and IMHO should mostly not, “template” customizations. See this article paragraphs 26, 27, & 28 for a list of them. That is to say, in response to @lb42’s question which he posted while I was writing this:

> what's the point of defining this ODD [tei_corpus.odd] at all?

There is none, it should not be defined. At least not in the P5/Exemplars/ directory.

martindholmes commented 4 years ago

I think there's value in providing a decent starting-point for people working on specific types of text. There are two advantages: first, those who cannot and will never customize anything are at least not consigned to working with tei_all; and second, those who can and will customize have a decent example/starting point that they can work from.

lb42 commented 4 years ago

Those who "never will customize" may also perhaps learn why this is a suboptimal strategy by being forced to confront the limitations of the existing exemplars!

bansp commented 4 years ago

It would be interesting to know how many linguistic corpora have been based on this customization. I've always started with TEI Bare myself, but I fully agree with Martin that 'thematic' customizations are useful, and further, they can indeed constitute selling points for some communities.

In the general spirit of how/where this thread is moving, I would say that, especially given the recent and ongoing developments in ISO-TEI standards, it would be good to offer a (nearly) out-of-the-box modern corpus-linguistic customization as a starting point for linguists.

Would it be acceptable to (a) label this customization as "under reconstruction", (b) not allow it to pop up as the first choice the way it recently did for Marjorie and (c) initiate work on its thorough update/upgrade? I'm in a pretty good position to suggest insights from the point of view of ISO, in a process that includes the CLARIN community (so that it's not one man's proposal but rather a community baby). And I believe that I should be able to allocate time for that in my schedule (I'd need to verify that, of course, but chances are good).

martindholmes commented 4 years ago

I don't think we should label anything that's part of the TEI release or the Oxygen plugin as "under construction". The first thing to do, I think, is to re-label this as Linguistic Corpus, and then anyone qualified can work on making it better. The second question for me is whether any kind of parallel "Non-linguistic corpus" schema should be available, or whether we just assume people will then choose tei_all.

sydb commented 4 years ago

@martindholmes:

> I think there's value in providing a decent starting-point for people working on specific types of text.

Agreed. But this (P5/Exemplars/tei_corpus.odd and oxygen-tei/frameworks/tei/templates/TEI P5/Corpus.*) is not even remotely close to a decent starting point. It is better characterized as indecent.

@lb42: A good idea, in theory, but what in fact happens is people use the template ODDs in Exemplars/ as end-point schemas. Which is worse, I think.

@bansp:

> it would be good to offer a (nearly) out-of-the-box modern corpus-linguistic customization as a starting point for linguists.

Agreed. If you can find the time, I’d probably prefer almost anything you come up with. (But @martindholmes is right, it’s not OK to label an Exemplar/ as under construction, especially one that isn’t. I think we should just get rid of the current tei_corpus.odd for now, and do without until you or someone else comes up with something better. But I’ve thought we should get rid of these “template” exemplars for years, and it hasn’t happened yet. So I’m not going to hold my breath.)

bansp commented 4 years ago

Good point about "under construction" in a production deliverable.

sydb commented 4 years ago

I expect this table to be edited in-place, here in this comment. (This probably is not the best place for this information, but I’ve already done it, so …)

| name | category | comments | `@n` | title | author | # nodes | # elements (TEI/all) |
|---|---|---|---|---|---|---|---|
| isofs.odd | demo | just starting point of `<fs>`, `<fLib>`, or `<fvLib>` | isofs | ISO Feature Structures freestanding schema | TEI-C | 76 | 27 / 28 |
| tei_all.odd | necessary | has useful prose description of its limitations | | TEI with maximal setup | SR | 112 | 42 / 42 |
| tei_allPlus.odd | demo | all + SVG + MathML; refers to defunct declarefs module | | TEI with maximal setup, plus external additions | SR | 150 | 51 / 55 |
| tei_bare.odd | sample demo | exemplifies extreme thinning and use of `<specGrp>`s | | TEI Absolutely Bare | LR | 388 | 145 / 145 |
| tei_corpus.odd | template | | | TEI for Linguistic Corpora | SR | 71 | 25 / 25 |
| tei_customization.odd | sample generated | | | TEI ODD Customization for writing TEI ODD Customizations | SB | 20840 | 7369 / 7583 |
| tei_dictionaries.odd | template | | | TEI with minimal setup for dictionaries | | 85 | 30 / 30 |
| tei_docs.odd | template | | | TEI for documentation | LB | 74 | 25 / 25 |
| tei_drama.odd | template | mildly useful: demonstrates attribute deletion (poorly) | | TEI with Drama | SB SR | 146 | 52 / 52 |
| tei_enrich.odd | sample | | enrich | TEI P5 schema for ENRICH | [SR,LB,JC] | 6053 | 1362 / 2108 |
| tei_its.odd | demo | | testminimal | TEI with ITS setup | SR | 124 | 41 / 45 |
| tei_jtei.odd | sample | important | | jTEI input customization | RVB MH | 5597 | 1322 / 1975 |
| tei_lite_fr.odd | sample | | | Encoder pour échanger : une introduction à la TEI | LB MSM | 5531 | 1226 / 2044 |
| tei_lite.odd | sample | | | Encoding for Interchange: an introduction to the TEI | | 6044 | 1283 / 2061 |
| tei_math.odd | template | required for tei_allPlus | | TEI with MathML | SR | 113 | 41 / 41 |
| tei_minimal.odd | template | nothing but “the ten required elements”; no attribute deletion | | TEI Minimal | JC | 149 | 52 / 52 |
| tei_ms.odd | template | | testms | TEI for Manuscript Description | SR | 99 | 35 / 35 |
| tei_odds.odd | template | | | TEI for authoring ODD | SR | 99 | 37 / 37 |
| tei_simplePrint.odd | sample | | | An Introduction to TEI simplePrint | | 10138 | 2554 / 3592 |
| tei_speech.odd | template | | | TEI for Speech Representation | LR | 110 | 38 / 38 |
| tei_svg.odd | template | required for tei_allPlus | tei_svg | TEI with SVG | SR | 120 | 37 / 39 |
| tei_tite.odd | sample | | | TEI Tite: A recommendation for off-site text encoding | PT | 2530 | 810 / 854 |
| tei_xinclude.odd | demo | | tei_xinclude | TEI with XInclude (experimental) | SR | 208 | 75 / 75 |

ebeshero commented 3 years ago

Council VF2F: We should proceed by:

- Moving the problematic ODDs to another directory: ODD-Stubs
- Finding out whether oXygen is pulling directly from Exemplars/ or how they are pulling their TEI sample starter files
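The first step could be sketched roughly as follows. This is only an illustration: the helper name `move_stubs` is hypothetical, which ODDs count as "problematic" is left to the caller, and in the actual repository one would presumably use `git mv` so that file history is preserved.

```python
import shutil
from pathlib import Path


def move_stubs(exemplars: Path, stubs: Path, names: list[str]) -> list[Path]:
    """Move the named ODD files from exemplars/ into stubs/.

    Hypothetical helper: names that do not exist in exemplars/ are skipped,
    and the stubs directory is created if it is missing.
    """
    stubs.mkdir(parents=True, exist_ok=True)
    moved = []
    for name in names:
        src = exemplars / name
        if not src.is_file():
            continue  # nothing to move for this name
        dst = stubs / name
        shutil.move(str(src), str(dst))
        moved.append(dst)
    return moved
```

For example, `move_stubs(Path("P5/Exemplars"), Path("P5/ODD-Stubs"), ["tei_corpus.odd"])` would relocate just the corpus ODD. The location of ODD-Stubs under P5/ is an assumption here. Any ODD that another exemplar depends on would need its references checked before being moved.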

bansp commented 2 years ago

I would like to work on an exemplar that would build a text corpus template from scratch, based strictly on the ISO TC37 SC4 WG6 standards family (MAF, SynAF, LAF; possibly Speech but not as a direct goal), and that would be conditioned on the potential future updates to those standards, which treat the Birnbaum doctrine as a practical guideline rather than a necessary constraint. But, given versioning, and with a clear statement on the conformance target, I think it should be acceptable.

I'd work on that in the LingSIG fork, so that the work space would be open both to ISO experts and to practitioners from individual corpus projects, and to the Council, and to SIG members.

Would that be an acceptable plan?

I'm seeking a "stamp of approval" by the Council, that may be expressed by an assignment of an issue. So that I could cite that as justification for the time spent on this, and as basis for encouraging colleagues from DIN/ISO to potentially contribute (so that they would know that they are not asked to contribute to a pet project).

sydb commented 2 years ago

I don’t understand the conditionality — when those standards get updated, how do we know, and what do we do with the exemplar? — But otherwise it seems like an excellent idea to me.

bansp commented 2 years ago

I may have compressed too much into the message. Perhaps there is no conditionality to speak of, but rather something obvious -- the exemplar would be naturally tied to concrete versions of the relevant standards, and as long as I were in a position to, I would make sure to keep them in sync, else I would hope that someone representing TC37 SC4 WG6 would, simply because such an exemplar would be a handy reference implementation for some of the work done by the committee/WG. If a need arose, the previous versions could surely be stored somewhere accessible, e.g. in the LingSIG github, etc. And I'm happy that you like the idea, Syd! :-)

TEIC / TEI

Corpus Exemplar should more explicitly state what it's for #1916