TEIC / TEI

The Text Encoding Initiative Guidelines
https://www.tei-c.org

obsolete text in chapter 15 Language Corpora #2445

Closed: Conal-Tuohy closed this issue 5 months ago

Conal-Tuohy commented 1 year ago

The text of the chapter appears to assume that the teiCorpus element cannot contain teiCorpus elements, e.g.

In some cases, the design of a corpus is reflected in its internal structure. For example, a corpus of newspaper extracts might be arranged to combine all stories of one type (reportage, editorial, reviews, etc.) into some higher-level grouping, possibly with sub-groups for date, region, etc. The teiCorpus element provides no direct support for reflecting such internal corpus structure in the markup: it treats the corpus as an undifferentiated series of components, each tagged TEI.

If it is essential to reflect a single permanent organization of a corpus into sub- and sub-sub-corpora, then the corpus or the high-level subcorpora may be encoded as composite texts, using the group element described below and in section 4.3.1 Grouped Texts.

https://github.com/TEIC/TEI/blob/dev/P5/Source/Guidelines/en/CC-LanguageCorpora.xml#L178C1-L189C63
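
For illustration, here is a minimal sketch of the nesting that the quoted prose rules out: a teiCorpus that contains a teiCorpus, which in turn contains a TEI document. It assumes the current content model permits this, as the issue states. All titles and text are invented placeholders, not taken from the Guidelines.

```xml
<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <!-- Header for the whole (hypothetical) newspaper corpus -->
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Hypothetical newspaper corpus</title></titleStmt>
      <publicationStmt><p>Unpublished sketch.</p></publicationStmt>
      <sourceDesc><p>Invented for illustration.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <!-- Subcorpus grouping all stories of one type, e.g. editorials -->
  <teiCorpus>
    <teiHeader>
      <fileDesc>
        <titleStmt><title>Editorials (hypothetical subcorpus)</title></titleStmt>
        <publicationStmt><p>Unpublished sketch.</p></publicationStmt>
        <sourceDesc><p>Invented for illustration.</p></sourceDesc>
      </fileDesc>
    </teiHeader>
    <!-- One corpus text inside the subcorpus -->
    <TEI>
      <teiHeader>
        <fileDesc>
          <titleStmt><title>A single editorial (placeholder)</title></titleStmt>
          <publicationStmt><p>Unpublished sketch.</p></publicationStmt>
          <sourceDesc><p>Invented for illustration.</p></sourceDesc>
        </fileDesc>
      </teiHeader>
      <text>
        <body>
          <p>Placeholder text.</p>
        </body>
      </text>
    </TEI>
  </teiCorpus>
</teiCorpus>
```

With nesting like this, sub-groups for date or region could in principle be expressed as further teiCorpus levels rather than through the group element.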

sydb commented 1 year ago

Good catch. In fact, the prose in the rest of this section (up to “Contextual Information”) needs work, too.

bansp commented 9 months ago

One could mention in this section the reverse approach that several corpora have used: each corpus document includes the header from every level (main corpus, subcorpus, maybe even sub-subcorpus) and so becomes a well-described free-standing object. An example can be seen at http://nlp.ipipan.waw.pl/TEI4NKJP/example_all_levels_1M/text.xml

<teiCorpus xmlns:xi="http://www.w3.org/2001/XInclude" xmlns="http://www.tei-c.org/ns/1.0">
  <xi:include href="NKJP_1M_header.xml"/>
  <TEI>
    <xi:include href="header.xml"/>
    <text xml:id="txt_text" xml:lang="pl">
      <body xml:id="txt_body">
        <!-- ... -->
      </body>
    </text>
  </TEI>
</teiCorpus>

This way, there is no fear that a tool that attempts to read the root corpus document (with XInclusions) will choke on gigabytes of text pulled in for the individual subcorpora and documents.
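
As a rough sketch of how the same idea might extend to three levels, the document below wraps a single text in a subcorpus and a main corpus, pulling in each level's header by XInclude. It is an assumption about how such a layout could look, not a description of any existing corpus. The file names corpus_header.xml and subcorpus_header.xml are hypothetical; header.xml follows the naming in the example above.

```xml
<!-- Hypothetical free-standing document carrying headers from all three levels -->
<teiCorpus xmlns="http://www.tei-c.org/ns/1.0"
           xmlns:xi="http://www.w3.org/2001/XInclude">
  <!-- Main corpus header (hypothetical file name) -->
  <xi:include href="corpus_header.xml"/>
  <teiCorpus>
    <!-- Subcorpus header (hypothetical file name) -->
    <xi:include href="subcorpus_header.xml"/>
    <TEI>
      <!-- Header of this individual document -->
      <xi:include href="header.xml"/>
      <text xml:lang="pl">
        <body>
          <p>Placeholder text.</p>
        </body>
      </text>
    </TEI>
  </teiCorpus>
</teiCorpus>
```

Once an XInclude-aware processor (for example xmllint with its --xinclude option) has resolved the inclusions, the document carries its full descriptive context while containing only its own text.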

sydb commented 7 months ago

@Conal-Tuohy — Created PR, but do not seem to be able to add you as reviewer. Would you mind taking a look?

Conal-Tuohy commented 7 months ago

It looks good to me, @sydb!