att.linguistic for <w> and <pc>

bansp commented 7 years ago

Quick links:

diff of the pull request (will be kept synced against TEIC/TEI/dev)
suggested text of the relevant chapter (minimal changes, pending acceptance)
suggested documentation of att.linguistic

Introduction and summary

We propose an attribute class that gathers token-level attributes facilitating simple linguistic annotation. We propose that <w> and <pc> should be its members by default. At the same time, grouping these attributes in a separate class (rather than defining them inside an element) makes it possible for other token-level elements to be added to that class in TEI customizations. Our proposal addresses both future uses (extending the TEI offer towards corpus linguists) and existing resources (enriching existing corpora, for example literary or historical ones, with basic linguistic annotation).

The proposed class, att.linguistic, contains first of all two linguistic attributes originally defined inside <w>, namely @lemma and @lemmaRef. In addition to these, we suggest adding the following:

@pos (for part-of-speech)
@msd (for morphosyntactic description)
@reg (for the regularized form in e.g. historical corpora)
@join to signal the absence of whitespace next to the element -- this is meant for all non-mixed-content contexts where the whitespace information is not conveyed by the actual presence of whitespace characters

We stress that the goal of this feature request is not to facilitate in-depth linguistic annotation, but rather to equip "off-the-shelf" TEI with the very basic tools that linguists can use, and that non-linguists can safely add to their existing resources in order to enhance them. This is why we address the TEI namespace directly. Below, we first present the motivation for this feature request, offer a detailed description of the suggested attributes, and then address some potential counterproposals and provide further examples.

This is a lengthy ticket, but it concerns a complex matter, and we want to minimize the danger of a multi-thread discussion developing in the comments section over points that we can address together and in a structured manner. The ticket is accompanied by a pull request (see the link above) which, among other things, suggests documentation for the new class. This is in order to be maximally clear about what we suggest and also to minimize the workload expected of the Technical Council.

Context and motivation

The topic of the perceived inadequacy of the vanilla TEI recommendations (including the "Corpus" extension available from Roma) for building technologically effective, rather than merely theoretically "nice" linguistic resources has stirred some of the TEI community for years, essentially ever since the massively successful fork of the early TEI called CES (Corpus Encoding Standard). (X)CES and its variants have dominated the scene of light-weight language resources for years, gathering a large group of followers and producing an immense (for that time) amount of linguistic encoding.

Today, the TEI has a chance to claim a large part of the language resource market and to present an attractive offer not only to linguists as such, but also to philologists and text technologists who appreciate the added value that comes from enriching the existing structural markup with linguistic information. Especially in resources that already use the <w> element, enrichment with part-of-speech and morphosyntactic information in addition to lemma identification drastically expands search options, options for sorting the results, or for investigating author-specific traits. Being able to provide local information on regularized forms affects search and display, and makes the system far more user-friendly and useful.

Efforts targeting the enhancement of TEI markup for lexical resources as well as for constructing structurally robust multi-layer linguistic annotation are the topics of other projects that the LingSIG is involved in. In this very ticket, we concentrate on minimal extensions to the attribute repertoire of <w>, aiming at a compromise between the needs of modern-day language technology and the tendency to keep extensions to the mainstream TEI in check.

Description of the proposed additions

`@pos` and `@msd`

"POS" stands for "part of speech" and the attribute contains a symbol that classifies the content of the given <w> in opposition to other elements in the POS space, according to predefined criteria (typically a mixture of lexical, syntactic, and semantic ones). Typical, textbook-level parts of speech include the familiar "noun" and "verb", but for linguistic-technological purposes the divisions are more fine-grained. The number of symbols in a tagset depends on what criteria are taken into account and on how much morphosyntactic information is mixed into the POS tagset (in practice, mixing "all" morphosyntactic information into the POS tagset is only possible in languages with relatively simple morphosyntax). A very simple repertoire of POS values can be found in the so-called Universal POS Tagset, where no morphosyntactic information is present. On the other end of the scale, one could place e.g. the CLAWS-8 tagset, which mixes parts of speech with morphosyntax, and thus distinguishes, e.g., between "base form of a lexical verb" and "past tense of a lexical verb" and "past participle of lexical verb" (all three of which would be just a "verb" in the 'Universal Tagset'). We mention this information here in order to explain why, for certain languages, for certain uses, and for certain tools, POS information and morphosyntactic information can be merged.

Most parts of speech have sets of morphological or morphosyntactic categories associated with them (prepositions in English do not, verbs do). Information on the particular values of those categories constitutes morphosyntactic description and @msd is the place for it. Recall that some extended POS tagsets incorporate morphosyntactic description directly, but that is not an option for all languages. Tens of corpora (e.g. those in MULTEXT and MULTEXT-East projects) have followed the CES guidelines which distinguish between POS and MSD. See the original CES documentation (only mildly helpful) and a fragment of T. Erjavec's tutorial on morphosyntactic tagging for the Multext-East project (using CES at that time, note that 'ctag' was CES's name for 'pos'):

  <tok type=WORD>
      <orth>glass</orth>
      <disamb><base>glass</base><msd>Afp</msd><ctag>ADJE</ctag></disamb>
      <lex><base>glass</base><msd>Afp</msd><ctag>ADJE</ctag></lex>
      <lex><base>glass</base><msd>Ncns</msd><ctag>NN</ctag></lex>
     </tok>

Above, <lex> lists the output of the morphological analyser, whereas <disamb> encloses a disambiguated sequence. This is still mild. Now compare a case from the National Corpus of Polish (file ann_morphosyntax.xml; warning: the file is HUGE), a set of potential interpretations of the adjectival form 'kategoryczne' ("base" is the lemma):

       <f name="interps">
          <fs type="lex" xml:id="morph_p-7.208-seg_0-lex">
           <f name="base">
            <string>kategoryczny</string>
           </f>
           <f name="ctag">
            <symbol value="adj"/>
           </f>
           <f name="msd">
            <vAlt>
             <symbol value="pl:nom:m2:pos"/>
             <symbol value="pl:nom:m3:pos"/>
             <symbol value="pl:nom:f:pos"/>
             <symbol value="sg:nom:n:pos"/>
             <symbol value="pl:nom:n:pos"/>
             <symbol value="pl:acc:m2:pos"/>
             <symbol value="pl:acc:m3:pos"/>
             <symbol value="pl:acc:f:pos"/>
             <symbol value="sg:acc:n:pos"/>
             <symbol value="pl:acc:n:pos"/>
             <symbol value="pl:voc:m2:pos"/>
             <symbol value="pl:voc:m3:pos"/>
             <symbol value="pl:voc:f:pos"/>
             <symbol value="sg:voc:n:pos"/>
             <symbol value="pl:voc:n:pos"/>
            </vAlt>
           </f>
          </fs>
        </f>

The first option denotes a bundle of features "plural+nominative+animate masculine+positive" (see the tagset documentation for other categories and values).
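To make the relation between such a feature bundle and the proposed attributes concrete, here is a minimal Python sketch. It assumes a single, already disambiguated reading expressed with the proposed @pos and @msd attributes; the `<w>` element below is a hypothetical re-encoding, not data from the National Corpus of Polish. The category order (number, case, gender, degree) is read off the gloss above, and the names `ADJ_CATEGORIES` and `parse_msd` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: category order for adjectives, taken from the gloss
# "plural+nominative+animate masculine+positive". Other parts of speech
# would need their own category lists.
ADJ_CATEGORIES = ("number", "case", "gender", "degree")


def parse_msd(msd, categories=ADJ_CATEGORIES):
    """Split a colon-separated @msd value into a category -> value dict."""
    values = msd.split(":")
    if len(values) != len(categories):
        raise ValueError(
            f"expected {len(categories)} values, got {len(values)}: {msd!r}"
        )
    return dict(zip(categories, values))


# Hypothetical encoding of one reading using the proposed attributes.
w = ET.fromstring(
    '<w lemma="kategoryczny" pos="adj" msd="pl:nom:m2:pos">kategoryczne</w>'
)
features = parse_msd(w.get("msd"))
print(w.get("pos"), features)
# adj {'number': 'pl', 'case': 'nom', 'gender': 'm2', 'degree': 'pos'}
```

The point of the sketch is only that @pos and @msd can be read as plain attribute values, with no pointer resolution; how an @msd string is structured internally remains a tagset-specific decision.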

The 'universal' (in a pragmatic sense) repertoire of morphosyntactic features can be seen in the guidelines offered by the Universal Dependencies project.

`@reg`

In historical corpora, regularization is a very frequent matter which may be applied not only to words but also to punctuation characters.

Our argument for introducing @reg as an attribute of <w> rests on the following assumptions and observations:

@reg greatly simplifies the encoding of some historical corpora -- it is not a theoretical wonder and cure-all, it is just an attested convenience
it is attested in large datasets (e.g., EEBO/TCP, DTA -- see examples below), and we believe that it would constitute (a) a "political" advantage for the TEI to codify the usage of this attribute (in the sense of embracing a greater number of resources out-of-the-box), and that (b) codifying its usage would be of theoretical benefit for encoders, because of the several non-obvious decisions that must be made when deciding on using regularization and because of some fringe phenomena that should receive a clear interpretation -- these issues should ideally be highlighted and documented in the Guidelines, rather than subject to ad-hoc decisions.

The @reg attribute can be used to mark up regularization in corpora with non-standardized spelling (historical corpora, literary corpora of historical texts, etc.). The following examples illustrate this:

<w reg="freiwillig">freywillig</w>
<pc reg="," join="left">/</pc>
<w reg="unbedrängt">vnbedraͤngt</w>
<w reg="und">vnd</w>
<w reg="unverhindert">vnuerhindert</w>

Source: Aviso. Relation oder Zeitung. Wolfenbüttel, 1609. In: Deutsches Textarchiv.

<w reg="unvermutete">vnuermuthete</w>
<w reg="Freundschaft">Freundſchafft</w>
<w reg="angeboten">angebotten</w>

Source: Gottfried, Newe Welt Vnd Americanische Historien. Frankfurt/M., 1631. In: Deutsches Textarchiv.
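As a rough illustration of how a tool might consume @reg, the following Python sketch returns either the original or the regularized token stream from the second example above. It assumes un-namespaced elements wrapped in an `<s>` (added here only to obtain a well-formed fragment), and the function name `token_forms` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The <s> wrapper is an assumption added to make the fragment well-formed.
SAMPLE = """<s>
  <w reg="unvermutete">vnuermuthete</w>
  <w reg="Freundschaft">Freundſchafft</w>
  <w reg="angeboten">angebotten</w>
</s>"""


def token_forms(sentence, regularized=True):
    """Return the token forms of a sentence, preferring @reg if requested."""
    forms = []
    for tok in sentence:
        if tok.tag not in ("w", "pc"):
            continue
        # Text content of the token, including text inside child elements.
        original = "".join(tok.itertext()).strip()
        # Fall back to the original form when no @reg is present.
        forms.append(tok.get("reg", original) if regularized else original)
    return forms


sentence = ET.fromstring(SAMPLE)
print(token_forms(sentence))
# ['unvermutete', 'Freundschaft', 'angeboten']
```

Note that the original form is collected with `itertext()`, which would also pick up text from any elements nested inside a token; the regularized form needs no such untangling because it is read from the attribute.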

The TEI has at its disposal the powerful mechanism of //choice/reg|orig for such cases. However, with the usage of //choice/reg|orig for the annotation of regularized word forms, text annotation in specialized corpora may need to be extended significantly since there may already exist further (elaborate) annotation on or below the token level, which gets automatically multiplied with the use of <choice>. If the <choice> machinery is applied, these existing annotations will end up being represented twice: once in <reg> and once in <orig>. Thus texts might be significantly extended due to such unnecessary repetition, which may also negatively affect the sustainability of the resource. The following examples illustrate this problem:

Example: Tagging of highlighting within a word (here: typeface switch from Fraktur to Antiqua):

<!-- tagging with //choice -->
<choice>
  <orig>
    <w lemma="wohlstilisierend" pos="ADJA">Wohl-<hi rendition="#aq">ſtyliſi</hi>rende</w>
  </orig>
  <reg>
    <w lemma="wohlstilisierend" pos="ADJA">Wohl<hi rendition="#aq">stilisie</hi>rende</w>
  </reg>
</choice>

<!-- tagging with @reg -->
<w lemma="wohlstilisierend" pos="ADJA" reg="Wohlstilisierende">
  Wohl-<hi rendition="#aq">ſtyliſi</hi>rende
</w>

Source: Marperger, Der allzeit-fertige Handels-Correspondent. Hamburg, 1717, facs. 655. In: Deutsches Textarchiv.

Example: Word syllabification and hyphenation at the end of a page

In cases of word syllabification and hyphenation at the end of a page, comparably large amounts of text material have to be included in the <w> element (like running titles, catchwords, pagebreaks). The use of the <choice> mechanism would mean extending this by repetition. The alternative is a single attribute doing the necessary work -- compare what would have to become of the example below if the //choice/reg|orig machinery were to be employed, against using just the following substitution in the second line: <w reg="desselbigen">deſ-<lb/>.

<w>Flecken</w> <w>oder</w> <w>Dorff</w> 
<w>deſ-<lb/>
  <fw place="bottom" type="catch">ſelbi-</fw>
  <pb facs="#f0672" n="656"/>
  <fw place="top" type="header">
    <hi rendition="#b">Von der</hi>
    <hi rendition="#aq #i">Præſtan</hi><hi rendition="#b">tz</hi>
    <hi rendition="#b">und Vortreflichkeit</hi>
  </fw><lb/>
ſelbigen</w> 
<w>Landes</w>

Source: Marperger, Der allzeit-fertige Handels-Correspondent. Hamburg, 1717. In: Deutsches Textarchiv. part 1: facs. 671, part 2: facs. 672.

Example: Word syllabification and hyphenation combined with marginal notes

Furthermore, consider cases of word syllabification and hyphenation at the end of a line which is followed by a marginal note. Here, the lexical material of the marginal note would be subject to linguistic analysis as well, thus resulting in several nested //choice/reg|orig elements, increasing the redundancy and the curation effort.

<w>eigen-
  <note place="right">
    <w>Drey</w>
    <w>Ei-<lb/>genſchaff-<lb/>ten</w>
    <w>der</w><lb/>
    <w>Seelen</w><lb/>
    <w>einge-<lb/>pflantzet.</w>
  </note><lb/>
ſchafften</w>

Source: Arndt: Von wahrem Christenthumb. Vol. 1. Magdeburg, 1610. In: Deutsches Textarchiv

When using <choice>, one would always have to include the additional text parts twice in the transcription (once in <orig> and once in <reg>) or to shift the complexity elsewhere (without diminishing it -- escaping complexity in this case may mean standoff markup or fragmented markup, neither option being particularly attractive). We argue that, while not as powerful as <choice>, in the cases presented here and similar, @reg constitutes a very reasonable encoding compromise that will allow numerous encoding projects to embrace TEI nearly out-of-the-box.

Several issues need to be made clear when applying regularization, and below, we mention the most important of them. These issues should be highlighted and exemplified in the documentation, and we will be happy to submit the relevant extensions to the Guidelines and/or specs if the present request is accepted.

(attribute issue) @reg may need to store multi-word sequences (example: German historical form ſovieler regularized into "so vieler"); in this, @reg is fully analogous to the existing attribute @lemma, using teidata.text as its datatype; the analogy with @lemma extends to most other issues concerning @reg as an attribute.
(attribute issue) many-to-one mismatches: two or more consecutive words can be regularized/modernized into a single word; in such cases, the question is whether the given corpus can 'reserve' the @prev and @next attributes to perform the linking role (together with a convention saying that, e.g., the first of such <w>s is going to carry the @reg attribute). If this cannot be guaranteed, it means that the <choice> system needs to be used instead of @reg.
(general issue) a decision which data stream, original or regularized, is subject to grammatical (POS, MSD) annotation -- this should be stated in any case, probably in projectDesc (unless the Council suggests a different placement).
- some combinations of the options (original/regularized, <choice>/@reg) may be difficult or impossible to realise -- when, for example, a single-token original is regularized to more than one token inside the @reg attribute, and it's the regularized data stream that should be annotated for POS. We do not treat such potential cases as invalidating the entire approach; they merely show that the convenience @reg attribute is not usable in just such cases, and that the encoder has to resort to the <choice> machinery and possibly deal with the overwhelming complexity of a construction rooted in <choice> (as illustrated in the examples in this section).
(general issue): non-consecutive words are regularized into a single word;
- many sub-variants of the case are imaginable, and -- paradoxically perhaps -- in the absence of a standoff analysis, it is the @reg approach together with fragmentation handling (@prev and @next) that may be a better tool to handle some of such cases, because using <choice> would involve not only additional lexical information but would also require a potentially arbitrary word-order decision in the <reg> branch.
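The many-to-one convention mentioned in the list above (the first linked <w> carries @reg, the others are chained with @prev and @next) can be sketched as follows. This is only one possible reading of that convention. The sample data is invented for illustration and is not taken from any of the cited corpora, and `regularized_stream` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Invented example: two original tokens regularized into one word.
SAMPLE = """<s>
  <w xml:id="w1" next="#w2" reg="zusammen">zu</w>
  <w xml:id="w2" prev="#w1">sammen</w>
  <w xml:id="w3" reg="kommen">kommen</w>
</s>"""


def regularized_stream(sentence):
    """Build the regularized stream under the 'first <w> carries @reg' rule."""
    forms = []
    for tok in sentence:
        if tok.tag not in ("w", "pc"):
            continue
        if tok.get("prev") is not None:
            # Continuation of a chain: already covered by @reg on the head.
            continue
        forms.append(tok.get("reg", "".join(tok.itertext()).strip()))
    return forms


print(regularized_stream(ET.fromstring(SAMPLE)))
# ['zusammen', 'kommen']
```

The sketch only works if @prev and @next are reserved for this purpose in the corpus, which is exactly the condition stated above; it does not attempt the non-consecutive case.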

`@join`

Consider a version of an example used in Chapter 17: "What did you make up?" (unlike the Guidelines, we retain the punctuation).

<s><w>What</w> <w>did</w> <w>you</w> <w xml:id="mk01">make</w> <w xml:id="up01">up</w><pc>?</pc></s>

Now contrast this with the version actually provided there (we add the <pc> for completeness, but the argument would hold for <w> as well, in compounds written together, but commonly analysed as composed of separate tokens).

<s>
 <w>What</w>
 <w>did</w>
 <w>you</w>
 <w>make</w>
 <w>up</w>
 <pc>?</pc>
</s>

The first, inline example provides more information than the second -- it uses additional typographical markup, namely whitespace. The second, sequential example only lists the tokens in the order found in the sentence, but loses the information on the lack (or the presence) of whitespace. In order to preserve this kind of information, we adopt the @join attribute from ISO MAF (Morpho-syntactic Annotation Framework, ISO 24611:2012), with the values: 'left', 'right', and 'both'. With this attribute, our sequential example would be rendered as follows:

<s>
 <w>What</w>
 <w>did</w>
 <w>you</w>
 <w>make</w>
 <w join="right">up</w>
 <pc join="left">?</pc>
</s>

An issue may arise concerning the redundancy of marking the absence of whitespace on two elements. From the top-down, global perspective, it is indeed redundant. From the bottom-up, "streamable" perspective, it is not redundant, and it is the latter perspective that we assume as fundamental: we want to make the TEI more attractive for the linguistic enrichment of massive amounts of text that are submitted to rapid processing in a streaming fashion. Nothing precludes project-specific approaches to this issue. For example, the National Corpus of Polish only marked the absence of the preceding whitespace, on all segments. We would like to provide support for the general, redundant, "streamable" case, of which project-specific decisions can be proper subsets.
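A minimal Python sketch of how a consumer could rebuild the surface string from the sequential encoding follows. It assumes un-namespaced `<w>` and `<pc>` children of `<s>` and treats only 'left', 'right' and 'both' as meaningful; the function name `detokenize` is hypothetical. Because the check looks at both neighbours, it accepts the redundant marking as well as a project-specific subset that marks only one side.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<s>
 <w>What</w>
 <w>did</w>
 <w>you</w>
 <w>make</w>
 <w join="right">up</w>
 <pc join="left">?</pc>
</s>"""


def detokenize(sentence):
    """Rebuild the surface string, omitting whitespace where @join says so."""
    parts = []
    prev_join = None
    for tok in sentence:
        if tok.tag not in ("w", "pc"):
            continue
        join = tok.get("join")
        # No space if the previous token joins rightwards
        # or the current token joins leftwards.
        glued = prev_join in ("right", "both") or join in ("left", "both")
        if parts and not glued:
            parts.append(" ")
        parts.append((tok.text or "").strip())
        prev_join = join
    return "".join(parts)


print(detokenize(ET.fromstring(SAMPLE)))
# What did you make up?
```

In a streaming setting the redundant marking has a practical effect: a token with join="right" tells the consumer not to emit a space before the next token has even been read.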

Existing alternatives to the proposed attributes

It might be suggested that the existing attributes @ana, @corresp and @type can be used for the purpose of providing lightweight linguistic markup. This section argues against that suggestion on several grounds.

A fixed, homogeneous set of linguistic features

Especially in the existing literary resources, which may become subject to enrichment with linguistic markup, the three attributes above, in any combination, may have already been put to work for the purposes of literary analysis, prosopography, etc. In such cases, the devil's-advocate suggestion would be "use those of them that happen to be available and squeeze linguistic analysis into them". Such an approach, however, would immediately lead to heterogeneity exactly where we need order and homogeneity: linguistic annotation would generally start with information on lemma, POS, often also morphosyntactic description and (especially for texts with non-standardized orthography) the regularized form. The set of terms used to name these features is quite well established among linguists. Giving up this homogeneity already at the start in favour of ad hoc arrangements would have severe effects on the interoperability of the resulting corpora.

Our goal here is to provide a structured and well-documented proposal for information containers which represent basic and frequent features of linguistic analysis, and in doing so, to increase the attractiveness of the TEI for those interested in quickly and efficiently creating resources with lightweight linguistic markup, rather than enhancing the feeling that "anything goes".

Reserving some of the generic attributes for linguistic purposes would mean (a) removing them from the general pool of attributes available for non-linguistic uses and (b) cutting off some of the legacy literary resources that already use those attributes.

Avoiding pointers

@ana and @corresp are pointer-based, and a large part of our motivation is to create a processing-friendly and self-contained system, with no need of pulling information from any outside sources.

Consider an example from section 17.4 (v. 3.1.0, generated on 2016-12-15):

<s>
 <w ana="#AT0">The </w>
 <w ana="#NN1">victim</w>
 <w ana="#POS">'s</w>
 <w ana="#NN2">friends </w>
 <w ana="#VVD">told </w>
 <w ana="#NN2">police </w>
 <w ana="#CJT">that </w>
 <w ana="#NP0">Kruger </w>
 <w ana="#VVD">drove </w>
 <w ana="#PRP">into </w>
 <w ana="#AT0">the </w>
 <w ana="#NN1">quarry </w>
 <w ana="#CJC">and </w>
 <w ana="#AV0">never </w>
 <w ana="#VVD">surfaced</w>
</s>

We set aside the absolutely non-linguistic practice of random inclusion of the following whitespace inside <w>, which betrays the made-up status of this example, and focus on the use of @ana here. The pointers in the examples are resolved in an <interpGrp> placed most probably within the TEI Header. For POS tagging this means, on average, minimally about 50 URIs to resolve (50 for the German basic STTS, over 160 for the English CLAWS-8). In practice, however, for efficient further processing, analysis tools would probably just have to be adjusted to not resolve the links but rather pre-parse them by cutting off the leading '#' and treat the rest as atomic values (comprising all the information needed for most scenarios). Thus, the original TEI design would be circumvented for pragmatic reasons.

Next, the question would be where to place the results of further analysis, such as morphosyntactic information. POS and MSD could be packed into a single @ana attribute, probably as a sequence of two URIs (and we put aside the fact that attaching semantics to a position in a chain of URIs inside a single attribute value smells of malpractice): the Polish example of multiple values of @msd is by no means the most extreme one -- the number of combinations is literally in the order of hundreds. Do we expect that all these URIs are resolvable somewhere in the corpus header? And if we do not assume that, but rather treat the leading '#' as a necessity imposed by the datatype of the attribute, it means that in practice we're saying, "well, if you want to encode your corpus in the TEI, we have this kludge here prepared for you, whereby you just need to pretend that these URIs aren't URIs...". Note also that we can only offer this kludge to freshly built corpora that haven't used @ana for any other purpose -- but why build a fresh corpus upon kludges?

Summing up, aside from giving up the proposed homogeneity of feature names and feature values that we focused on in the previous section, @ana could be made to work for the indicated purpose in a project that would (i) clearly narrow its semantics (as we can see in ch. 17, @ana is used for many purposes, including various kinds of literary analysis) and that would (ii) adjust its analysis tools to not resolve the links but rather to pre-parse them by cutting off the leading '#' and treat the rest as atomic values. But this shows exactly how and where our suggestion is superior: first of all, the above scenario works only for corpora that have been designed for (sparse) linguistic markup from the beginning: the attribute @ana is not used for literary analysis there, and processing/analysis tools must be sensitive to the internal make-up of its value. This alone excludes legacy literary corpora and off-the-shelf linguistic tools.
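The difference between the two access patterns can be sketched in Python. The first function implements the pre-parsing workaround described above (cutting off the leading '#' and treating the rest as an atomic value); the second simply reads the proposed @pos. The `<w pos="NN1">` element is a hypothetical re-encoding of one token from the section 17.4 example, and both function names are made up for illustration.

```python
import xml.etree.ElementTree as ET


def pos_from_ana(w):
    """Workaround: treat each '#'-prefixed pointer in @ana as an atomic tag."""
    return [ptr[1:] for ptr in w.get("ana", "").split() if ptr.startswith("#")]


def pos_from_attr(w):
    """Proposed approach: the tag is the attribute value itself."""
    return w.get("pos")


with_ana = ET.fromstring('<w ana="#NN1">victim</w>')
with_pos = ET.fromstring('<w pos="NN1">victim</w>')  # hypothetical re-encoding

print(pos_from_ana(with_ana))   # ['NN1']
print(pos_from_attr(with_pos))  # NN1
```

Note that `pos_from_ana` returns a list because @ana may hold several pointers, and nothing in the value itself says which of them (if any) is a POS tag; that is the ambiguity the dedicated attribute removes.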

Suggested deployment strategy

Minimally, we expect that the only impact visible in the schema will be four new attributes usable for the <w> and <pc> elements. However, because these attributes are now (together with @lemma and @lemmaRef) gathered into a separate class, they will be available to ODD designers for inclusion into other elements. So, for example, people who plan to base their tokenization on the <seg> element will be able to add it to att.linguistic. We suggest adding <seg> to this class in the official customization for "Linguistic Corpora" (available through Roma).

Adoption of this ticket, in whole or in part, should be reflected in the text of the Guidelines, and the present pull request suggests minimal changes to this effect. If the ticket is accepted, we will be happy to supply a proposal for thorough modifications in the "Simple Analytic Mechanisms", preferably in a separate pull request, in case these changes become a subject of extended debate.

The ISO MAF specification envisions two more values for @join, namely "no" and "overlap". In order to preserve full compatibility with ISO MAF, we suggest keeping all the values, and that is how the attribute is defined in the spec.

Illustration

The fragment below comes from "His Majesties declaration: to all his loving subjects, of the causes which moved him to dissolve the last Parliament. Published by His Majesties speciall command", part of TCP Phase I, 1640-1660 texts, published on GitHub by Martin Mueller in the English-Civil-War repository, with some enhancement in the form of the @join attribute:

<p xml:id="A22757e-50">
  <w lemma="publish" pos="j_vn" reg="Published" xml:id="A22757-001-b-0230">Published</w>
  <w lemma="by" pos="acp-p" reg="by" xml:id="A22757-001-b-0240">by</w>
  <w lemma="his" pos="po" reg="his" xml:id="A22757-001-b-0250">His</w>
  <w lemma="majesty" pos="n1g" reg="majesty's" xml:id="A22757-001-b-0260">Majesties</w>
  <w lemma="special" pos="j" reg="special" xml:id="A22757-001-b-0270">speciall</w>
  <w lemma="command" pos="n1" reg="command" xml:id="A22757-001-b-0280" join="right">command</w>
  <pc unit="sentence" xml:id="A22757-001-b-0290">.</pc>
</p>

The next fragment comes from "A most notable and worthy example of an vngratious sonne, who in the pride of his hart denied his owne father and how God for his offence turned his meate into loathsome toades. To the tune of Lord Darley.", part of the EarlyPrints repository.

<l xml:id="mabd-e110">
  <w lemma="to" pos="acp-cs" reg="to" xml:id="mabd-001-a-0880">to</w>
  <w lemma="weep" pos="vvi" reg="weep" xml:id="mabd-001-a-0890">weepe</w>
  <w lemma="and" pos="cc" reg="and" xml:id="mabd-001-a-0900">and</w>
  <w lemma="wring" pos="vvi" reg="wring" xml:id="mabd-001-a-0910">wring</w>
  <w lemma="their" pos="po" reg="their" xml:id="mabd-001-a-0920">their</w>
  <w lemma="hand" pos="n2" reg="hands" join="right" xml:id="mabd-001-a-0930">handes</w>
  <pc unit="sentence" xml:id="mabd-001-a-0940">.</pc>
</l>

The following example comes from 18th century German prose, taken from Blumenbach, Johann Friedrich: Zwo Abhandlungen über die Nutritionskraft. St. Petersburg, 1789; see also the respective page in: Deutsches Textarchiv. (Note that the @join attribute was added to the original data here.)

<p>
  <w lemma="d" pos="ART" reg="Die">Die</w>
  <w lemma="Kürze" pos="NN" reg="Kürze">Kürze</w>
  <w lemma="diese" pos="PDAT" reg="dieser">dieser</w>
  <w lemma="Blatt" pos="NN" reg="Blätter">Blätter</w>
  <w lemma="sein" pos="VAFIN" reg="ist">ist</w>
  <w lemma="wohl" pos="ADV" reg="wohl">wohl</w>
  <w lemma="d" pos="ART" reg="das">das</w>
  <w lemma="gering" pos="ADJA" reg="geringste">geringste</w>
  <pc pos="$," join="left">,</pc>
  <w lemma="was" pos="PRELS" reg="was">was</w>
  <w lemma="ich" pos="PPER" reg="ich">ich</w>
  <w lemma="dabei" pos="PAV" reg="dabei">dabey</w>
  <w lemma="zu" pos="PTKZU" reg="zu">zu</w>
  <lb/>
  <w lemma="entschuldigen" pos="VVINF" reg="entschuldigen">entschuldigen</w>
  <w lemma="haben" pos="VAFIN" reg="habe">habe</w>
  <pc unit="sentence" pos="$." join="left">.</pc>
</p>

The following is an example from 17th century German prose, taken from Friderici, Daniel: Musica Figuralis, Oder Newe Klärliche Richtige/ vnd vorstentliche vnterweisung/ Der SingeKunst. Rostock, 1619; see also the corresponding page in Deutsches Textarchiv.

<p>
  <hi rendition="#b">
    <w lemma="was" pos="PWS" reg="Was">Was</w>
    <w lemma="sein" pos="VAFIN" reg="ist">iſt</w>
    <w lemma="d" pos="ART" reg="die">die</w>
    <w lemma="Musik" pos="NN" reg="Musik">Muſica</w>
    <pc unit="sentence" pos="$." join="left">?</pc>
  </hi>
</p>

eduarddrenth commented 7 years ago

Thanks for this request, which improves linguistic support in TEI. Some challenges remain, especially when querying, presenting and researching TEI material in differently annotated corpora.

1) @pos can have any content (which of course supports all tagsets)
2) @msd groups together several features (person, number, ...) and likewise does not restrict content

Another remark is that in the documentation text I see no examples on @pos or @msd. Perhaps it is good to include a rationale behind the generalistic approach and to show in examples how to deal with the consequences for example when developing tools.

susannehaaf commented 7 years ago

Hi Eduard, thanks for your remarks. One quick note: There are examples on @pos and @msd in the proposed specification of att.linguistic which is also referenced in the "Quick links" section of the current issue.

eduarddrenth commented 7 years ago

Thank you, I was aware of these; I was thinking of more elaborate examples and text in 17.4. But perhaps it is too early for that, because this should be based on experience in practice.

peterstadler commented 7 years ago

Just a quick note that I will prepare this issue (along with the other LingSIG tickets) for the upcoming Council face2face meeting in November.

bansp commented 7 years ago

Thank you, Peter. For what it's worth, here's a quote from Martin Mueller's e-mail sent today to TEI-L (above, we have only referred to a GitHub repository of EarlyPrints; now it has its own site):

Dear Colleague, I’d like to tell you about the first release of the EarlyPrint project at https://drama.earlyprint.org. From a TEI perspective this is a project that adds linguistic annotation to P5 versions of TCP texts and puts them in an application that supports the collaborative curation of the most common textual defects in those texts. The application is a version of TEI Publisher with the addition of an Annotation Module that was built by the eXist team and funded by the Mellon foundation. The linguistic annotation is provided by Phil Burns’ MorphAdorner. We have used the attribute set for lightweight linguistic annotation that Piotr Banski, Susanne Haaf, and I have proposed to the TEI Council. I am an interested party in this but will say anyhow and with a lot of conviction that the combination of @pos, @lemma, @reg, and @join attributes significantly simplifies many processing tasks. This is an early release, and lots of work remains to be done. We’ll be grateful for advice and criticism. Martin Mueller

bansp commented 6 years ago

I have changed the static links above into dynamic links to our continuous integration space, so that it can be seen what effects our proposed changes would have on the current Guidelines (it is still dependent on me keeping the LingSIG repositories up-to-date wrt TEIC/dev, which I try to do as often as I can).

Let me also add that a paper co-authored by myself, Susanne Haaf and Martin Mueller, containing much of the original ticket has been accepted for an oral presentation at the main session of the upcoming LREC conference, approved by three peer reviewers. At the moment when we decided to submit the paper (the original deadline was Sep 25th; our decision to submit was made in late July), we wouldn't even dream of the ticket not getting touched by the Council by mid-September, so our only variable back then was whether we would present a fact under implementation (with the Council's possible suggestions), as TEI advocacy to the language resource community, or whether we would still describe an idea for improvement, with clear motivation. Given all the past months of silence, I dread to think at what stage this ticket may be in May, when we deliver the presentation. Please note, and I am not saying this to exert pressure, but rather with growing concern, that the message that this state of affairs presents to the community is, in my view, becoming close to dramatic.

peterstadler commented 6 years ago

My apologies for not writing earlier. In fact, Council did discuss this (at the very end of the Victoria f2f) with the following comments:

we like the idea of a new class att.linguistic with the initial members @lemma and @lemmaRef
the new attributes @pos and @msd clearly introduce new concepts to the TEI and should be added to the new class att.linguistic
the idea behind the proposed attribute @join is highly appreciated by the Council, i.e. to explicitly mark significant leading or trailing whitespace. Hence, we wondered whether this could be generalizable (under a different name, reflecting its encoding of whitespace) and whether it should go into the new class att.linguistic then
the proposed attribute @reg could not make many friends. Quoting @jamescummings from a post to TEI-L concerning this issue:

most of the discussion was about the proposed @reg (whose name I certainly don't like for historical reasons). I'm sure I would have argued against the reintroduction of a @reg attribute fearing people would abuse this for what <reg> was created for in editorial transcription and negating the whole war on text-bearing attributes and creation of the <choice> element. I know from the ticket that you think imposing use of <choice> creates too much of a burden for regularisation, but you actually argue more in favour of it when you note that the proposed @reg might need to store multi-word sequences... exactly what we don't want in an attribute! Though your @reg attribute issue 2 on that issue seems to ignore that <w> can self nest? Surely that would be the solution for multi-word units needing a single @reg? And I'm not against the introduction of new linguistic attributes, though think this often ignores the power of XML child hierarchies. Personally, I want to avoid the storage of any free text of any sort in any attribute, that is I like attribute values to be strongly tied to processable, checkable, datatypes.

Adding to this, not only does <w> self nest but it can also have <choice> as a child element. So the example "Tagging of highlighting within a word (here: typeface switch from Fraktur to Antiqua)" could be rewritten as

<w lemma="wohlstilisierend" pos="ADJA">
   <choice>
      <orig>Wohl-<hi rendition="#aq">ſtyliſi</hi>rende</orig>
      <reg>Wohl<hi rendition="#aq">stilisie</hi>rende</reg>
   </choice>
</w>

or even

<w lemma="wohlstilisierend" pos="ADJA">
   <choice>
      <orig>Wohl-<hi rendition="#aq">ſtyliſi</hi>rende</orig>
      <reg>Wohlstilisierende</reg>
   </choice>
</w>

and the example "Word syllabification and hyphenation at the end of a page" as

<w>Flecken</w> <w>oder</w> <w>Dorff</w> 
<w>
   <choice>
      <orig>
         deſ-<lb/>
         <fw place="bottom" type="catch">ſelbi-</fw>
         <pb facs="#f0672" n="656"/>
         <fw place="top" type="header">
            <hi rendition="#b">Von der</hi>
            <hi rendition="#aq #i">Præſtan</hi><hi rendition="#b">tz</hi>
            <hi rendition="#b">und Vortreflichkeit</hi>
         </fw><lb/>
         ſelbigen
      </orig>
      <reg>desselbigen</reg>
   </choice>
</w>
<w>Landes</w>

rvdb commented 6 years ago

I second @peterstadler concerning <reg>: the motivation for a simplified @reg attribute in the original proposal seemed misguided:

However, with the usage of //choice/reg|orig for the annotation of regularized word forms, text annotation in specialized corpora may need to be extended significantly since there may already exist further (elaborate) annotation on or below the token level, which gets automatically multiplied with the use of <choice>. If the <choice> machinery is applied, these existing annotations will end up being represented twice: once in <reg> and once in <orig>.

Instead, <reg> should just contain a regularized form, so there's no requirement at all to repeat the markup of <orig>.

Additionally (speaking from my experience with eXist-db, but I guess this will hold for other XML databases), I don't see how //w/@reg could make life easier when one wants to search for a sequence of regularized forms, since attributes will most likely be indexed as discrete nodes by XML databases. On the other hand, //choice/(reg|orig) allows for much more flexible control when constructing search indexes. In eXist, it's trivial to create an "original" index (by excluding //choice/reg) and a "regularized" index (by excluding //choice/orig) and use either when querying a corpus.
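
To make the "two views" idea concrete, here is a minimal Python sketch. It is not eXist configuration, only an illustration of deriving an "original" and a "regularized" reading from the same `<w>` by skipping one branch of `<choice>`. The helper name `text_view` is hypothetical, and the TEI namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Sample taken from the example above (TEI namespace left out for brevity).
SAMPLE = (
    '<w lemma="wohlstilisierend" pos="ADJA"><choice>'
    '<orig>Wohl-<hi rendition="#aq">ſtyliſi</hi>rende</orig>'
    '<reg>Wohlstilisierende</reg>'
    '</choice></w>'
)

def text_view(elem, skip):
    """Concatenate the text of `elem`, ignoring any subtree whose tag is `skip`.

    Hypothetical helper: skip="reg" yields the original reading,
    skip="orig" yields the regularized reading.
    """
    if elem.tag == skip:
        return ""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(text_view(child, skip))
        # Tail text belongs to the parent, so keep it even if the child is skipped.
        parts.append(child.tail or "")
    return "".join(parts)

w = ET.fromstring(SAMPLE)
original = text_view(w, skip="reg")
regularized = text_view(w, skip="orig")
print(original)
print(regularized)
```

Note that the markup inside `<orig>` (the `<hi>` here) does not need to be repeated in `<reg>` for either view to work.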

bansp commented 6 years ago

Thanks, @peterstadler, @jamescummings and @rvdb. The technological issues you raise made our group decide to withdraw @reg from this request, while we review your arguments and potentially get back to the Council with a new, atomic and focused request. (Side remark concerning something James mentioned: @lemma and @reg are by definition not free text. Let's please refrain from continuing this thread here; we promise to get back to this in that separate FR.)

I have modified the pull request to remove @reg from it. I am sure that Mr. Jenkins will soon come up with the modified proposed text of the Guidelines and class documentation (links at the top).

We are only left with somewhat unclear feedback on @join. Personal preferences regarding attribute names aside, what is at stake here is the ISO+TEI common strategy. The @join attribute has been standardized by ISO and adopting it for the Guidelines comes at no cost to the Council, because it is not the Council who are going to be blamed for its potential inadequacies. It would also be costless in terms of effort to keep it within att.linguistic, because the alternative means either confining it to <w> directly or searching for another suitable class. The former is bad practice and customization-unfriendly, while the latter would require searching for a suitable class and motivating that choice. That being said, we are OK with that attribute being added to some other sensible class, if the Council wishes to make the effort and if that will not delay the enterprise by another six months. We are also OK with trimming the value set down to "left", "right" and "both", if the Council decides that it can't live with all the ISO-defined values being used by the TEI. We are also OK with it getting renamed if that pleases the Council.

The optimal course of action from our perspective would be as follows:

the Council accepts the pull request (without @reg)
we create a new pull request just for the AI chapter, to include some discussion and a few examples from the ticket and from the att.linguistic documentation (that one would hopefully only require cosmetic changes to the prose suggested by us).

We hereby request further action by the Council on this ticket, without unnecessary delay.

bansp commented 6 years ago

One more note on @join: while it feels not as "linguistic" as e.g. @pos, its functionality has been recognized as necessary and has wide use in language technology contexts:

in the National Corpus of Polish (starting in 2008), we used @nkjp:nps for "no preceding space" to remedy the loss of information in standoff annotation
ISO MAF, from which the proposed name and definition are taken, has been in use since 2012
Universal Dependencies in the CoNLL-U format use a feature called SpaceAfter, mixed in with morphosyntactic features.

This can hopefully form sufficient motivation to keep @join together with the other attributes proposed for att.linguistic. At the same time, @join as defined in MAF offers support to the widest range of possibilities.
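
As a rough illustration of what @join buys in non-mixed-content contexts, here is a minimal Python sketch that rebuilds a surface string from a token list. It assumes that "left" means no whitespace before the token, "right" means no whitespace after it, and "both" means neither. The function name `detokenize` and the sample tokens are hypothetical, and only the "left"/"right"/"both" values mentioned in this thread are handled.

```python
def detokenize(tokens):
    """Rebuild text from (form, join) pairs.

    Assumed semantics: join="left" glues the token to its left neighbour,
    join="right" to its right neighbour, join="both" to both sides.
    None or "no" means ordinary whitespace separation.
    """
    out = []
    glue_next = False  # set when the previous token had join="right"/"both"
    for form, join in tokens:
        glued_left = join in ("left", "both")
        if out and not (glue_next or glued_left):
            out.append(" ")
        out.append(form)
        glue_next = join in ("right", "both")
    return "".join(out)

# Hypothetical token sequence, e.g. as read from <w>/<pc> elements.
tokens = [
    ("(", "right"),
    ("see", None),
    ("above", None),
    (")", "left"),
    (",", "left"),
    ("then", None),
]
text = detokenize(tokens)
print(text)
```

In this sketch, either neighbour can suppress the whitespace, so a token with join="right" followed by an unmarked token is still glued.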

In reference to a fragment from James's e-mail message "I think I wondered what happens if two adjacent words have some form of conflicting @join, i.e. is this an error", let me state the obvious, that given how rich the range of possibilities offered by the TEI is, it is trivial to create internally conflicting descriptions at any level (and you don't need the TEI for that, words are enough...). It would be unrealistic and eventually harmful to try to limit the descriptive coverage of a framework that is built for customization only because someone could create an incoherent description in it when using the full permitted range of options. Yeah, they could, but they will be responsible for that, not the TEI.

bansp commented 6 years ago

Please note: Mr. Jenkins doesn't like our branches, which is why the documentation still shows @reg. It is set to the latest stable build, and that build is over a month old. The pull request is not what is causing the instability; it rather has something to do with Debian directories, and maybe some fixes made to the TEI-C Jenkins have not yet been copied to ours. If you examine the diff inside the pull request, you will see that @reg is no longer there.

peterstadler commented 6 years ago

Thanks @bansp for updating the PR. We currently have some trouble with the build process, but as soon as everything is back to stable I will merge the current PR #1671. If interest arises in a more generic @join attribute, this should be a separate feature request.

eduarddrenth commented 6 years ago

thnx!

In the meantime, I have made quite some progress with https://bitbucket.org/fryske-akademy/tei-encoding

On 26 Jul 2017 at 3:05 p.m., "Susanne Haaf" notifications@github.com wrote:

Hi Eduard, thanks for your remarks. One quick note: There are examples on @pos and @msd in the proposed specification of att.linguistic https://lingsig.github.io/wordAttributes/html/ref-att.linguistic.html which is also referenced in the "Quick links" section of the current issue.


bansp commented 6 years ago

Thank you, @peterstadler and Council!
