TEIC / TEI

The Text Encoding Initiative Guidelines
https://www.tei-c.org

add <w> to att.lexicographic #1776

Closed iljackb closed 4 years ago

iljackb commented 6 years ago

Just as there are well-established uses of the attributes native to att.lexicographic within the dictionary module, identical use cases for these attributes arise in the development of a text corpus. Currently, for lack of a sufficient alternative, these require a customized solution in the TEI.

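When building a corpus using <w>, the forms within these tokens may need to be normalized. Accepting this proposal would let users do this in one of two ways: by normalizing the element content and recording the original form in @orig, or by keeping the original form as content and recording the normalized form in @norm.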

Currently there is no attribute available on <w> which serves this function and the only feature available in the TEI at all is <orig> which does nothing about representing normalized forms either as an element or attribute value.

A specific use case for this proposal is a language documentation project for the Mixtepec-Mixtec language (iso 639: 'mix'), which is an under-resourced language with a very small body of published text booklets for children. In addition to these, new texts written by the project's native speaker consultants make up the core of the project's written material, and together with transcribed speech, these resources form the basis of the TEI corpus being produced. In dealing with such data there are three main factors which give rise to the need for the features in question:

  1. the orthography is still undergoing changes (by a group from SIL Mexico), thus some texts have old spellings;
  2. the spelling conventions are not well known by speakers, leading to a need for significant corrections;
  3. there are also potential instances of sub-dialectal, and/or idiolectal vocabulary use (which we want to keep somewhere while also providing a normalized form for search and retrieval purposes);

Regarding usage case (1); in the Mixtepec-Mixtec language, the lexical items meaning 'when' and 'what' were formerly both written orthographically as "nchii", and both appeared as such in earlier publications (phonologically they are a minimal pair based on tone: [nd͜ʒiː˥] vs [nd͜ʒiː˥˩]). Given the need to distinguish these items further, as they cannot reliably be told apart from context, the word for 'when' retained the spelling "nchii" and 'what' was changed to "nchi". In the TEI corpus, instances of the old spelling of 'what' were re-encoded as:

              `<w xml:id="d1e163" orig="Nchii">Nchi</w>`

Regarding usage case (2); where speakers spell something incorrectly but we would like to preserve it for any number of reasons, the use of @orig is essential. It could be useful both for speakers, to see past mistakes, and for researchers, to get insight into how untrained speakers write their language instinctively (in contrast to prescribed convention), etc.:

              `<w xml:id="d1e1435" orig="ntsa sia'i">ntsasia'i</w>`

             Side note: I could imagine `@split` (also att.lexicographic) might be used here instead of, or possibly in addition to, `@orig` to delineate the morphological sub-components (which correspond to the segmentation of the speaker's original written form)

Regarding usage case (3); although our speaker consultants are undisputedly part of the Mixtepec-Mixtec area, our speakers are from a small village of only several hundred people a significant distance from the other main populated areas, and the question of whether there are any significant lexical variations between this place and the greater population is not entirely settled. In fact there are certain tendencies demonstrated by at least one speaker that may be candidates for further exploration. Additionally, this particular speaker is less exposed to the language every day and doesn't live in the language region of origin and so these tendencies could be due to idiolectal differences which also may be of use for future socio-linguistic topics.

              `<w xml:id="d1e2363" orig="intu'u">ntu'u</w>`

Finally, note that in our project, given that the body of written language is so small and there is an urgent need to establish a significant body of written text that is consistent, our editorial practice prefers normalization of the element contents and recording of the original in @orig. However, in any of these three cases, depending on editorial preferences, this could have been done the inverse way, i.e. giving precedence to the preservation of the original texts and placing the normalized form in the attribute @norm, e.g.

        (1)
              `<w xml:id="d1e163" norm="Nchi">Nchii</w>`
        (2)
              `<w xml:id="d1e1433" norm="ntsasia'i">
                   <w xml:id="d1e1434">ntsa</w>               
                   <w xml:id="d1e1435">sia'i</w>
                </w>`
         (3)
              `<w xml:id="d1e2363" norm="ntu'u">intu'u</w>`
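
For anyone wanting to consume either convention with the same tooling, here is a minimal sketch in Python (standard library only). The helper names `normalized_form` and `original_form` are made up for illustration, and the snippets are the un-namespaced examples from this comment, not a full TEI document.

```python
import xml.etree.ElementTree as ET

def _content(w):
    # Text content of <w>, including nested <w> children, whitespace collapsed.
    return " ".join("".join(w.itertext()).split())

def normalized_form(w):
    # Hypothetical helper: @norm wins if present, otherwise the content is
    # assumed to be the normalized form already.
    return w.get("norm") or _content(w)

def original_form(w):
    # Hypothetical helper: @orig wins if present, otherwise the content is
    # assumed to be the original form.
    return w.get("orig") or _content(w)

samples = [
    '<w xml:id="d1e163" orig="Nchii">Nchi</w>',
    '<w xml:id="d1e163" norm="Nchi">Nchii</w>',
    '''<w xml:id="d1e1433" norm="ntsasia'i"><w xml:id="d1e1434">ntsa</w> <w xml:id="d1e1435">sia'i</w></w>''',
]

for xml in samples:
    w = ET.fromstring(xml)
    print(normalized_form(w), "|", original_form(w))
```

Whichever editorial preference a project follows, a search index could be built from `normalized_form` and a diplomatic view from `original_form`.
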
bansp commented 6 years ago

I wish I had seen this request earlier. Do I assume correctly that the core of this request is that @orig and @norm be available to more than just lexicographic items? That could also follow if the two were separated into, say, att.normalize. This potential new class could then be used by att.linguistic and in this way, the two attributes would make their way into <w> and <pc> (because the latter also needs to deal with normalization issues).

(I found this ticket by searching for the ticket suggesting that @norm be moved to att.global. Would someone kindly reference that ticket here, if I hadn't imagined it? Clearly, there is a momentum here worth exploiting. <w> needs @norm badly, and we didn't dare suggest that, for symmetry, @orig would be nice to have there as well -- because some projects concentrate on the normalized side, while longing to record the source.)

lb42 commented 6 years ago

I just note that this reverses a sensible decision reluctantly taken long ago during the war on attributes. In particular it seems very likely that the value of @orig might need to contain markup constructs such as <g> or <hi> which you could not supply.
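
To make the constraint concrete, here is a small Python check (standard library only). The strings are invented for illustration; `is_well_formed` is a hypothetical helper, not part of any TEI tooling.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml):
    # True if the string parses as XML, False if the parser rejects it.
    try:
        ET.fromstring(xml)
        return True
    except ET.ParseError:
        return False

# Markup inside an attribute value is not well-formed XML ...
in_attribute = '<w orig="a<hi>b</hi>c">abc</w>'
# ... whereas the same markup is fine as element content.
in_content = '<orig>a<hi>b</hi>c</orig>'

print(is_well_formed(in_attribute))  # False
print(is_well_formed(in_content))    # True
```
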

martindholmes commented 6 years ago

Both James and I sadly noted exactly the same thing. :-)

bansp commented 6 years ago

Let me take a stab at that, with apologies to Martin for repeating some statements from a discussion earlier today.

The above is something that corpus creators are rather acutely aware of. The attributes that Jack mentions are therefore not meant for any and all source forms or normalized forms, but rather for a relatively precise subset of them -- those that can be used to provide information at the level of <w>. It is also true that a beautiful way to handle such cases would involve many more elements, more complex structures, and probably a lot of links. The thing in this case, and similar cases, is that we're not aiming at something beautiful, but rather for something practical, open to manipulation by tools that process a sequence of <w> elements and need to find the relevant information locally. Very often (unlike the case that Jack mentions but increasingly so in many similar projects), the amount of data to be processed also plays an enormous role -- there's no processing power to follow all the links and disassemble beautiful structures when you're up against gigabytes of data.
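
As a rough illustration of what "finding the relevant information locally" can look like, here is a sketch of a streaming pass over a sequence of <w> elements in Python (standard library only). The function name `stream_normalized`, the `<s>` wrapper, and the flat (non-nested) sample are assumptions made for the example; a real corpus pipeline would differ.

```python
import io
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def stream_normalized(source):
    """Yield (xml:id, normalized form) for each <w>, one element at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Compare the local name so namespaced and un-namespaced input both work.
        if elem.tag.rsplit("}", 1)[-1] == "w":
            # Everything needed sits on the element itself: no links to follow.
            yield elem.get(XML_ID), elem.get("norm") or (elem.text or "")
            elem.clear()  # drop the element's content once it has been used

sample = io.BytesIO(b'''<s>
  <w xml:id="d1e163" norm="Nchi">Nchii</w>
  <w xml:id="d1e2363" orig="intu'u">ntu'u</w>
</s>''')

for wid, form in stream_normalized(sample):
    print(wid, form)
```

The point of the sketch is only that each token can be resolved from its own attributes and content, which is what makes this shape of encoding cheap to process at scale.
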

Technological issues aside, there's also a strand of argumentation concerning cases like these and many others that adopts a stance of a somewhat irritated Ubercreator saying "nah, I won't allow this in the schema because a novice encoder might produce utter gibberish if that were allowed". Newsflash: a novice encoder may produce gibberish out of nearly anything, and innovation is often born where there's freedom to follow new goals and describe new data (otherwise, part of the forced "innovation" in this case may turn out to be adoption of a different XML encoding format or cooking up your own).

Summing up: Jack here and others elsewhere are not saying "abandon the old ways and henceforth follow our new solution exclusively". We're saying: "for a precisely delimited set of cases, and under relatively tight technological constraints, adopt this if you're sure you know what you're doing". In the spirit of the TEI being a toolkit for creating schemas, we propose a well-described set of advanced components for specialized users.


Let me also reference issue #1670 for more examples of where the need for @norm is dear (it's called @reg there). Various issues concerning the limits of analogical kind of annotation are mentioned/discussed in a recent LREC paper by Martin Mueller, Susanne Haaf and myself. We are also going to talk about this at the upcoming LingSIG meeting.

lb42 commented 6 years ago

I believe that xml:lang specifies the language both for element content and for some attributes (depending on their datatype). I don't think XML allows you to change that rule, so if you want to give your attribute attributes (as it were) you're on your own.

sydb commented 6 years ago

Not sure which W3C standard you’re referring to, @lb42. My vague recollection is that the W3C had never considered characters outside of Unicode, and was happy to give you enough rope to hang yourself with if you put text that might need markup or language identification in your attributes. But I may well be mis-remembering.

But no matter. The argument against this ticket would be a lot stronger if the proposers were asking us to add a new attribute that violated the principles over which the War on Attributes was fought. But they’re not, as @norm and @orig already exist on quite a few elements. So they are just asking to tweak the peace treaty. Thus my instinct is to: a) agree to the proposal, add <w> to att.lexicographic; and b) use this as an opportunity to add a health warning to @orig and @norm.

My initial thought is that the health warning should comprise both a short warning in the <remarks> of each attribute and a longer discussion that includes alternate encodings that allow language identification, highlighting, and characters outside Unicode; and that the former should point to the latter.

lb42 commented 6 years ago

@norm and @orig don't exist on any non-lexicographic elements, do they?

sydb commented 6 years ago

Nope; just att.lexicographic.

raffazizzi commented 5 years ago

F2F subgroup agrees with @sydb: making <w> a member of att.lexicographic is acceptable, but the desc and remarks for @norm and @orig should be changed to clarify that they are not to be used outside of a lexicographic context, and should point to <orig> and <reg> for other uses.

tuurma commented 5 years ago

@tuurma and @ebeshero to propose rewording of GL to make very clear when it's acceptable to use text-bearing attributes

ebeshero commented 5 years ago

Council agrees in discussion with the proposal, but wants to implement a strong warning in the Guidelines, especially for the attributes bearing the teidata.text datatype. The warning would clarify that @norm and @orig are not the same as their element counterparts and have a precise linguistic use. Wording might be, "The attributes in this class are meant for lexicographic use and are not intended to be used for editorial interventions."

bansp commented 4 years ago

I have submitted ticket #1973 that essentially does what I described in a comment above: introduces a separate attribute class to hold the two attributes in check, with a warning against abuse that the Council decided to add (see Elisa's comment immediately above). The ticket is accompanied by bits of documentation and it would be great if the Council could review it whenever convenient.

Thanks in advance for considering that -- enabling the use of @orig and @norm on <w> and <pc> would not only allow projects such as EarlyPrint to become 'legal', but would also allow work on developing a complete TEI serialization of the ISO MAF ("Morphosyntactic Annotation Framework") standard. Cheers!

duncdrum commented 4 years ago

If there is interest, I'd be happy to contribute a CJK example that avoids the use of <g> inside an attribute value.

bansp commented 4 years ago

I think that such an example would definitely have general value also outside of this narrowly scoped discussion.

ebeshero commented 4 years ago

Council greenlights this with some modifications to @bansp's pull request: to add <w> to att.lexicographic.normalized and to apply cautionary language. (We'd want to create a subclass of att.lexicographic; ask @bansp to modify the pull request accordingly.)

bansp commented 4 years ago

That is now done and merged. I do hope that @iljackb will find the result satisfactory!

iljackb commented 4 years ago

Yes I do indeed find it satisfactory!

Thanks to everyone, especially Piotr @bansp for seeing this through!

martinascholger commented 4 years ago

So, I'm closing this. Thanks @bansp!