Creating Orthography Profiles

LinguList commented 7 years ago

I try to teach @nh36 and @amritavira to advance on the orthography profiles by correcting them themselves. I understand that what I have been dong so far was not directly explicit. So please let's use this issue to discuss this. I'll try to explain as properly as possible.

First, note that an orthography profile is nothing else than a spreadsheet, an excel or an open office file. Our github-format is "tsv", but this is just a representation, you can easily load them in your favourite spreadsheet editor. We usually have several columns in these files, and the first line, the header, tells you what the column is about. Look at orthography.ts by Burling1967 for an example. There, you find the following columns (column name first, explanation afterwards):

Grapheme, the grapheme combination as found in the data (may be scrambled, so in case of doubt, please check with the original, reason for scrambling is a software feature, that should be adapted at some point by me)
Segments, the way we want to represent the data later, so we can compare ACROSS datasets, basically a transcription following IPA suggestions, but with some freedom (see below)
Structure, the prosodic information on the Graphemes, where we distinguish a basic template of "i m n N c t" (initial, medial, nucleus, second vowel, coda, tone). This is crucial for comparison, and all segments defined here should directly correspond to the Segments column (see below)
Unicode, internal information for us to see whether strange unicodes are employed, no need to worry about
Frequency, a working-info for us to see how frequent we encounter a given sub-sequence (important to identify errors or idiosyncratic stuff)
Reflexes, the languages in which the subsequence in Graphemes is encountered. Useful for checking with the data

The profile is always created first automatically from a file always called "words.tsv". In case of Burling, this is this file). Please open this file when working on a profile, and if you wonder about strange sub-sequences in Graphemes column, copy and search where it occurs. Use the information on Reflexes and Frequency to identify those segments. If you don't find them, add a column Notes to the profile and write in there in some regular fashion that you don't find it (feel free to make use of the Notes column also for other stuff).

Segmentation is based on space characters (`) as separator. So the program will split the string whenever there are spaces, and if you add a couple of spaces in the Segments column, you should add the same number of spaces in the column Structure and make sure you follow the template. If you encounter words which are more than mono-syllabic, add a+` character (with spaces!) to indicate this as well, both in Segments and in Structure.

The main work of creating ortho-profiles is correcting the Segments and the Structure, while the rest can be left untouched. As an example, consider the line 15 in the ortho-profile. Here, we have yiŋ in Graphemes, and y iŋ in Segments, and n c in Structure. This is of course wrong, as correctly, the row should show:

Graphemes: yiŋ, ! leave it untouched !
Segments: j i ŋ, assuming that y denotes j, as everywhere in Burling
Structure: i n c, assuming that j is initial of the syllable, i is the nucleus, and ŋ is the coda.

In Burling, there's a further problem with tones. I don't know how they are annotated for certain cases, but Burling uses accent marks. We want to separate them and annotate them correctly in our template. Ideally, you find out, which tones they refer to, from the source. But if this is not the case, you can add shortcuts. We have the convention in our annotation that if we do not know the real phonetic value (like, e.g., ⁵⁵ for high tone), we can use the slash / in a segment to separate a discrimination value from some phonetic value. The phonetic value can stay empty. Thus, when checking line 34, for example, the acute accent could be interpreted as our tone ¹ in Atsi. We don't know how it is pronounced, so we write ¹/. If we knew the value, we would write it after the slahsh as, e.g., ¹/⁵⁵. Fore the whole line, we would thus write as follows:

Graphemes: úŋk
Segments: u ŋk ¹/ (or u ŋk ¹/⁵⁵ if you know this is high tone)
Structure: n c t

This is crucial to render the data comparable to other sources. Make sure to keep the slash-notation correct as source/target, and think of laryngeals, for example, where we'd write p h₂/ə t eː r for the word for "father" in PIE.

There are more tricky cases, but it's for you to figure out what Burling meant. Line 39 for example gives: ùmk for Akha. You could argue that the m is just nasalization. If this is the case, you should write it as ũ k ¹/, assuming that grave accent is tone ¹.

One more example is line 139. Here, you need to add a morpheme marker:

Grapheme: àyɛ̀ʔ
Segments: a ²/ + ɛ ʔ ²/
Structure: n t + n c t

So you help the computer to see that these are two syllables.

LinguList commented 7 years ago

As a general rule, keep this in mind: column Graphemes is what we find in the source, colum Segments shows how we want to render it, and column Structure tells us what the segments do in the sequence. You'll see that the automatic approach is already not that bad (consider Mann1998, where I did the profile in less than an hour). But on data like Burling, more manual correction is indispensable.

LinguList commented 7 years ago

I just assigned this issue to @nh36 and @amritavira to indicate that you can close this issue when you think there are no open qs left.

digling / burmish

Creating Orthography Profiles #93