cldf-clts / pyclts

Apache License 2.0

CLTS knows the "ultra-long" length, but pyclts only parses it for a closed list of vowels. #45

Open XachaB opened 3 years ago

XachaB commented 3 years ago

Hi,

I saw that CLTS knows the "ultra-long" length:

https://github.com/cldf-clts/clts/blob/d52aa50ec524b590f7715334b58c89b839dc585b/pkg/transcriptionsystems/features.json#L65

However, pyclts only parses it for a closed list of vowels, which are coded exhaustively in the vowels file, e.g.:

https://github.com/cldf-clts/clts/blob/cccee296b1e54e653e1b4bea103bf0e870072765/pkg/transcriptionsystems/bipa/vowels.tsv#L52

I am working with Nuer right now, where morphological contrasts can combine tone, three levels of length, vowel quality, and breathiness, which leads to many attested combinations. When diacritics other than length are involved, pyclts incorrectly parses "ultra-long" as if it were "long":

import pyclts
clts = pyclts.CLTS()
o1 = clts.bipa["oːː"]
print(o1, o1.featuredict["duration"])
o2 = clts.bipa["o̤ːː"]
print(o2, o2.featuredict["duration"])
o3 = clts.bipa["ó̤ːː"]
print(o3, o3.featuredict["duration"])

outputs:

oːː ultra-long
o̤ː long
ó̤ː long

I saw that the ultra-long diacritic is not in the diacritics file:

https://github.com/cldf-clts/clts/blob/cccee296b1e54e653e1b4bea103bf0e870072765/pkg/transcriptionsystems/bipa/diacritics.tsv#L82

However, adding a row with a double "ː" to that file is not enough, from which I would guess that the parser does not allow arbitrary combinations of a sound with its compatible diacritics.

Is this intended behavior? I understand that for some applications, losing such fine-grained sound resolution might not matter. For morphology, where I was hoping to use CLTS as a parser (to obtain featural definitions from grapheme sequences) and where I want to trust the data sources, the contrast between, for example, long and ultra-long is sometimes crucial.

XachaB commented 3 years ago

Narrowing it down, it looks like the cause is the assumption, here, that diacritics are a single character long: the parser iterates over each remaining diacritic character individually to check for a match (the same happens with diacritics before the sound, a few lines above):

https://github.com/cldf-clts/pyclts/blob/c7420d9c59122210422ae249f00a37284cda98c8/src/pyclts/transcriptionsystem.py#L284

An alternative could be to segment diacritics using a regex built from the entries in the diacritics file.
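To make the difference concrete, here is a hedged, self-contained sketch (a toy diacritic inventory, not the actual pyclts code or its real diacritics table): character-by-character matching can never recognize the two-character "ːː" mark, while a longest-first match over the known diacritics can.

```python
import re

# Assumed toy inventory; "ːː" (ultra-long) is the only multi-character entry.
DIACRITICS = ["ːː", "ː", "\u0324", "\u0301"]  # ultra-long, long, breathy, high tone

def parse_char_by_char(grapheme, base):
    """Mimics the single-character assumption: each remaining character is
    checked on its own, so the two-character "ːː" can never match as a unit."""
    rest = grapheme[len(base):]
    return [c for c in rest if c in DIACRITICS]

def parse_longest_first(grapheme, base):
    """Match the remaining string against the inventory, longest entries first."""
    pattern = re.compile(
        "|".join(sorted(map(re.escape, DIACRITICS), key=len, reverse=True))
    )
    return pattern.findall(grapheme[len(base):])

print(parse_char_by_char("oːː", "o"))   # ['ː', 'ː'] – two separate "long" marks
print(parse_longest_first("oːː", "o"))  # ['ːː'] – a single "ultra-long" mark
```

Sorting the alternatives longest-first matters: Python's regex alternation is ordered, so if "ː" were tried before "ːː", the ultra-long mark would again be split into two long marks.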

XachaB commented 3 years ago

See PR #46 for an example of what parsing multi-character diacritics could look like. If something like this is accepted, then adding a row for ultra-long to the bipa diacritics in CLTS would be the next step.

XachaB commented 3 years ago

@xrotwang, @LinguList : could someone have a look at this?

LinguList commented 3 years ago

@XachaB, modifying one part of the parsing procedure always has larger consequences for other parts as well. I prefer to code the vowels with ultra-long diacritics explicitly, in our vowels.tsv file. This always works, and it has the advantage of not fiddling with the code.

LinguList commented 3 years ago

So one would only have to assemble all ultra-long sounds (which should not be many) and add them to the vowels.tsv file in the clts package.

XachaB commented 3 years ago

I understand not wanting to touch the parser, for fear of breaking something else (though there are tests, and they do all pass).

Is there a more general reason not to allow combinations of diacritics which are more than one character long with their compatible sounds? Currently the parser does allow arbitrary combinations of diacritics, as long as each diacritic is one character long. The distinction seems to me to be purely an implementation detail rather than a feature.

Arguing further in favor of allowing all combinations, I would add that a language could display any combination of other vowel diacritics with this over-long length. Maintaining a closed list means that one needs to know in advance all of the possible graphemes which users might want to use. For all cases but multi-character diacritics, CLTS is able to parse previously unseen, valid BIPA graphemes (that is, combinations of a C/V with compatible diacritics).

For my own purposes, for now, I only need vowels and consonants which exist in Nuer, Dinka or Estonian.

The vowels which need to be added for this are:

aːː
a̤ːː
e̤ːː
i̤ːː
o̤ːː
æ̤ːː
ɑːː
ɔːː
ɔ̤ːː
ə̤ːː
ɛ̤ːː
ɤːː
ṳːː

Estonian also has the overlong consonants:

fːː
hːː
kːː
lːː
mːː
nːː
pːː
rːː
sːː
ʃːː
tːː
vːː

But of course, these overlong sounds are not the only ones that exist in the world. For example, I think Wichita has ɪːː, and I imagine there are others which I do not know about.
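If the exhaustive-list route is taken, the rows could be generated rather than typed by hand. A sketch with hypothetical two-column output (the real vowels.tsv has more columns and uses CLTS feature names, so both the layout and the names below are placeholders):

```python
# Base vowels needing plain ultra-long and breathy ultra-long rows
# (taken from the lists above).
ULTRA_LONG_VOWELS = ["a", "ɑ", "ɔ", "ɤ"]
BREATHY_ULTRA_LONG = ["a", "e", "i", "o", "æ", "ɔ", "ə", "ɛ", "u"]

BREATHY = "\u0324"  # combining diaeresis below (breathy voice)

rows = []
for v in ULTRA_LONG_VOWELS:
    rows.append((v + "ːː", f"ultra-long {v}-like vowel"))
for v in BREATHY_ULTRA_LONG:
    rows.append((v + BREATHY + "ːː", f"breathy ultra-long {v}-like vowel"))

# Print tab-separated rows, ready to paste into a TSV file.
for grapheme, name in rows:
    print(f"{grapheme}\t{name}")
```

Building the graphemes from combining characters also makes the Unicode composition explicit: "o̤ːː" is the base "o", then U+0324, then two U+02D0 length marks.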

LinguList commented 3 years ago

Sorry, I did not see that you had made a PR. See my comment there. I can only get back to this next week, but will then check; most likely I'll follow your proposal, but I want to understand it in detail first.

XachaB commented 3 years ago

I see now that it doesn't pass all tests; I can check why and try to fix that later on. I'll also look at your comments there.

XachaB commented 3 years ago

Ok, the discussion on the PR concluded that the parser should not be changed. As a fix, I made the two PRs above, which introduce more ultra-long sounds and breathy ultra-long sounds in the respective CLTS BIPA files.

Still, this does not completely solve the problem. For example, if a user uses an alias for tone and specifies, e.g., "ó̤ːː" (Nuer has tones, three levels of length, and breathiness), then this will still be mis-recognized as "ó̤ː".

I am hesitant, however, to propose all combinations of tones for all ultra-long Nuer vowels, breathy and non-breathy, as we are starting to get into a large number of sounds. I know that for lingpy and lexibank you prefer noting tones after syllables using the number system; but writing tones with diacritics is also a widespread practice, and might make a lot of sense for some specific applications (I am again thinking of morphology, where it can be very useful to mark tone as a supra-segmental feature of the vowel).

How should we proceed?

xrotwang commented 3 years ago

I'd say that a "widespread practice to write tones using diacritics" sounds a lot like the use case for shareable orthography profiles - i.e. transparent pre-processing of the data.

XachaB commented 3 years ago

That works if you want to "normalize" to the practice of marking tones separately as numbers. However, depending on the task or application, it might make more sense to process tones as diacritics on segments, in which case I would not want to pre-process them into what CLTS prefers (for reasons that are specific to lingpy, if I understand correctly). In that case, it would be too bad if my choice were either a representation that isn't quite right for my task, or not using CLTS at all.

XachaB commented 3 years ago

Note that I am not asking for a general change of behavior in pyCLTS. pyCLTS already correctly parses tones as diacritics:

>>> import pyclts
>>> bipa = pyclts.CLTS().bipa
>>> print(bipa["à"], bipa["à"].featuredict["tone"])
à with-low_tone

Except in this edge case of the ultra-long diacritic. And in this case, what is incorrect is not the parsing of tones; it is the parsing of ultra-long, which gets reduced to long:

>>> print(bipa["àːː"], bipa["àːː"].featuredict["tone"])
àː with-low_tone

LinguList commented 3 years ago

@XachaB, my experience with hand-annotating cognates for alignments is maybe useful here, as we also work on African languages and SEA languages. By now, I have come to the conclusion, also in discussions with colleagues, that tone should be listed in an extra tier, another sequence, that mirrors the syllable and tonal structure.

LinguList commented 3 years ago

As a result of this decision, which has grown over the years, I am strictly against on-vowel annotation, which I consider fruitless; for this reason, it is also not really supported well in CLTS. Instead, we follow the common practice of many linguists to put tone aside in initial analyses of African languages (and therefore not mark it at all in lexibank datasets, but keep track of it with the slash-construct á/a), and recommend that people really start annotating tone in the form of tiers.

LinguList commented 3 years ago

So in the tier case, you'd have a sequence k o ŋ o m u and a tonal tier that would, e.g., be 1 1 2 2 1 1, indicating the individual tones in which the segments occur. This also reflects a proposal of Hoenigswald to explain Grimm's/Verner's law, where he labels accented consonants differently from non-accented consonants (or pre-accented, however one wants to model it).
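The two-tier idea can be sketched in a few lines of Python, using the example above (the representation as parallel equal-length lists is an assumption for illustration, not lingpy's actual data structure):

```python
# Two parallel tiers of equal length: one position per segment.
segments = ["k", "o", "ŋ", "o", "m", "u"]
tones    = ["1", "1", "2", "2", "1", "1"]
assert len(segments) == len(tones), "tiers must be aligned position by position"

# Each segment is paired with the tone of the syllable it occurs in,
# so consonants carry tonal information too (cf. the Hoenigswald point).
annotated = list(zip(segments, tones))
print(annotated)
```

Because consonants get a tone label as well, "accented" and "non-accented" consonants are distinguishable directly from the tier, without any on-vowel diacritics.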

LinguList commented 3 years ago

My proposed solution for these high-tone cases would be to list them exhaustively, if one wants to do so. But note that á is far from unambiguous as an annotation, since it means so many different things across transcription systems and is often confused. For this reason as well, I'd rather explicitly clean all data of tones for now and restrict analyses involving tones to explicit cases where one can really control for them, i.e., small datasets which one annotates etymologically in good faith to uncover sound laws.

XachaB commented 3 years ago

Thanks for the long answer.

On tones specifically:

I am really interested in the proposal for multi-tiered sequences. This is a recurrent problem, which I find particularly difficult to get right. Having the input already specify the full tonal tier indeed sounds like a really good thing. Do you also require some information on syllable structure? I know you have tools in lingpy to guess it, but of course it can't match expert judgement perfectly. E.g., shouldn't there be a difference in the tonal tier between b a b a 1 1 1 1 (two syllables with the same tone) and b a k s 1 1 1 1 (a single syllable)? Do you then treat each sequence completely independently, or do you have an implementation for truly multi-tiered sequences? If you do, I am extremely interested in seeing how it works and what sorts of manipulations it allows. This is off-topic for this particular issue, but I am very curious to learn more :)

á is far from being unambiguous as an annotation, since it means so many different things in transcription systems and is often confused.

On this I very much agree.

Aside from the question of tones in particular, my point was more generally that the number of potential diacritic combinations with ultra-long can be very high. In short, this is a drawback of the "exhaustive list" strategy.

LinguList commented 3 years ago

We are currently exploring multi-tiers on a larger dataset with some additional new ideas on handling sequences for historical language comparison; I'll gladly share the draft with you at an early stage, @XachaB.

As to the exhaustive-list strategy: our vowel list is already extremely large, since I generated most of the vowels artificially to have high coverage from the beginning. I think it is still easier to add 100 more vowels and see where this goes, potentially also with tools that make the combinations explicit, like my JS profiler (which I also use for orthography-profile creation at times), rather than to further increase the productive power of the CLTS system: it already covers 8000 types, and we could never check whether all of them make sense, so I'd rather not extend that power for now...
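A tool that "makes the combinations explicit" can be approximated in a few lines. The inventories below are illustrative, not the actual CLTS ones, but they show both how such a generator could feed the exhaustive list and how quickly the counts grow:

```python
from itertools import product

# Small illustrative inventories; "" means the feature is absent.
vowels      = ["a", "e", "i", "o", "u"]
breathiness = ["", "\u0324"]            # plain, breathy
tones       = ["", "\u0301", "\u0300"]  # none, high, low
lengths     = ["", "ː", "ːː"]           # short, long, ultra-long

# Every combination of base vowel + compatible diacritics, made explicit.
graphemes = [v + b + t + l for v, b, t, l in product(vowels, breathiness, tones, lengths)]
print(len(graphemes))  # 5 * 2 * 3 * 3 = 90 graphemes from four small inventories
```

Even these tiny inventories yield 90 graphemes; with realistic vowel and diacritic inventories, the product grows fast, which is the trade-off of listing rather than parsing.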

XachaB commented 3 years ago

Ok, I'll make further PRs as needed, then. I think this choice is at least a pragmatic solution to the current issue.

Re: multi-tiers, with pleasure! I'm interested in the question for synchronic analyses, but I expect many of the questions to be common.

More thoughts, and then I'll stop so as not to veer too far off topic:

I realize that tones marked as they usually are in Lexibank, b a 1 b a 2, provide a sort of ad-hoc syllable segmentation, since they are marked after the syllable. This means that the multi-tiered representation can be recovered automatically from the segment sequence, by expanding each number to all positions in its syllable. But this doesn't extend to, for example, length or stress, which can be supra-segmental too but are often marked on the nucleus or vowel. If one wanted to "expand" them to the full syllables, one would need the actual syllable segmentation.
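The recovery step described here can be sketched as follows, assuming the convention that every syllable is closed by a tone number (`tone_tier` is a hypothetical helper for illustration, not a lingpy or pyclts function):

```python
def tone_tier(segments):
    """Expand syllable-final tone numbers to one tone label per segment."""
    tier, pending = [], []
    for seg in segments:
        if seg.isdigit():
            # The tone number closes the syllable: assign it to every
            # segment collected since the previous tone number.
            tier.extend([seg] * len(pending))
            pending = []
        else:
            pending.append(seg)  # a segment still awaiting its tone
    return tier

print(tone_tier(["b", "a", "1", "b", "a", "2"]))  # ['1', '1', '2', '2']
```

As noted above, no analogous recovery is possible for stress or length marked on the vowel alone, because nothing in the segment sequence tells us where those syllables end.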

If the user input, however, does include syllable segmentation (or, even better, a precise syllable-tag sequence), then all supra-segmental tiers should be recoverable automatically. But maybe this is just as much work as asking for separate sequences for segments, tones, length, and stress? I'm not sure.

LinguList commented 3 years ago

I have tested some annotations of stress, but it was quite difficult, since, as you point out, stress should reflect syllable boundaries, etc. In most SEA settings the tone-after-syllable convention is okay, since syllables are also morphemes, but I doubt the syllable segmentation is useful in other languages. And here I am still at a loss, since syllable boundaries contradict historically interesting signal, such as morpheme boundaries: e.g. Herbst "autumn" vs. Herbst-es "autumn, genitive", which would of course be syllabified as "h E r p s . t ə s". I have experimented a bit with double annotation inside a sequence, combining morpheme boundaries and syllable boundaries, but gave it up and decided that syllables would be best annotated in an extra tier. And for SEA languages, also for the sake of making this cleaner, I would put tone in an extra tier, though when annotating data one can leave it in place.

Ah, one last point: that stress is annotated on the vowel means we cannot retrieve the tier automatically; this is correct, we really need to annotate it manually for languages where we know it. This is why I don't like vowel-stress-marking or vowel-tone-marking in the first instance, since it does not truthfully reflect the phonology in a transcription.

XachaB commented 3 years ago

Thanks, that's a lot of food for thought.