IPA tokenizer does not recognize pre-aspiration and pre-nasalization

cldf / segments

Unicode Standard tokenization routines and orthography profile segmentation

Apache License 2.0


IPA tokenizer does not recognize pre-aspiration and pre-nasalization #41

Open Anaphory opened 5 years ago

Anaphory commented 5 years ago

I just tried to use segments.Tokenizer()(x, ipa=True) on some data containing pre-aspirated and pre-nasalized consonants and was surprised that a subsequent pyclts.TranscriptionSystem('bipa') call complained about a large number of undefined segments. Apparently segments does not associate ᵐ, ᵑ and ⁿ with the following sound, but appends them to the preceding vowel. (A similar problem exists with pre-aspirated consonants, but in that case I understand that distinguishing between pre- and post-aspiration is beyond the complexity segments wants to provide.)

bambooforest commented 5 years ago

Indeed this is the case:

    >>> import segments
    >>> t = segments.Tokenizer()
    >>> t("tʰaʰt")
    't ʰ a ʰ t'
    >>> t("tʰaʰt", ipa=True)
    'tʰ aʰ t'

And as you say, distinguishing pre- from post-aspiration/nasalization etc., which Unicode encodes with Spacing Modifier Letters, is beyond the complexity that segments aims to handle. See:

https://github.com/cldf/segments/blob/master/src/segments/tokenizer.py#L86-L92

Here I would tokenize without ipa=True and use a profile to specify the pre-x cases with rewrites. The current functionality aims to deal with the more common cases, i.e. post-aspiration/nasalization. But ideas on how to improve this are welcome (maybe ipa=True isn't very transparent)!
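To illustrate the profile idea, here is a minimal stdlib-only sketch of greedy longest-match segmentation against a list of graphemes. It is not the segments implementation: the function name segment_with_profile and the tiny grapheme list are assumptions made up for this example. A real orthography profile would list these graphemes in its Grapheme column.

```python
def segment_with_profile(word, graphemes):
    """Greedy longest-match segmentation of `word` (hypothetical helper).

    Characters not covered by any listed grapheme become their own segment.
    """
    # Try longer graphemes first so that "ʰt" wins over plain "t".
    ordered = sorted(graphemes, key=len, reverse=True)
    segments = []
    i = 0
    while i < len(word):
        for g in ordered:
            if word.startswith(g, i):
                segments.append(g)
                i += len(g)
                break
        else:
            # No grapheme matched at this position: emit the single character.
            segments.append(word[i])
            i += 1
    return segments


# Illustrative profile: the pre-x cases are listed explicitly as graphemes.
profile = ["tʰ", "ʰt", "ᵐb", "ⁿd", "a", "t"]

print(segment_with_profile("tʰaʰt", profile))  # ['tʰ', 'a', 'ʰt']
print(segment_with_profile("aᵐba", profile))   # ['a', 'ᵐb', 'a']
```

Because the pre-aspirated and pre-nasalized units are spelled out in the profile, no general rule about which side a modifier letter belongs to is needed.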

Anaphory commented 5 years ago

I have seen much more pre-nasalization (regular in some languages I have data on) than post-nasalization (never), so I added a few lines to my combine_modifiers to deal with pre-nasalized consonants, along the lines of the existing code dealing with stress marks. I'm likely to push that at some point.
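The kind of change described above could look roughly like the following sketch. It is not the actual patch to combine_modifiers: it is a standalone post-processing step, the name attach_prenasalization is hypothetical, and it assumes that ᵐ, ⁿ and ᵑ always mark pre-nasalization of the following sound (i.e. that post-nasalization does not occur in the data).

```python
PRENASAL_MARKS = {"ᵐ", "ⁿ", "ᵑ"}


def attach_prenasalization(segments):
    """Move a trailing ᵐ, ⁿ or ᵑ from a segment onto the following segment.

    Assumption (hypothetical helper): these marks never denote
    post-nasalization in the data being processed.
    """
    result = []
    carry = ""  # a mark waiting to be prefixed to the next segment
    for seg in segments:
        seg = carry + seg
        carry = ""
        if seg and seg[-1] in PRENASAL_MARKS:
            # Detach the mark and hold it for the next segment.
            carry = seg[-1]
            seg = seg[:-1]
        if seg:
            result.append(seg)
    if carry:
        # Nothing follows the mark, so leave it on the last segment.
        if result:
            result[-1] += carry
        else:
            result.append(carry)
    return result


# Output of the kind reported above: the mark is stuck on the preceding vowel.
print(attach_prenasalization(["aᵐ", "b", "a"]))  # ['a', 'ᵐb', 'a']
# A mark tokenized as its own segment is handled the same way.
print(attach_prenasalization(["ᵑ", "g", "a"]))   # ['ᵑg', 'a']
```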

That pre-aspiration cannot be handled easily is transparent enough. Even just fixing it with a rule like ‘vowels are never post-aspirated’ would require much more complexity than segments is supposed to provide, so having to use a profile for that is exactly what I would expect.
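To make the extra complexity concrete, here is a rough sketch of what a ‘vowels are never post-aspirated’ rule might involve. Everything in it is an assumption for illustration: the function name is hypothetical, and the VOWELS set is deliberately incomplete. A real version would need a full vowel inventory and would have to cope with diacritics, length marks and so on.

```python
# Illustrative, incomplete vowel inventory (an assumption for this sketch).
VOWELS = set("aeiouɛɔəɪʊæɑ")


def shift_aspiration_off_vowels(segments):
    """Reassign ʰ from the end of a vowel segment to the next segment.

    Hypothetical helper: only checks the first character of each segment
    against VOWELS, so vowels outside that set are left untouched.
    """
    result = list(segments)
    # Stop before the last segment: a word-final ʰ has nowhere to move.
    for i in range(len(result) - 1):
        seg = result[i]
        if len(seg) > 1 and seg.endswith("ʰ") and seg[0] in VOWELS:
            result[i] = seg[:-1]
            result[i + 1] = "ʰ" + result[i + 1]
    return result


# The ipa=True output from the example earlier in this thread.
print(shift_aspiration_off_vowels(["tʰ", "aʰ", "t"]))  # ['tʰ', 'a', 'ʰt']
```

Even this small version depends on language-specific knowledge (the vowel list), which is the sort of thing a profile expresses more directly.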

bambooforest commented 5 years ago

Would be great to have those extra lines, so please do push. Thanks!