Open Anaphory opened 5 years ago
Indeed this is the case:
```python
>>> import segments
>>> t = segments.Tokenizer()
>>> t("tʰaʰt")
't ʰ a ʰ t'
>>> t("tʰaʰt", ipa=True)
'tʰ aʰ t'
```
and, as you say, handling both pre- and post- aspiration/nasalization etc., which Unicode denotes with Spacing Modifier Letters,
https://github.com/cldf/segments/blob/master/src/segments/tokenizer.py#L86-L92
is beyond the complexity `segments` is supposed to handle. Here I would apply tokenization without `ipa=True` and use a profile that specifies the pre-x cases with rewrites. The current functionality aims to deal with the more likely cases, i.e. post-aspiration/nasalization. But ideas for how to improve this are welcome (maybe `ipa=True` isn't very transparent)!
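The profile-based approach suggested above can be illustrated without the library: an orthography profile effectively gives the tokenizer an inventory of (possibly multi-character) graphemes, which it matches greedily, longest first. Below is a minimal library-free sketch of that strategy; the grapheme inventory is a made-up example, not the actual `segments` implementation.

```python
# Sketch of greedy longest-match grapheme tokenization, the strategy an
# orthography profile drives in segments. The inventory is illustrative:
# it includes a pre-aspirated stop 'ʰt' and a pre-nasalized stop 'ᵐb',
# so those modifier letters attach to the FOLLOWING consonant.
GRAPHEMES = ['ʰt', 'ᵐb', 'tʰ', 'aʰ', 't', 'a', 'b']

def tokenize(word, graphemes=GRAPHEMES):
    # Longest graphemes first, so 'ʰt' wins over bare 't'.
    graphemes = sorted(graphemes, key=len, reverse=True)
    out, i = [], 0
    while i < len(word):
        for g in graphemes:
            if word.startswith(g, i):
                out.append(g)
                i += len(g)
                break
        else:
            out.append(word[i])  # unknown character: emit as-is
            i += 1
    return out

print(' '.join(tokenize('ʰtatʰa')))  # → 'ʰt a tʰ a'
```

In real use, the inventory (plus any rewrite rules) would live in a profile TSV passed to the `Tokenizer`, rather than being hard-coded.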
I have seen much more pre-nasalization (regular in some languages I have data on) than post-nasalization (never), so I added a few lines to my copy of `combine_modifiers`
to deal with pre-nasalized consonants, along the lines of the existing code that handles stress marks. I'm likely to push that at some point.
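The actual added lines aren't shown here, but the idea can be sketched as a small post-processing pass over naively tokenized segments: when a superscript pre-nasalization mark occurs, attach it to the following segment instead of the preceding one. The function name and the mark set below are illustrative assumptions, not the `segments` code.

```python
# Sketch (not the actual segments code): re-attach superscript
# pre-nasalization marks to the FOLLOWING segment.
PRE_NASALS = {'\u1d50', '\u1d51', '\u207f'}  # ᵐ ᵑ ⁿ

def combine_pre_modifiers(segs):
    out, pending = [], ''
    for s in segs:
        if s in PRE_NASALS:
            pending += s              # hold the mark for the next segment
        else:
            out.append(pending + s)
            pending = ''
    if pending:                       # trailing mark with nothing to attach to
        out.append(pending)
    return out

print(combine_pre_modifiers(['a', 'ᵐ', 'b', 'a']))  # → ['a', 'ᵐb', 'a']
```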
The fact that pre-aspiration cannot easily be handled is transparent enough; even just fixing it by declaring that vowels are never post-aspirated would require much more complexity than `segments`
is supposed to provide, and having to use a profile for that is exactly what I would expect.
Would be great to have those extra lines, so please do push. Thanks!
I just tried to use
`segments.Tokenizer()(x, ipa=True)`
on some data containing pre-aspirated and pre-nasalized consonants and was surprised that a subsequent `pyclts.TranscriptionSystem('bipa')`
call complains about very many undefined segments. Apparently `segments`
does not know to associate ᵐ, ᵑ and ⁿ with the following sound, but appends them to the preceding vowel. (A similar problem exists with pre-aspirated consonants, but in that case I understand that distinguishing between pre- and post-aspiration is beyond the complexity `segments`
wants to provide.)