cldf / segments

Unicode Standard tokenization routines and orthography profile segmentation
Apache License 2.0

Deal with prenasalization marks #43

Closed Anaphory closed 5 years ago

Anaphory commented 5 years ago

Deal with pre-nasalization the same way as with stress marks, and do both before dealing with the general case to avoid maintaining a list of exceptions in line 309

LinguList commented 5 years ago

Guess the reason this fails is that you need to modify the tests as well, right?

xrotwang commented 5 years ago

@bambooforest The change proposed by @Anaphory makes the test with Brazilian Portuguese data fail, so clearly isn't backwards compatible. So the question is whether the current behaviour can be regarded as buggy, or if we consider the current behaviour correct.

Looking at the code, and seeing that there already is special handling for stress marks (0x2c8 and 0x2cc), @Anaphory's change at least would not introduce additional complexity.

LinguList commented 5 years ago

I clearly support @Anaphory's proposal, although this is a minor thing, given that a detailed orthography profile should be able to handle this anyway, right?

Anaphory commented 5 years ago

Yes, this commit breaks something implicitly considered part of the API through the promise of tests. The specific thing the test asserts is that # v ẽⁿ t ʊ #, not # v ẽ ⁿt ʊ #, is the correct segmentation of that word.

In the case of your style of orthography profiles, you can handle it without recourse to this functionality of segments anyway. My case of multi-step transformations needs a final step of splitting an IPA string into segments, and given that post-nasalization is really rare (in PHOIBLE I found one single dⁿ, versus several dozen ⁿd also in other contexts), this change made sense to me. (I may switch to your style of orthography profiles at some point, but that would be a major overhaul I don't have any incentive to do now.)

tresoldi commented 5 years ago

I also support @Anaphory's proposal, from both a code and an articulatory point of view.

xrotwang commented 5 years ago

Any opinion on this, @bambooforest ?

bambooforest commented 5 years ago

Sorry, was out of office. Looking into this.

bambooforest commented 5 years ago

I'm reluctant to support this change, but I'm open for discussion. Here's why.

The character in question is the IPA diacritic for nasal release and not pre-nasalization (Unicode character U+207F SUPERSCRIPT LATIN SMALL LETTER N <ⁿ>).

Although it is sometimes used in the literature to denote prenasalization, as far as I'm aware there is no IPA-sanctioned transcription practice for it (also, what prenasalization actually is remains an area of debate, i.e. is it a sequence of nasal + obstruent, or is it a single consonant):

https://escholarship.org/uc/item/9d93t9t9

In PHOIBLE, pre-stopped nasals adhere to strict place-matching (this is in line with the suggestion by Ladefoged and Maddieson 1996):

http://phoible.github.io/conventions/

So in fact, the few occurrences with <ⁿ> in phoible are incorrect, e.g.:

| Row | Language | Segment | Inventory ID |
| --- | --- | --- | --- |
| 7 | Caodeng rGyalrong | ⁿb | 2257 |
| 8 | Caodeng rGyalrong | ⁿd | 2257 |
| 9 | Caodeng rGyalrong | ⁿdz | 2257 |
| 10 | Caodeng rGyalrong | ⁿd̠ʒ | 2257 |
| 11 | Caodeng rGyalrong | ⁿɖʐ | 2257 |
| 12 | Caodeng rGyalrong | ⁿɡ | 2257 |
| 13 | Caodeng rGyalrong | ⁿɢ | 2257 |
| 14 | Caodeng rGyalrong | ⁿɟ | 2257 |

This is a bug that needs to be fixed because these inventories come from EURPhon, which has its own IPA-like transcription system (apparently it follows Tibetan practice of using superscript n for pre-nasalization):

http://eurasianphonology.info/segments

As a nasal release diacritic encoded as "Letter, Modifier [Lm]" in the Unicode Standard, this symbol is meant to be able to go before or after the base glyph (like the aspiration diacritic). Currently, in the optional combine_modifiers function, we concatenate an Lm to the base glyph on its left, because this captures the greater typological diversity.

Ideally all Lms would be specified by an orthography profile, but for quick analysis we provide the combine_modifiers function.
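For readers following along, the attachment behaviour described above can be sketched in plain Python. This is a simplified illustration, not the actual combine_modifiers implementation; the function name and token-list interface are assumptions for the example:

```python
import unicodedata

def combine_modifiers_sketch(tokens):
    """Attach modifier letters (Unicode category Lm) to an adjacent base glyph.

    Simplified illustration: an Lm normally joins the base glyph to its left
    (aspiration: 't', 'ʰ' -> 'tʰ'); word-initial Lms are held back and joined
    to the glyph on their right (preaspiration: 'ʰ', 't' -> 'ʰt').
    """
    out = []
    pending = []  # word-initial modifiers waiting for a base glyph
    for tok in tokens:
        if unicodedata.category(tok[0]) == 'Lm':
            if out:
                out[-1] += tok        # default: attach leftwards
            else:
                pending.append(tok)   # word-initial, e.g. preaspiration
        else:
            out.append(''.join(pending) + tok)
            pending = []
    out.extend(pending)  # input consisted only of modifiers
    return out

print(combine_modifiers_sketch(['v', 'ẽ', 'ⁿ', 't', 'ʊ']))
# ['v', 'ẽⁿ', 't', 'ʊ']  (the segmentation the Brazilian Portuguese test expects)
```

Under this left-attachment default, ⁿ lands on the preceding vowel, which is exactly the behaviour the existing test pins down.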

Instead of pushing this single case:

https://github.com/cldf/segments/pull/43/files#diff-8925b2b6a8842072dd70e99face3c93bR314

perhaps it would be better either to have it as another standalone function applied after combine_modifiers, such as prenasalize, or to divide the current functionality of combine_modifiers into something like combine_modifiers_left and combine_modifiers_right?
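A standalone step along these lines could look like the sketch below. The name prenasalize follows the suggestion in the comment; the set of superscript nasals handled is an illustrative assumption, not part of any spec or of the segments API:

```python
# Sketch of the suggested standalone step, applied AFTER a
# combine_modifiers-style pass. It re-attaches a trailing superscript nasal
# to the FOLLOWING segment, e.g. ['ẽⁿ', 't'] -> ['ẽ', 'ⁿt'].
# The set of marks below is an illustrative assumption.
PRENASALS = ('ⁿ', 'ᵐ', 'ᵑ')

def prenasalize(tokens):
    out = list(tokens)
    for i in range(len(out) - 1):
        for mark in PRENASALS:
            # only move the mark if something of the segment remains
            if out[i].endswith(mark) and len(out[i]) > len(mark):
                out[i] = out[i][:-len(mark)]
                out[i + 1] = mark + out[i + 1]
                break
    return out

print(prenasalize(['v', 'ẽⁿ', 't', 'ʊ']))
# ['v', 'ẽ', 'ⁿt', 'ʊ']
```

Keeping it as a separate post-processing pass would leave combine_modifiers, and the tests that depend on it, untouched.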

xrotwang commented 5 years ago

Ok, as I understand this, there is no unambiguous way to interpret this character - so the current behaviour of segments is surely not a bug. Given this, I'd say

bambooforest commented 5 years ago

As noted, combine_modifiers does make the assumption that an Lm is concatenated to the right of the base glyph (unless the string starts with one, in which case it is appended to the left -- the typical case of preaspiration, i.e. word-initially).

The nasal release also usually belongs to phonetic transcription. That's why it appears in the Brazilian case (it's a narrow transcription).

LinguList commented 5 years ago

Agreed with @bambooforest and @xrotwang: for more complex applications, one can always write an orthography profile.

Anaphory commented 5 years ago

Shall we close this then, and start a separate issue for the feature request: slightly more flexible handling of modifiers that combine with the segment to their right versus modifiers that combine with the segment to their left, with the default being that all modifiers combine to the left apart from stress marks, so that it generalizes the current behaviour without changing it?

LinguList commented 5 years ago

I honestly wonder why this function should be part of segments, @Anaphory: splitting things by a regex goes, in my opinion, far beyond what the library is supposed to do for now. As an alternative for splitting text based on what you call modifiers, etc., it might as well be a standalone function, right? E.g., one similar to ipa2tokens in lingpy, but extended with respect to tweaks based on preceding modifiers?

Anaphory commented 5 years ago

I thought (a) that I was just suggesting lists of characters, not regexes, and (b) that ipa2tokens was supposed to be, at least to some extent, legacy code superseded by this package here; is that not the case? It has been a while since I looked at it.

LinguList commented 5 years ago

Not really. Segments is not a dependency of lingpy, and we use lingpy's ipa2tokens to generate initial orthoprofiles. I think what you suggest would have a place here, but I would not put it inside the class of an orthoprofile. Anyway, I think what you want (though I may be wrong) is something like the call signature we have for ipa2tokens...

bambooforest commented 5 years ago

I've always seen this package as a simple tool for segmenting (tokenizing) strings according to Unicode Standard Annex #29:

https://unicode.org/reports/tr29/

for Unicode characters and (extended) grapheme clusters (hence the regex \X matches for the latter).
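The difference between code points and grapheme clusters can be shown with a rough stdlib approximation. This is only an illustration of the idea behind \X; the real UAX #29 algorithm has many more rules (ZWJ sequences, Hangul jamo, emoji, ...):

```python
import unicodedata

def rough_grapheme_clusters(text):
    """Very rough approximation of the regex \\X idea: keep combining marks
    (categories Mn, Mc, Me) together with their base character. The real
    UAX #29 algorithm covers many more cases.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) in ('Mn', 'Mc', 'Me'):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# 'e' + COMBINING TILDE is one grapheme cluster but two code points:
print(rough_grapheme_clusters('e\u0303'))        # ['ẽ']
# a modifier letter like ⁿ (category Lm) is NOT a combining mark,
# so plain grapheme-cluster segmentation keeps it separate:
print(rough_grapheme_clusters('e\u0303\u207f'))  # ['ẽ', 'ⁿ']
```

Note that because ⁿ is an Lm, plain cluster segmentation never attaches it to a neighbour; that is exactly why a convenience step like combine_modifiers is needed at all.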

Additionally, we offer segmentation into Unicode's tailored grapheme clusters, but instead of using the Unicode CLDR's "heavy-weight" XML specification for localization:

http://cldr.unicode.org/locale_faq-html

we use the simpler approach of orthography profiles as CSV files, detailed here:

http://langsci-press.org/catalog/book/176

How one generates their initial orthography profile is up to the user (I use the segments package; as @LinguList mentioned, he uses ipa2tokens in LingPy). segments gives a low-level Unicode-compliant approach without making assumptions regarding linguistic transcription -- that is, except for combine_modifiers, which I added as a convenience function given what I've seen in the thousands of grammars used in PHOIBLE's phonological inventories.
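To make the orthography-profile idea concrete, here is a toy sketch of a grapheme-to-segment table applied with longest-match. The column names, CSV content, and helper functions are illustrative assumptions, not the actual segments API:

```python
import csv
import io

# Toy orthography profile in the CSV style described in the book linked above.
# Columns and mappings here are illustrative assumptions.
PROFILE_CSV = "Grapheme,IPA\nch,tʃ\nsh,ʃ\na,a\nn,n\n"

def load_profile(text):
    return {row['Grapheme']: row['IPA']
            for row in csv.DictReader(io.StringIO(text))}

def tokenize(word, profile):
    """Greedy longest-match segmentation against the profile's graphemes."""
    maxlen = max(len(g) for g in profile)
    out, i = [], 0
    while i < len(word):
        for size in range(min(maxlen, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in profile:
                out.append(profile[chunk])
                i += size
                break
        else:
            out.append('\ufffd')  # REPLACEMENT CHARACTER marks unknown input
            i += 1
    return out

print(tokenize('chan', load_profile(PROFILE_CSV)))
# ['tʃ', 'a', 'n']
```

Because the profile is just data, any language-specific decision (such as how to treat ⁿ) can live there rather than in the library's code.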

As far as I'm concerned, @Anaphory, I welcome another convenience function for left-joining Lm diacritics to the base glyph.

The only issue I see here is the potential backwards incompatibility if we rename combine_modifiers to something like combine_modifiers_right -- @xrotwang. But again, given what I've seen in the phonological typology literature, what we currently have is the default case (so we could state this explicitly in the documentation).

xrotwang commented 5 years ago

@Anaphory yes, closing and starting a new issue thread for a specific API enhancement seems the best way to go.