cldf / segments

Unicode Standard tokenization routines and orthography profile segmentation
Apache License 2.0

Parsing glitch with dental click (ǀ’) #26

Closed SimonGreenhill closed 6 years ago

SimonGreenhill commented 6 years ago

I've been staring at this for 30 minutes but can't figure out why the dental click (ǀ’, where "ǀ" = LATIN LETTER DENTAL CLICK) is being split into two graphemes ('ǀ' and '’'):

import segments

def passthru(c):
    return '{%s}' % c  # wrap graphemes missing from the profile in braces

sandawe = 'ǀ’ùsù'
inventory = 's,ǀ’,x,ù'
# build an orthography profile in which every grapheme maps to itself
prf = segments.Profile(*[{'Grapheme': g, 'mapping': g} for g in inventory.split(',')])
tokenizer = segments.Tokenizer(profile=prf, errors_replace=passthru)
print(tokenizer(sandawe, ipa=True))

>>> ǀ ’ ù s ù

LinguList commented 6 years ago

I think it's the confusing "ipa=True" keyword: it means the internal IPA detector is used, and that is what splits these things. So what you want to do is:

print(tokenizer(sandawe))

If you set this ipa keyword to True, the tokenizer no longer uses your profile.
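
With your profile, the plain call should give the expected segmentation (output as I'd expect it for your snippet above):

>>> print(tokenizer(sandawe))
ǀ’ ù s ù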

bambooforest commented 6 years ago

@SimonGreenhill - this should work if you use the proper IPA ejective character (i.e. U+02BC, not the keyboard-style quotation mark U+2019):

U+02BC -- MODIFIER LETTER APOSTROPHE <ʼ>

sandawe2 = 'ǀʼùsù'
print(tokenizer(sandawe2, ipa=True))

>>> ǀʼ ù s ù

Btw - this is because U+02BC belongs to the Unicode general category Modifier Letter (Lm), like aspiration marks, which can potentially go before or after the base segment. Here in the code:

https://github.com/cldf/segments/blob/master/src/segments/tokenizer.py#L457

A regular keyboard <'> gets treated as a "regular" character.
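
You can check the categories yourself with Python's standard unicodedata module (a quick sketch covering the three apostrophe-like characters discussed here):

import unicodedata

# compare the three apostrophe-like characters by Unicode general category
for ch in ("'", '\u2019', '\u02bc'):
    print('U+%04X %-30s %s' % (ord(ch), unicodedata.name(ch), unicodedata.category(ch)))

# U+0027 APOSTROPHE                     Po
# U+2019 RIGHT SINGLE QUOTATION MARK    Pf
# U+02BC MODIFIER LETTER APOSTROPHE     Lm

Only U+02BC has category Lm, so only it gets attached to a base character by the tokenizer.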

SimonGreenhill commented 6 years ago

Thanks both. Hmm. I'd rather change it in the profile than in the raw data. Is this possible?

bambooforest commented 6 years ago

Can you first transform keyboard <'> to IPA <ʼ> with the orthography profile before tokenizing?
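
Something along these lines (a sketch, not tested against your data; the idea is that the Grapheme column matches what is in the raw text, while the mapping column carries the normalized IPA form):

# match the raw U+2019 sequence, but emit the U+02BC form
prf = segments.Profile(
    {'Grapheme': 'ǀ’', 'mapping': 'ǀʼ'},  # right single quotation mark -> modifier letter apostrophe
    {'Grapheme': 's', 'mapping': 's'},
    {'Grapheme': 'x', 'mapping': 'x'},
    {'Grapheme': 'ù', 'mapping': 'ù'},
)
tokenizer = segments.Tokenizer(profile=prf, errors_replace=passthru)
print(tokenizer(sandawe, column='mapping'))  # should give: ǀʼ ù s ù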

LinguList commented 6 years ago

You could just normalize with CLTS, as we have this already in our normalization procedure:

>>> from pyclts.clts import *
>>> bipa = CLTS()
>>> bipa.normalize('t’') == 't’'
False

bambooforest commented 6 years ago

Just wanted to add here that ipa=True works as intended -- give it valid Unicode IPA and the input is returned tokenized correctly.

Valid/strict IPA is listed here starting on page 68:

https://github.com/unicode-cookbook/cookbook/blob/master/unicode-cookbook.pdf

I will add machine-readable versions of these tables (as requested by a reviewer) to a cookbook/data/ directory.

I'm not sure what pyclts.clts is -- a quick search doesn't show anything relevant.

Note you might be wary of cases where the input data uses the keyboard apostrophe both for the ejective and as an apostrophe (sounds strange, but I've seen stranger -- consider the use of the keyboard "!" as a click and as an exclamation! Together they are very impressive. :)

LinguList commented 6 years ago

Unfortunately, you weren't able to make it to Poznań, but I presented CLTS there, and a web demo is available. We distinguish different steps of normalization for what users think is IPA: conversion of the input to NFD, normalization by lookalikes, normalization by aliases, and finally an algorithm that generates potential sounds from their constituents if a sound is not yet in our database. We plan to link to Phoible, PBase, and Ruhlen's data, but also to register different transcription systems. The code is currently offline, but clpa was a predecessor.

You wrote: "Just wanted to add here that ipa=True works as intended -- give it valid Unicode IPA and the input is returned tokenized correctly."

The question is: what IS valid Unicode IPA? Do you treat "ts" as one sound? No. Do you distinguish pre- and post-aspiration? No, because it's impossible. And given that @SimonGreenhill already had the orthography profile, segmenting the string as if it WERE IPA is not needed, as it is already segmented.

The ipa keyword is probably just misleading here, as is the doc-string: one may think that it will convert to IPA, although it is not converting, just segmenting, IFF the underlying data corresponds to the dialect that you define as regular IPA. And even there, nobody can do miracles with the normal ambiguities of IPA. It is also inconsistent, as you register a profile and then segment following another profile. Ideally, this keyword should only work with an empty profile.

bambooforest commented 6 years ago

Thanks for raising these concerns! They are all described in the cookbook:

https://github.com/unicode-cookbook/cookbook

In the case of <ts>, valid IPA would use the tie bar, as specified in the IPA handbook.

Indeed pre/post aspiration is a difficult problem, hence the warning "work in progress" in the code:

https://github.com/cldf/segments/blob/master/src/segments/tokenizer.py#L229

Preaspiration is exceedingly rare, but if it's word-initial this code catches it. Otherwise, we could add some rules to make sure that it's not added to vowels, etc.

SimonGreenhill commented 6 years ago

Ok, I could pre-convert, but I'm trying to analyse the JIPA article series to see whether their phoneme inventories match the text transcripts they provide. The keyboard <'> is in the transcripts and I'd like to be able to capture it without converting it (i.e. I want to know how many mismatches there are and of what kind). Is this possible?
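
Concretely, what I'm after is something like this (a sketch building on my snippet above -- the passthru replacer already wraps anything the profile doesn't cover in braces):

import re
from collections import Counter

# profile-based tokenization (no ipa=True); unmatched graphemes come back as {...}
tokenized = tokenizer(sandawe)
mismatches = Counter(re.findall(r'\{(.+?)\}', tokenized))
print(mismatches)  # which characters fell through the profile, and how often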

bambooforest commented 6 years ago

I'm not sure how (or if) you digitized the PDF passages (e.g. OCR or had them typed up by hand), but in our experience it was easier to use the valid IPA symbols, like the modifier apostrophe, or <ɡ> instead of keyboard <g>, because we expect that JIPA meant the correct Unicode IPA, but who knows what happens when these symbols go through the typesetting process for publication. The point here is that they visually represent IPA in published form.

This caveat doesn't cover cases where they use IPA symbols or combinations that go against their own principles, which is what it sounds like you're after. For example, Serer [sere1260, srr] has contrastive voiceless implosives, but they are marked in the article as voiced implosives with voiceless diacritics (diacritics are typically for denoting allophonic variation).

It seems simplest to just batch convert the passages from keyboard apostrophe to IPA apostrophe because you're probably going to encounter other visual errors like this.
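
E.g. something as simple as this would do (a sketch; 'passage.txt' is a placeholder, and it assumes every keyboard or typographic apostrophe in the passages is meant as an ejective):

# one-shot normalization: keyboard/typographic apostrophes -> U+02BC
APOSTROPHE_FIX = str.maketrans({"'": '\u02bc', '\u2019': '\u02bc'})

with open('passage.txt', encoding='utf-8') as f:  # hypothetical input file
    text = f.read().translate(APOSTROPHE_FIX)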

Alternatively, you could update the tokenization code to handle keyboard apostrophe as an exception (and anything else that comes your way), here:

https://github.com/cldf/segments/blob/master/src/segments/tokenizer.py#L457

The method iterates backwards through a form, and if it catches a segment that is in the Unicode class of modifier letters (like aspiration, ejectives, etc.), it places it next to the base character (with the rare exception mentioned by Mattis above).
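
The gist is roughly this (a sketch of the idea, not the library's actual code):

import unicodedata

def combine_modifiers_sketch(chars):
    """Walk right-to-left and glue modifier letters (Lm) onto their base."""
    out, pending = [], ''
    for c in reversed(chars):
        if unicodedata.category(c) == 'Lm':
            pending = c + pending    # hold the modifier until we reach its base
        else:
            out.append(c + pending)  # attach held modifiers after the base
            pending = ''
    if pending:                      # word-initial modifier, e.g. preaspiration
        out.append(pending)
    return list(reversed(out))

print(combine_modifiers_sketch(list('tʼa')))  # ['tʼ', 'a']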

Note also that you might want to drop this line of code:

https://github.com/cldf/segments/blob/master/src/segments/tokenizer.py#L458

that lets the primary and secondary stress marks "float", i.e. they aren't placed next to a segment/syllable/word but occur between words, e.g.

n a ˦ h aː kʰ o ˈ s ɛː # j i j a ˈ ʃ ĩː ˦ #

Or perhaps use one of Mattis' procedures.

SimonGreenhill commented 6 years ago

Thanks for the detailed explanation and workarounds. I'll see how I go! Will close this issue now and reopen if needed.