Closed SimonGreenhill closed 6 years ago
I think it's the confusing `ipa=True` keyword, which means that the internal IPA detector is used, and that's what splits these things. So what you want to do is:

```
print(tokenizer(sandawe))
```

If you set this ipa keyword to True, it no longer uses your tokenizer.
@SimonGreenhill - this should work if you use the proper IPA ejective character (i.e. not keyboard U+2019):
U+02BC -- MODIFIER LETTER APOSTROPHE <ʼ>
```
>>> sandawe2 = 'ǀʼùsù'
>>> print(tokenizer(sandawe2, ipa=True))
ǀʼ ù s ù
```
Btw - this is because U+02BC belongs to the block of Letter Modifiers (Lm) like aspiration, where they can potentially go before or after the base segment. Here in the code:
https://github.com/cldf/segments/blob/master/src/segments/tokenizer.py#L457
A regular keyboard <'> gets treated as a "regular" character.
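You can see the distinction by checking the Unicode general category of each apostrophe-like character yourself; a minimal sketch using only the standard library:

```python
import unicodedata

# Three apostrophe lookalikes with very different Unicode categories:
for ch in ("\u02BC",   # MODIFIER LETTER APOSTROPHE, the IPA ejective marker
           "\u2019",   # RIGHT SINGLE QUOTATION MARK, what "smart quotes" produce
           "\u0027"):  # plain ASCII apostrophe
    print("U+%04X %-30s %s" % (ord(ch), unicodedata.name(ch),
                               unicodedata.category(ch)))

# U+02BC is Lm (Letter, modifier) and gets attached to the base segment;
# U+2019 (Pf) and U+0027 (Po) are punctuation and get split off.
```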
Thanks both. Hmm. I'd rather change it in the profile than in the raw data. Is this possible?
Can you first transform keyboard <'> to IPA <ʼ> with the OP before tokenizing?
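For example, a hypothetical pre-processing step in plain Python (not part of segments or the orthography profile machinery; the lookalike mapping is an assumption you would adapt to your data):

```python
# Map apostrophe lookalikes to U+02BC MODIFIER LETTER APOSTROPHE before
# tokenizing. Which lookalikes actually occur is data-dependent.
EJECTIVE_FIXES = str.maketrans({
    "\u0027": "\u02BC",  # ASCII apostrophe
    "\u2019": "\u02BC",  # right single quotation mark
})

def fix_ejectives(text):
    return text.translate(EJECTIVE_FIXES)

print(fix_ejectives("\u01C0'usu"))  # -> ǀʼusu, with the proper IPA ejective
```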
You could just normalize with CLTS, as we have this already in our normalization procedure:
```
>>> from pyclts.clts import *
>>> bipa = CLTS()
>>> bipa.normalize('t’') == 't’'
False
```
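The idea behind that `False` result can be sketched as: NFD-normalize, then substitute known lookalikes. This is my own illustrative mini-version, not the real CLTS table or code:

```python
import unicodedata

# Illustrative lookalike table; actual CLTS normalization covers far more.
LOOKALIKES = {
    "\u2019": "\u02BC",  # keyboard apostrophe -> ejective marker
    "g": "\u0261",       # keyboard g -> IPA script g
    ":": "\u02D0",       # colon -> length mark
}

def normalize(s):
    s = unicodedata.normalize("NFD", s)
    return "".join(LOOKALIKES.get(c, c) for c in s)

# Normalization changed the string, so comparing it to the raw
# input yields False:
print(normalize("t\u2019") == "t\u2019")  # False
```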
Just wanted to add here that `ipa=True` works as intended -- give it valid Unicode IPA and the input is returned tokenized correctly.
Valid/strict IPA is listed here starting on page 68:
https://github.com/unicode-cookbook/cookbook/blob/master/unicode-cookbook.pdf
I will add machine-readable versions of these tables (as requested by a reviewer) to a cookbook/data/ directory.
I'm not sure what pyclts.clts is -- a quick search doesn't show anything relevant.
Note you might want to be wary of cases where the input data uses the keyboard apostrophe for both the ejective and as an apostrophe (sounds strange, but I've seen stranger -- consider the use of the keyboard "!" as a click and as an exclamation! Together they are very impressive. :)
Unfortunately, you weren't able to make it to Poznan, where I presented CLTS. A web demo can be found here. We distinguish different steps of normalization for what users think is IPA: input to NFD, normalization by lookalikes, normalization by aliases, and finally an algorithm that generates potential sounds from the constituents, if a sound is not yet in our database. We plan to link to Phoible, PBase, and Ruhlen's data, but also to register different transcription systems. The code is currently offline, but clpa was a predecessor.
> Just wanted to add here that the ipa=True works as intended -- give it valid Unicode IPA and the input is returned tokenized correctly.
The question is: what IS valid Unicode IPA? Do you treat "ts" as one sound? No. Do you distinguish pre- and post-aspiration? No, because it's impossible. And given that @SimonGreenhill already had the orthography profile, segmenting the string as if it WAS IPA is of course not needed, as it is already segmented.
The ipa keyword is probably just misleading here, as is the doc-string: one may think that it will convert to IPA, although it is not converting but just segmenting, and only if the underlying data corresponds to the dialect that you define as regular IPA. And even there, nobody can work miracles with the normal ambiguities of IPA. It is also inconsistent, as you register a profile and then segment following another profile. Ideally, this should only work with an empty profile.
Thanks for raising these concerns! They are all described in the cookbook:
https://github.com/unicode-cookbook/cookbook
In the case of <ts>, valid IPA would use the tie bar, as specified in the IPA handbook.
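With the tie bar (U+0361 COMBINING DOUBLE INVERTED BREVE, a combining mark), even a simple Unicode-aware clustering keeps the affricate together. A minimal sketch of the idea, not the segments implementation:

```python
import unicodedata

def graphemes(s):
    # Attach combining marks (category M*) to the preceding base, and
    # keep the character after a tie bar in the same cluster.
    out = []
    for ch in s:
        if out and (unicodedata.category(ch).startswith("M")
                    or out[-1].endswith("\u0361")):
            out[-1] += ch
        else:
            out.append(ch)
    return out

print(graphemes("t\u0361sa"))  # ['t͡s', 'a'] -- one affricate plus a vowel
print(graphemes("tsa"))        # ['t', 's', 'a'] -- plain <ts> stays two segments
```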
Indeed pre/post aspiration is a difficult problem, hence the warning "work in progress" in the code:
https://github.com/cldf/segments/blob/master/src/segments/tokenizer.py#L229
Preaspiration is exceedingly rare, but if it's word initial this code catches it. Otherwise, we could add some rules to make sure that it's not added to vowels, etc.
Ok, I could pre-convert, but I'm trying to analyse the JIPA article series and see if their phoneme inventories match the text transcripts they have. The keyboard <'> is in the transcript, and I'd like to be able to capture that without converting it (i.e. I want to know how many mismatches there are, and of what kind). Is this possible?
I'm not sure how (or if) you digitized the PDF passages (e.g. OCR or typed up by hand), but in our experience it was easier to use the valid IPA symbols, like the IPA apostrophe, or <ɡ> instead of keyboard <g>.
This caveat doesn't cover cases where they use IPA symbols or combinations that go against their own principles, which sounds like what you're after. For example, Serer [sere1260, srr] has contrastive voiceless implosives, but they are marked in the article by voiced implosives with voiceless diacritics (diacritics are typically for denoting allophonic variation).
It seems simplest to just batch convert the passages from keyboard apostrophe to IPA apostrophe because you're probably going to encounter other visual errors like this.
Alternatively, you could update the tokenization code to handle keyboard apostrophe as an exception (and anything else that comes your way), here:
https://github.com/cldf/segments/blob/master/src/segments/tokenizer.py#L457
The method iterates backwards through a form and if it catches a segment that is in the class of Unicode Letter Modifiers (like aspiration, ejectives, etc.), it places them next to the base character (rare exception mentioned by Mattis above).
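The idea can be sketched roughly like this (my own re-implementation of the description above, not the actual segments code):

```python
import unicodedata

def attach_modifiers(chars):
    # Walk backwards through the form, collecting Letter Modifier (Lm)
    # characters and gluing them onto the preceding base segment; anything
    # left over at the word start (e.g. preaspiration) attaches to the
    # first base instead.
    out, pending = [], ""
    for ch in reversed(chars):
        if unicodedata.category(ch) == "Lm":
            pending = ch + pending
        else:
            out.append(ch + pending)
            pending = ""
    if pending and out:
        out[-1] = pending + out[-1]
    elif pending:
        out.append(pending)
    out.reverse()
    return out

print(attach_modifiers(list("\u01C0\u02BCusu")))  # ['ǀʼ', 'u', 's', 'u']
print(attach_modifiers(list("\u02B0ta")))         # ['ʰt', 'a'], word-initial modifier
```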
Note also that you might want to drop this line of code:
https://github.com/cldf/segments/blob/master/src/segments/tokenizer.py#L458
that lets the primary and secondary stress marks "float", i.e. they aren't placed next to a segment/syllable/word but occur between words, e.g.

```
n a ˦ h aː kʰ o ˈ s ɛː # j i j a ˈ ʃ ĩː ˦ #
```
Or perhaps use one of Mattis' procedures.
Thanks for the detailed explanation and workarounds. I'll see how I go! Will close this issue now and reopen if needed.
I've been staring at this for 30m but can't figure out why the dental click (ǀ', where "ǀ" = Latin Letter Dental Click) is being split into two graphemes ('|' and '’').