CUNY-CL / wikipron

Massively multilingual pronunciation mining
Apache License 2.0

Potential problem in _parse_combining_modifiers() #83

Closed lfashby closed 5 years ago

lfashby commented 5 years ago

I started the second big scrape and while scraping for phonetic data from Albanian, Wikipron threw an error, the last line of which I'll reproduce below:

  File ".../wikipron/config.py", line 73, in _parse_combining_modifiers
    last_char = chars.pop()
IndexError: pop from empty list

The final line in the Albanian phonetic tsv is herë h ɛː ɾ meaning the scrape likely failed on this entry which contains what looks like word initial aspiration.

I guess for words like the one that caused this error we would want to combine the diacritic with the next char, giving ʰi d r ɔ ɟ ɛ n?

jacksonllee commented 5 years ago

I spot-checked a bunch of Albanian entries that orthographically begin with "h", and so far the only one I can find with this initial aspirated h diacritic in the pronunciation is the entry Lucas pointed out. Could this be just an error from Wiktionary? (I also see quite a number of entries with /(h).../ in pronunciation, with parentheses around h pronunciation-initially, like this entry -- not sure if this is related to the current problem though.)

Lucas, if this is currently a blocker for the big scrape, could you create another branch for now and wrap the following statement in a try-except block (this is where _parse_combining_modifiers would be called during the scrape) to catch IndexError specifically:

https://github.com/kylebgorman/wikipron/blob/3abd38f6ad0299a1325d84952ff9f665cd97b8e3/wikipron/scrape.py#L50

and within the except part do the following:

lfashby commented 5 years ago

This error was causing the big scrape to stop, so I made the changes you suggested.

The try-except block with logging works well; I also caught humb while scraping Albanian. I guess this is just how these words were entered into Wiktionary. Is this just a misuse of the aspiration diacritic? I didn't think you could have aspiration that wasn't linked to a preceding phone.

kylebgorman commented 5 years ago

I don't know if it violates a rule of IPA transcription but certainly I've never seen such a thing before in my life and it seems wrong. That said it ought to be possible to make _parse_combining_modifiers sufficiently robust to this---it just needs to check that the stack isn't empty before popping. Feel free to make this issue re: that!
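A minimal sketch of that guard, assuming the parser walks the transcription one character at a time and glues Unicode combining marks and modifier letters onto the previous entry of a stack. The name parse_combining_modifiers_sketch and the category test are illustrative assumptions, not the code in wikipron/config.py:

```python
import unicodedata


def parse_combining_modifiers_sketch(ipa: str) -> str:
    """Space-separates IPA symbols, attaching modifiers to the previous one.

    Hypothetical stand-in for _parse_combining_modifiers; the real
    function may decide what counts as a modifier differently.
    """
    chars = []
    for char in ipa:
        # Assumption: combining marks (Mn) and modifier letters (Lm)
        # are the characters that attach to a preceding symbol.
        is_modifier = unicodedata.category(char) in ("Mn", "Lm")
        if is_modifier and chars:
            # Safe to pop: the stack is known to be non-empty here.
            last_char = chars.pop()
            chars.append(last_char + char)
        else:
            # Either an ordinary symbol, or a modifier with nothing
            # before it; the latter is kept as its own entry.
            chars.append(char)
    return " ".join(chars)


print(parse_combining_modifiers_sketch("hɛːɾ"))      # h ɛː ɾ
print(parse_combining_modifiers_sketch("ʰidrɔɟɛn"))  # ʰ i d r ɔ ɟ ɛ n
```

With the guard in place, a transcription-initial modifier no longer raises; in this sketch it simply becomes its own entry, which is only one of several possible policies.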

m-sean commented 5 years ago

According to Wikipedia, preaspiration is an uncommon phenomenon found in some languages, but it's not always transcribed with a diacritic (sometimes it's just an 'h'+obstruent cluster). This means that _parse_combining_modifiers will occasionally parse modifier diacritics incorrectly, especially if they are not at the beginning of the transcription. This is also a problem for things like prenasalization, prepalatalization, prevelarization, etc. Should we just stop combining Unicode modifiers with the preceding character, or make a first-char fix for this problem?

lfashby commented 5 years ago

Is this an issue that needs to be resolved before running the next big scrape, or should I just go ahead and run the big scrape while logging instances in which we find first-char diacritics?

As Sean mentioned, _parse_combining_modifiers will parse diacritics incorrectly if given a word like this one from Faroese (https://en.wiktionary.org/wiki/kokusn%C3%B8t#Faroese): /ˈkʰoːʰkʊsˌnøːʰt/. It doesn't seem like there are any easy solutions to this problem (that can also handle all the other pre-phone diacritic usages Sean mentioned), but I'm also not sure if this is an actual problem or just an interesting side effect of the potentially ambiguous nature of IPA diacritics (because they may be able to apply to the preceding or following segment -- though not in the Faroese example above).

kylebgorman commented 5 years ago

Please kick things off ASAP and we can resolve this separately.

kylebgorman commented 5 years ago

Am trying to figure out what's affected here:

#!/usr/bin/env python

import fileinput

# This is my guess for possible pre-segments...
bad = frozenset("ʰ ʷ ʲ ˠ ˤ ˡ ⁿ".split())

for line in fileinput.input():
    (grapheme, phoneme) = line.rstrip().split("\t", 1)
    if phoneme[0] in bad:
        print(phoneme)
        print(f"(file {fileinput.filename()}, line {fileinput.filelineno()})")

Here's what I get:

ʰidrɔɟɛn
(file alb_phonetic.tsv, line 230)
ʰumb
(file alb_phonetic.tsv, line 233)
ʲsɛdmnaːt͡stiː
(file cze_phonemic.tsv, line 2476)
ˤ
(file mlt_phonemic.tsv, line 381)
ʲ
(file rus_phonetic.tsv, line 385004)
ʷombʌd̪ɯ
(file tam_phonetic.tsv, line 269)
ʷoɳɳɯ
(file tam_phonetic.tsv, line 271)

So I would propose that we remove at least some of those in my bad from the list of segments which are combining. What do you all think?

lfashby commented 5 years ago

Also logged these while scraping. From Chichewa (logged about 45 of these, most of the ˈⁿ type):

ⁿdaˈɽá.ma
ˈⁿda.ní

From Bulgarian (2 of these): "уволня" "̪ovoɫˈnʲɤ" Here is the Bulgarian entry on Wiktionary. (I can't separate the diacritic from the quotation marks.)

jacksonllee commented 5 years ago

The ⁿd is probably legit as a prenasalized stop, as Chichewa is a Bantu language. This is beautiful. Thank you for showing it to us, Lucas!

I'm inclined to think a quick and good-enough fix for this ticket is to handle chars.pop() for potential IndexError. If the error is thrown, keep the char in a separate variable, and when we see the next char, we simply chars.append(f"{previous_char}{char}") and reset previous char to empty string. This should resolve all cases (like Chichewa) where a word-initial diacritic is legit.

What would be the catch? This approach would still get us the Albanian ʰi d r ɔ ɟ ɛ n and such, which we're almost certain is problematic. That said, based on Kyle's comment, it doesn't look like it's a widespread issue from the last scrape. Pouring in more dev to separate ʰi etc. as two symbols (or even correct it to h i) would seem not worth it at the moment. We could have a ticket up to keep track of this.
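The carry-forward idea can be sketched as follows. This is an illustration under assumptions, not the project's code: the function name is made up, and the test for what counts as a modifier (Unicode categories Mn and Lm) is a guess at what the real parser does.

```python
import unicodedata


def attach_initial_modifier_sketch(ipa: str) -> str:
    """Hypothetical variant: a leading modifier joins the next symbol."""
    chars = []
    previous_char = ""  # Modifiers seen before any base symbol.
    for char in ipa:
        # Assumption: Mn and Lm characters are the "combining" ones.
        if unicodedata.category(char) in ("Mn", "Lm"):
            try:
                last_char = chars.pop()
            except IndexError:
                # Nothing to attach to yet, so hold on to the modifier.
                previous_char += char
            else:
                chars.append(last_char + char)
        else:
            chars.append(f"{previous_char}{char}")
            previous_char = ""
    if previous_char:
        # The transcription consisted of modifiers only.
        chars.append(previous_char)
    return " ".join(chars)


print(attach_initial_modifier_sketch("ʰidrɔɟɛn"))  # ʰi d r ɔ ɟ ɛ n
print(attach_initial_modifier_sketch("kʰæt"))      # kʰ æ t
```

The final flush covers transcriptions that contain nothing but a modifier, such as the lone ˤ and ʲ entries in the list above.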

Kyle wrote: "So I would propose that we remove at least some of those in my bad from the list of segments which are combining."

Say we remove ʰ from the list. Would this mean we wouldn't correctly get aspirated stops anymore (like kʰ æ t)? Or am I misinterpreting what you mean, Kyle?
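To make the question concrete, here is a small sketch in which the set of symbols excluded from combining is a parameter. The function and its non_combining argument are hypothetical, and the Unicode category test is an assumption about how combining symbols are identified:

```python
import unicodedata


def parse_sketch(ipa: str, non_combining=frozenset()) -> str:
    """Hypothetical parser with a configurable non-combining set."""
    chars = []
    for char in ipa:
        # Assumption: Mn and Lm characters combine unless excluded.
        combines = (
            unicodedata.category(char) in ("Mn", "Lm")
            and char not in non_combining
        )
        if combines and chars:
            chars[-1] += char
        else:
            chars.append(char)
    return " ".join(chars)


print(parse_sketch("kʰæt"))                                # kʰ æ t
print(parse_sketch("kʰæt", non_combining=frozenset("ʰ")))  # k ʰ æ t
```

Under these assumptions, excluding ʰ outright does split ordinary aspirated stops, which is the cost being asked about.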

m-sean commented 5 years ago

I don't disagree with the fix, but isn't this a greater issue for the transcriptions if things like prenasalization/preaspiration occur after the first segment (e.g., Lucas's Faroese—/ˈkʰoːʰkʊsˌnøːʰt/)? Or is that not a big concern for the data? Also Lucas's Bulgarian example definitely looks like another error—a unicode combining character at the beginning of the transcription seems strange.

kylebgorman commented 5 years ago

One minor disagreement: in this particular case, it's better to check whether the stack is empty than catch the exception. (When we're talking about more expensive lookups, like in a dictionary, I can go either way.)

jacksonllee commented 5 years ago

Oh right, I forgot about the more recent comments in the middle. Revising and adding more thoughts --

  1. For a transcription-initial diacritic, if we don't want a very involved solution, there's not much we could do other than just prepending it to the first non-diacritic char. So this would "resolve" the issue for Albanian and Bulgarian. (And I like Kyle's suggestion that we can just check whether chars is empty.)

  2. Are there diacritics that (almost) always prepend to a char? (I'm thinking of the prenasalization diacritics, for example.) I wonder if we could have a separate list for these and handle them as prepending a char (as opposed to the default appending behavior of _parse_combining_modifiers()). This would cover, say, prenasalized consonants in Bantu (including the non-word-initial prenasalization diacritics).

  3. The tricky part is the diacritics that could either append or prepend a char, like ʰ for aspiration. I wonder if we can iterate with a bigram char window, and when we hit (X, ʰ) and X is a vowel, then ʰ shouldn't append to X but prepend the next char instead. Now how would WikiPron know the consonant vs vowel distinction? Hard-code a list(s) or pull from somewhere (e.g., possibly https://github.com/notnami/phonemes? not sure). A variant of this approach is to loop through trigrams of chars for more fine-grained control: with (X, diacritic, Y), decide whether we want segmentation between X and diacritic, and between diacritic and Y -- bonus point for "fixing" the diacritic if applicable. A simple machine-learning-y solution is also possible, but it might be overkill. (If we don't get to resolving the aspiration and other similarly ambiguous diacritics for this year's data release, I'd be fine with it.)
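Points 2 and 3 can be sketched together. Everything named here is hypothetical: the symbol lists are hard-coded and deliberately incomplete, the Unicode category test is a guess at how modifiers are detected, and the input is assumed to have stress and syllable marks already stripped (they are modifier letters too and would otherwise be treated as diacritics).

```python
import unicodedata

# Hypothetical, incomplete symbol lists for illustration only.
VOWELS = frozenset("aeiouæøɛɔɪʊ")
ALWAYS_PREPEND = frozenset("ⁿ")  # Point 2: e.g. prenasalization.
AMBIGUOUS = frozenset("ʰ")       # Point 3: may attach left or right.


def segment_sketch(ipa: str) -> str:
    segments = []
    pending = ""  # Modifiers waiting for the next base symbol.
    for char in ipa:
        if unicodedata.category(char) not in ("Mn", "Lm"):
            # Base symbol: it takes any modifiers that were waiting.
            segments.append(pending + char)
            pending = ""
        elif (
            pending
            or not segments
            or char in ALWAYS_PREPEND
            or (char in AMBIGUOUS and not VOWELS.isdisjoint(segments[-1]))
        ):
            # Prepend: transcription-initial, always-prepending, or an
            # ambiguous modifier directly after a vowel segment.
            pending += char
        else:
            # Default behavior: append to the previous segment.
            segments[-1] += char
    if pending:
        # Nothing followed the modifiers; fall back to appending.
        if segments:
            segments[-1] += pending
        else:
            segments.append(pending)
    return " ".join(segments)


print(segment_sketch("kʰoːʰkʊsnøːʰt"))  # kʰ oː ʰk ʊ s n øː ʰt
print(segment_sketch("aⁿda"))           # a ⁿd a  (made-up string)
```

The vowel test is the weak point: it needs a consonant/vowel inventory from somewhere, and a rule this simple would wrongly prepend a genuinely post-vocalic ʰ.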

Increasingly, I think we're somewhat close to the word segmentation problem....

kylebgorman commented 5 years ago

PanPhon has language-specific information about diacritics. Screenshot from the 2016 COLING paper. But like Phoible I'm not immediately sure how to exploit it...

jacksonllee commented 5 years ago

Just came across this by chance. When Config takes no params (except for the obligatory key), _parse_combining_modifiers() also hits the IndexError: pop from empty list when the pron begins with a stress symbol.
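One plausible explanation, assuming the parser identifies combining symbols by Unicode general category (an assumption; the actual check is not shown in this thread), is that the IPA stress and length marks are modifier letters just like ʰ. The helper below is hypothetical and only inspects the categories:

```python
import unicodedata


def categories(symbols: str) -> dict:
    """Maps each symbol to its Unicode general category."""
    return {symbol: unicodedata.category(symbol) for symbol in symbols}


# Primary stress, secondary stress, length mark, aspiration.
for symbol, category in categories("ˈˌːʰ").items():
    print(f"U+{ord(symbol):04X}", category, unicodedata.name(symbol))
```

All four fall in category Lm, so a category-based parser would try to attach a transcription-initial stress mark to a preceding symbol that does not exist.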

In [1]: import wikipron

In [2]: config = wikipron.Config(key="eng")

In [3]: config.process_pron("ˌæb.oʊˈmaɪ.sɪn")  # from English
IndexError: pop from empty list

I think we didn't catch this before because we almost always drop the stress and syllable boundary marks. Compare with this:

In [5]: config = wikipron.Config(key="eng", no_stress=True)

In [6]: config.process_pron("ˌæb.oʊˈmaɪ.sɪn")
Out[6]: 'æ b . o ʊ m a ɪ . s ɪ n'

kylebgorman commented 5 years ago

Good catch.

kylebgorman commented 5 years ago

Chapter 5 of this free book has some details about ambiguity inherent in IPA transcriptions: http://langsci-press.org/catalog/book/176

kylebgorman commented 5 years ago

Okay, having read a bit further they recommend a (pip-installable) package called segments. Examples follow:

import segments
tokenizer = segments.Tokenizer()

print(tokenizer("kʰoːʰkʊsnøːʰt", ipa=True))
# Prints: 'kʰ oːʰ k ʊ s n øːʰ t'
print(tokenizer("ʰidrɔɟɛn", ipa=True))
# Prints: 'ʰi d r ɔ ɟ ɛ n'

Is this the solution to our problem?!

jacksonllee commented 5 years ago

Wow that's a nice find! My vote would be to delegate the IPA segmentation to segments or any other easily available package, even if it may not be perfect (e.g., the segmented kʰ oːʰ k ʊ s n øːʰ t might not be ideal for preaspiration in Faroese).

lfashby commented 5 years ago

Looks pretty good. I did some testing too.

import segments
tokenizer = segments.Tokenizer()

# Handles multiple post diacritics well
print(tokenizer("kʰaiɲtʰʲ", ipa=True)) # kʰ a i ɲ tʰʲ

# Doesn't handle prenasalization well, but handles tie-bars
print(tokenizer("uˈpɛ.ⁿdɔt͡s", ipa=True)) # u ˈp ɛ .ⁿ d ɔ t͡s

# Handles word-initial diacritics
print(tokenizer("ʷoˈtɤu̯", ipa=True)) # ʷo ˈt ɤ u̯

It looks like you can set 'profiles' to help define how the tokenization should work. I'm not sure how robust the profiles can be but maybe it's an easy way of specifying the environments in which preaspiration or prenasalization occur in specific languages.

Though I've read that diacritic-phone-diacritic or diacritic-diacritic-phone sequences exist, I've yet to see one on Wiktionary or in our data (I've only seen phone-diacritic-diacritic). Would it be worth writing something to check our data for those sequences?
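Such a check could look roughly like this. It is a sketch under assumptions: the input is taken to be TSV lines whose second column holds space-separated segments, diacritics are identified by Unicode category (Mn or Lm), and both function names are made up. Stress and length marks are modifier letters as well, so they would be flagged unless stripped first.

```python
import unicodedata


def shape(segment: str) -> str:
    """Maps a segment to a string of D (diacritic) and P (phone)."""
    return "".join(
        "D" if unicodedata.category(char) in ("Mn", "Lm") else "P"
        for char in segment
    )


def find_pre_diacritic_segments(lines):
    """Yields (line number, segment, shape) for diacritic-first segments."""
    for lineno, line in enumerate(lines, 1):
        parts = line.rstrip("\n").split("\t", 1)
        if len(parts) < 2:
            continue  # Skip lines without a pronunciation column.
        for segment in parts[1].split():
            segment_shape = shape(segment)
            # Flags shapes such as DP, DDP, and DPD; bare D is ignored.
            if segment_shape.startswith("D") and "P" in segment_shape:
                yield lineno, segment, segment_shape


# Placeholder spellings (w1, w2, w3); the last row is made up.
sample = ["w1\th ɛː ɾ", "w2\tʰi d r ɔ ɟ ɛ n", "w3\ta ʰkʷ a"]
for hit in find_pre_diacritic_segments(sample):
    print(hit)
```

Because it accepts any iterable of lines, the generator can be fed fileinput.input() in the same way as Kyle's script above.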

Edit: Judging by one of the tickets in their issue page, it looks like they are aware of the prenasalization/preaspiration issue as well and recommend using profiles.

kylebgorman commented 5 years ago

So what do you all think? @m-sean? Should we use this instead? Just tested:

tokenize = functools.partial(segments.Tokenizer(), ipa=True)

m-sean commented 5 years ago

Yeah, this looks like a simpler way to handle segmentation and I doubt I could come up with a rule-based method that’s sufficiently better. Sounds like a fun ML project to look into though ;). Another alternative could be to just pad all characters with whitespace, but it’s not clear to me if that would be preferable over a potential mixed bag of correctly and incorrectly parsed segments.

kylebgorman commented 5 years ago

Okay, reassigning this to myself.