CUNY-CL / wikipron

Massively multilingual pronunciation mining
Apache License 2.0

[hbs] Serbo-Croatian support #62

Closed: lfashby closed this 4 years ago

lfashby commented 5 years ago

Wiktionary only lists Serbo-Croatian and does not have separate categories for Serbian, Croatian, Montenegrin, and Bosnian, the languages that the ISO 639-3 [hbs] code encapsulates.

WikiPron can handle scraping if given [hbs], but the resulting TSV files may not be of much use. [hbs] TSV files are therefore not included in pull request #61.

jacksonllee commented 5 years ago

WikiPron can handle scraping if given [hbs], but the resulting TSV files may not be of much use.

I've randomly picked Serbo-Croatian entries (like agregatni) to get a sense of the problem. It doesn't look like there's any indication to tell which Serbo-Croatian variety(ies) an entry like this belongs to. Would including these entries in a file like hbs_phonemic.tsv (following #61) for Serbo-Croatian be problematic? I may be missing something here.

Or is the issue about Latin vs Cyrillic orthography? I see from the Serbo-Croatian list that some entries have the head word spelled in Latin (and have the Cyrillic spelling given inside the page), while others have the head word spelled in Cyrillic (and have the Latin spelling given inside the page). Would getting either Latin or Cyrillic spelling consistently as the graphemes solve the problem?

kylebgorman commented 5 years ago

The issue is just the orthography, yes. There is no meaningful difference between "Serbian" and "Croatian"; one is just written in Cyrillic and the other in Latin. If we had some way to split the data into Cyrillic and Latin that would suffice. I don't know if this goes in WikiPron or if it's a post-processing trick to be done at the end of the "big scrape" that Lucas has been working on.

But I think we should fix it for our big debut.

lfashby commented 5 years ago

I see Jackson has assigned himself to this, but I just thought I'd comment.

If we had some way to split the data into Cyrillic and Latin that would suffice. I don't know if this goes in WikiPron or if it's a post-processing trick to be done at the end of the "big scrape" that Lucas has been working on.

Splitting the data as a post-processing trick would be quite straightforward as far as I can tell. Is relying on post-processing tricks something we'd like to avoid?
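For concreteness, here is a minimal sketch of what that post-processing could look like, splitting a scraped Serbo-Croatian TSV into a Cyrillic file and a Latin file by inspecting the script of each headword (the output filenames and the `detect_script` helper are made up for illustration, not anything already in WikiPron):

```python
import unicodedata


def detect_script(word: str) -> str:
    """Classify a word as Cyrillic, Latin, other, or mixed, based on its letters."""
    scripts = set()
    for char in word:
        if char.isalpha():
            name = unicodedata.name(char, "")
            if name.startswith("CYRILLIC"):
                scripts.add("Cyrillic")
            elif name.startswith("LATIN"):
                scripts.add("Latin")
            else:
                scripts.add("other")
    return scripts.pop() if len(scripts) == 1 else "mixed"


# Split the scraped [hbs] TSV into one file per script; entries in a mixed
# or unexpected script are simply dropped here.
with open("hbs_phonemic.tsv", encoding="utf-8") as source, \
        open("hbs_cyrl_phonemic.tsv", "w", encoding="utf-8") as cyrillic, \
        open("hbs_latn_phonemic.tsv", "w", encoding="utf-8") as latin:
    for line in source:
        word, _ = line.split("\t", 1)
        script = detect_script(word)
        if script == "Cyrillic":
            cyrillic.write(line)
        elif script == "Latin":
            latin.write(line)
```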

kylebgorman commented 5 years ago

I'm not inherently opposed to post-processing of this type.

Japanese is a not-unrelated case: standard g2p will only ever work on words written solely in kana characters, but the dictionaries give us kanji and kanji-kana-mixed words too.
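As a rough illustration of what that kana-only restriction amounts to (just a sketch of the general idea, not the logic actually used in the repo):

```python
def is_kana_only(word: str) -> bool:
    """True if every character is in the hiragana or katakana Unicode blocks."""
    return all(
        0x3040 <= ord(char) <= 0x309F      # hiragana
        or 0x30A0 <= ord(char) <= 0x30FF   # katakana
        for char in word
    )


assert is_kana_only("つよい")       # hiragana only: keep
assert not is_kana_only("強いる")   # contains kanji: skip
```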

Do these have enough in common, and an obvious interface that we can build into WikiPron?

Or do we have to do them purely as post-processing? If so, what does the interface for that post-processing look like?

lfashby commented 5 years ago

I'm not sure there is one solution that can handle both. Unlike with Serbo-Croatian, the Japanese case seems to lend itself to a WikiPron solution. Japanese entries on Wiktionary that contain kanji appear to always contain a table listing the kanji used. WikiPron could skip scraping entries that contain those tables?

Adding post-processing to the 'big scrape' workflow might involve setting a post-processing value/flag in languages.json for Japanese and Serbo-Croatian, specifying that a post-processing function should be invoked after the TSVs are generated in scrape.py. The function would then filter the Japanese TSVs or split up the Serbo-Croatian TSVs by checking the Unicode values of the characters in each entry?
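A very rough sketch of how that flag-plus-dispatch idea could be wired up; the "post_process" key in languages.json and the function names below are hypothetical, purely for illustration:

```python
import json

# Hypothetical "post_process" entries in languages.json, naming the step to
# run after scrape.py writes a language's TSVs (illustration only).
LANGUAGES = json.loads("""
{
    "hbs": {"post_process": "split_by_script"},
    "jpn": {"post_process": "filter_kana_only"}
}
""")


def split_by_script(tsv_path: str) -> None:
    ...  # split Serbo-Croatian entries into a Cyrillic TSV and a Latin TSV


def filter_kana_only(tsv_path: str) -> None:
    ...  # drop Japanese entries containing kanji or other non-kana characters


POST_PROCESSORS = {
    "split_by_script": split_by_script,
    "filter_kana_only": filter_kana_only,
}


def post_process(iso_code: str, tsv_path: str) -> None:
    """If languages.json requests a post-processing step, run it on the TSV."""
    step = LANGUAGES.get(iso_code, {}).get("post_process")
    if step is not None:
        POST_PROCESSORS[step](tsv_path)
```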

kylebgorman commented 5 years ago

I'm fine with that if it can be worked out. (Not sure if it's a function or it's like, a command to run from the shell?)

For Japanese I don't think looking for the Kanji table is the right solution: we have this instead? https://github.com/kylebgorman/wikipron/blob/master/languages/jpn_wik/scrape.py (Alan wrote this I believe, from an early sketchy draft by yours truly.)

jacksonllee commented 5 years ago

Late to the party... I assigned this ticket to myself as I thought I should have time to get to fixing/getting Serbo-Croatian data soon-ish (after knocking out some modeling work). I think a post-processing approach to splitting results into a Cyrillic TSV and a Latin TSV is reasonable, at least as a first pass for the upcoming WikiPron release. The interface, or how the post-processing is triggered, does need a bit of thinking. Please feel free to keep leaving comments here if you have specific thoughts. I'll give it a whirl and see how it goes, including trying out some of the ideas you both have brought up above. Alternatively, Lucas, since you're naturally more familiar with the big scrape code, please feel free to let me know if you already have a plan or other upstream changes that would open up a better solution downstream for post-processing, in which case I'll wait for your signal.

Re: Japanese, I've just opened #73.

kylebgorman commented 5 years ago

SGTM, thanks Jackson.

lfashby commented 5 years ago

Alternatively, Lucas, since you're naturally more familiar with the big scrape code, please feel free to let me know if you already have a plan or other upstream changes that would open up a better solution downstream for post-processing, in which case I'll wait for your signal.

I've got nothing at the moment. It'd be great if codes.py somehow told us which languages will cause us difficulty (e.g., Wiktionary entries in a different format than we're expecting, or multiple scripts in use), but I don't really know how feasible that would be to implement.