CUNY-CL / wikipron

Massively multilingual pronunciation mining
Apache License 2.0
320 stars 71 forks source link

[arm] can't use wikipron because of potential readtimeout? Can we use a wiktionary dump? #299

Closed jhdeov closed 3 years ago

jhdeov commented 3 years ago

Hello

I can use the terminal version of WikiPron to scrape a small language like Amharic [amh], and to scrape a big one like French. But when I try to run it on Armenian [hye or arm], the code just stops running after an hour and outputs nothing; no errors are thrown, either. I suspect the code is hitting a ReadTimeout error and then skipping it.

I suspect there's a ReadTimeout error because, in the past, I used other Wiktionary extractors like Wiktextract, and that took 9-12 hours to scrape the Armenian words (just 17k words). I suspect the Armenian entries are so oddly dispersed across Wiktionary that it takes a while for some scrapers to find them. Granted, Wiktextract was using a Wiktionary dump, and that's how it managed to eventually work. Can WikiPron work over a Wiktionary dump, or does it need an active internet connection?

kylebgorman commented 3 years ago

Hi Hossep,

So first off, we release a regular dump of WikiPron run over every language and dialect we know of. We update it at least annually. The Armenian ones were made this summer, when we added separate "dialects" (Western and Eastern are sometimes distinguished). I plan to update the entire collection shortly. This "dump" is here. This seems to us a better compromise than extracting from a frozen dump: UniMorph does that, for instance, and they've set things up so that it's impossible to advance beyond 2017! FWIW, I will kick off an overall "rescrape" next week (it runs for about a week, using a fast and reliable connection at my lab) and update this dumped data set.

FWIW, the latency you see has little to do with "searching" in any relevant sense. The code hits the backend for a pre-computed list of all Armenian headwords (4k or so at a time), which takes no more than a few seconds. The vast majority of the time is spent waiting for the server to serve up the actual HTML of the pages for those headwords. Finally, our requests identify WikiPron as a bot (the polite thing to do) so the administrators can traffic-shape us if necessary.
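That batched headword listing can be sketched roughly like this (a hypothetical helper, not WikiPron's actual code; the parameter names are standard MediaWiki API query parameters, but the function and category names here are illustrative):

```python
# Hypothetical sketch of paginated headword listing against the
# MediaWiki API; WikiPron's actual implementation differs in detail.
from typing import Optional


def category_params(category: str, cont: Optional[str] = None) -> dict:
    """Builds query parameters for one page of category members."""
    params = {
        "action": "query",
        "format": "json",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmlimit": "500",  # per-request cap for ordinary clients
    }
    if cont is not None:
        # Resume where the previous response's "cmcontinue" left off.
        params["cmcontinue"] = cont
    return params
```

In use, a client would loop: issue the request, collect the titles, and feed each response's `cmcontinue` token back into the next call until the server stops returning one.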

The extraction does sometimes hang, though if the server actually hangs it usually hangs up on us, which generates a fatal exception. I've not seen your particular pattern. Note that Python buffers IO (you can control this with command-line arguments to the interpreter and/or environment variables, which may help you get more immediate results). The script that generates the "dump" catches hangs, and it generally runs in one go over about a week, from my office, with a very fast and reliable internet connection.

We have an open "enhancement" issue (#253) to make scrapes restartable. It's not clear how to do it but I think it's a good idea.

jhdeov commented 3 years ago

I see, thanks for the explanation.

To give more context, this is what my terminal shows for the different languages.

For example, when I try Amharic, it just works:

$ wikipron amh > amh.tsv
INFO: Language: "Amharic"
INFO: No cut-off date specified
$

The output file has stuff in it.

French also works, and it prints the progress checkpoints:

$ wikipron fra > fr.tsv
INFO: Language: "French"
INFO: No cut-off date specified
INFO: 100 pronunciations scraped
INFO: 200 pronunciations scraped
INFO: 300 pronunciations scraped
INFO: 400 pronunciations scraped
INFO: 500 pronunciations scraped
...

But Armenian just finishes in an hour and gives me an empty file:

$ wikipron hye > hye1.tsv
INFO: Language: "Armenian"
INFO: No cut-off date specified
$

Regarding this statement:

(you can control this with command-line arguments to the interpreter and/or environment variables, which may help you get more immediate results)

Can you recommend a command to run? If the entire process would take a week on my residential internet connection, then I can just sit tight for your eventual rescrape :D

lfashby commented 3 years ago

Hello,

Perhaps this behavior is due to all Armenian entries being 'phonetic' transcriptions? wikipron arm will run for about an hour and output nothing, since no Armenian transcriptions are wrapped in slashes (/.../). Running wikipron arm --phonetic appears to work as intended.

kylebgorman commented 3 years ago

Oh yes, that is a factor. There are a few hundred phonemic transcriptions (that's the default) but they're few and far between.

To disable buffering from the command line, I believe you can do

PYTHONUNBUFFERED=1 your_python_command

(Not tested.)

jhdeov commented 3 years ago

Huh, cool. I'm not surprised.

The Armenian IPA entries are all done via a script that converts a cleaned-up version of the orthographic form, explained here.

For example, for the orthographic word անտերունչ, the IPA is [ɑntɛˈɾunt͡ʃʰ], but the edit page shows just:

===Pronunciation===

  • {{audio|hy|Hy-անտերունչ.ogg|Audio}} {{hy-pron}}

I'm trying it now and it's working :D

$ wikipron arm > hye1.tsv --phonetic
INFO: Language: "Armenian"
INFO: No cut-off date specified
INFO: 100 pronunciations scraped
INFO: 200 pronunciations scraped
...

Though I thought that your code would extract both phonemic and phonetic transcriptions? Maybe I misunderstood the "Phonetic versus phonemic transcription" section of your paper.

Does your scraped data set mark any of the phonemic transcriptions? Because if so, then their entries should be fixed by a Wiktionary user (like me).

kylebgorman commented 3 years ago

The way things are set up, we just scrape twice, once for phonemic (literally just whatever is in /.../) and once for phonetic ([...]), and store them in separate files. Some languages have a lot of one and very few of the other; if a file ends up with fewer than 100 entries, we don't include it in the "big scrape" database. But Armenian has enough phonemic transcriptions (for both dialects) to make the cut.
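The two passes differ only in which delimiters they accept; a minimal sketch of that distinction (hypothetical, not WikiPron's actual extraction code) might look like:

```python
import re

# Classify a raw IPA string by its delimiters, as in the two-pass
# scrape described above (hypothetical sketch, not WikiPron's code).
PHONEMIC = re.compile(r"^/(.+)/$")    # e.g. /bɑːk/
PHONETIC = re.compile(r"^\[(.+)\]$")  # e.g. [ɑntɛˈɾunt͡ʃʰ]


def classify(ipa: str):
    """Returns ("phonemic", body), ("phonetic", body), or None."""
    if (m := PHONEMIC.match(ipa)):
        return "phonemic", m.group(1)
    if (m := PHONETIC.match(ipa)):
        return "phonetic", m.group(1)
    return None
```

Under this scheme, an Armenian entry whose template only ever emits bracketed transcriptions would yield nothing at all on the phonemic pass, matching the empty-file behavior above.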

What do you mean by "mark any of the phonemic transcriptions"? It does extract them: the big scrape will have a file with a name like arm_e_phonemic.tsv for the Eastern dialect, and so on.

PS: I didn't realize you could put > foo before a flag and have it still work. I would have written wikipron --phonetic arm > hye1.tsv and wouldn't have guessed what you did would work, but I guess the shell pulls redirections out wherever they appear in the command, huh.

jhdeov commented 3 years ago

But Armenian has enough phonemic transcriptions (for both dialects) to make the cut.

But your dataset only has "phonetic" and "phonetic filtered" (and it's unclear what "phonetic filtered" means).

PS: Yeah that surprised me too :D

kylebgorman commented 3 years ago

Oh no you're right, I misread this earlier.


jhdeov commented 3 years ago

Side-note: Is there any documentation on what you mean by "phonetic-filtered" on your scraped datasets?

kylebgorman commented 3 years ago

No, but I think it's obvious enough: it uses the appropriate phone lists in the phones directory to filter entries.


jhdeov commented 3 years ago

Oh, so at some point you manually convert the original stuff from Wiktionary into the segments in the phone list (e.g. data/phones/arm_e_phonetic.phones)? But is there documentation on which specific graphemes (or transcribed phonemes) you are converting from and to?

kylebgorman commented 3 years ago


No. It's a filter: if any phone in an entry isn't present in the phone list, we leave that pronunciation entry out of the corresponding "filtered" file. This happens during the "big scrape" procedure that generates the static database. It's not part of the command-line tool itself, though one could filter that way if one wanted to.

For instance, Bach is given two RP pronunciations: /bɑːx, bɑːk/. We throw the former out of the filtered file because /x/ isn't a phoneme of modern English (sorry, SPE). No transformations are applied: we don't have a mechanism for that yet.
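The filter amounts to a membership check over segments; here is a minimal sketch (hypothetical helper, assuming segments are space-separated as in WikiPron's TSV output):

```python
# Sketch of the phone-list filter described above (hypothetical
# helper; not WikiPron's actual "big scrape" code).


def keep(pron: str, phonelist: set) -> bool:
    """True iff every segment of the pronunciation is in the phone list."""
    return all(phone in phonelist for phone in pron.split())


# An English phone list lacking /x/ drops "b ɑː x" but keeps "b ɑː k".
```

Because it only tests membership, such a filter can remove entries but never rewrite them, which is why no grapheme-to-phone conversion is involved.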

The only graphemic transformation applied is optional case-folding. For more information, see data/src, the README, and the code therein.

K

kylebgorman commented 3 years ago

Linking this to #253 and closing for now.

jhdeov commented 3 years ago

Ah, thanks for the clarification. The README doesn't mention the words 'case' or 'filter', though.

kylebgorman commented 3 years ago

You're welcome to update it. It's really only for maintainers, though: three people have used that code so far. Also, the filtering mechanism isn't "released" yet; we'll do that before long.


jhdeov commented 3 years ago

Well, I found the ACL paper pretty helpful for understanding almost everything about how to interpret your arm scrape (and gauge how correct it was). The README covered some holes the ACL paper left open.

I would update the README myself but then I would fear saying something wrong :/

PS: "before long." is a new construction for me.