Closed jhdeov closed 3 years ago
Hi Hossep,
So first off, we release a regular dump of Wiktionary run over every language and dialect we know of; we update this at least annually. The Armenian ones were made this summer, when we added separate "dialects" (Western and Eastern are sometimes distinguished). I plan to update the entire collection shortly. This "dump" is here. This seems to us a good compromise over using a frozen dump for extraction: UniMorph, for instance, does that, and they've set it up so that it's impossible to advance beyond 2017! FWIW, I will kick off an overall "rescrape" next week (it runs for about a week, using a fast and reliable connection at my lab) and update this dumped data set.
FWIW the latency you see has little to do with "searching" in any relevant sense. The code hits the backend for a pre-computed list of all Armenian headwords (4k or so at a time), which takes no more than a few seconds. The vast majority of time is spent waiting for the server to serve up actual HTML of the pages for those headwords. Finally, our requests identify Wikipron as a bot (the polite thing to do) so the administrators can traffic-shape us if necessary.
The extraction does sometimes hang, though if the server actually hangs up, that generates a fatal exception; I've not seen your particular pattern. Note that Python buffers IO (you can control this with command-line arguments to the interpreter and/or environment variables, which may help you get more immediate results). The script that generates the "dump" catches hangs, and it generally runs in one go over about a week, from my office, with a very fast and reliable internet connection.
We have an open "enhancement" issue (#253) to make scrapes restartable. It's not clear how to do it but I think it's a good idea.
I see, thanks for the explanation.
To give more context, this is what my terminal shows for the different languages.
For example, when I try Amharic, it just works:
```
$ wikipron amh > amh.tsv
INFO: Language: "Amharic"
INFO: No cut-off date specified
$
```
The output file has data in it.
French also works, and it prints the progress checkpoints:
```
$ wikipron fra > fr.tsv
INFO: Language: "French"
INFO: No cut-off date specified
INFO: 100 pronunciations scraped
INFO: 200 pronunciations scraped
INFO: 300 pronunciations scraped
INFO: 400 pronunciations scraped
INFO: 500 pronunciations scraped
...
```
But Armenian just finishes in an hour and gives me an empty file:
```
$ wikipron hye > hye1.tsv
INFO: Language: "Armenian"
INFO: No cut-off date specified
$
```
Regarding this statement:
> (you can control this with command-line arguments to the interpreter and/or environment variables, which may help you get more immediate results)
Can you recommend any command to run? If the entire process would take a week for my residential internet connection, then I can just sit tight for your eventual rescrape :D
Hello,
Perhaps this behavior is due to all Armenian entries being 'phonetic' transcriptions? wikipron arm will run for about an hour and output nothing, since no Armenian transcriptions are contained within /.../ slashes. Running wikipron arm --phonetic appears to work as intended.
Oh yes, that is a factor. There are a few hundred phonemic transcriptions (that's the default) but they're few and far between.
To disable buffering from the command line I believe you can do

```
PYTHONUNBUFFERED=1 your_python_command
```

or pass -u to the interpreter. (Not tested.)
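As a quick sanity check of both approaches (a minimal, self-contained sketch; your actual scrape command would just be prefixed the same way):

```shell
# Both of these force unbuffered stdout/stderr in CPython, so progress
# lines appear immediately instead of sitting in a buffer.
PYTHONUNBUFFERED=1 python3 -c 'print("env var works")'
python3 -u -c 'print("-u flag works")'
# e.g. for the scrape itself: PYTHONUNBUFFERED=1 wikipron hye > hye.tsv
```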
Huh, cool. I'm not surprised.
The Armenian IPA entries are all done via a script that converts a cleaned-up version of the orthographic form, explained here.
For example, for the orthographic word անտերունչ, the IPA is [ɑntɛˈɾunt͡ʃʰ], but the edit page just has:
```
===Pronunciation===
- {{audio|hy|Hy-անտերունչ.ogg|Audio}} {{hy-pron}}
```
I'm trying it now and it's working :D
```
$ wikipron arm > hye1.tsv --phonetic
INFO: Language: "Armenian"
INFO: No cut-off date specified
INFO: 100 pronunciations scraped
INFO: 200 pronunciations scraped
....
```
Though I thought that your code would extract both phonemic and phonetic transcriptions? Maybe I misunderstood the "Phonetic versus phonemic transcription" section of your paper.
Does your scraped data set mark any of the phonemic transcriptions? Because if so, their entries could be fixed by a Wiktionary user (like me).
The way things are set up, we just scrape twice, once for phonemic (literally just whatever is in //) and once for phonetic ([]) and store them in separate files. Some languages have a lot of one and very few of the other; if a file ends up <100 entries, we don't include it in the "big scrape" database. But Armenian has enough phonemic transcriptions (for both dialects) to make the cut.
What do you mean by "mark any of the phonemic transcriptions"? It does extract them: the big scrape will have a file with a name like arm_e_phonemic.tsv for the Eastern dialect, and so on.
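The two-pass distinction described above (phonemic entries delimited by slashes, phonetic ones by square brackets) can be sketched like this. This is a toy illustration of the convention, not wikipron's actual code:

```python
def transcription_type(pron: str) -> str:
    """Classify a raw Wiktionary pronunciation string by its delimiters.

    Phonemic transcriptions are wrapped in slashes (/.../), phonetic
    ones in square brackets ([...]); anything else is unrecognized.
    """
    pron = pron.strip()
    if pron.startswith("/") and pron.endswith("/"):
        return "phonemic"
    if pron.startswith("[") and pron.endswith("]"):
        return "phonetic"
    return "unknown"

# The Armenian entries discussed in this thread are phonetic:
print(transcription_type("[ɑntɛˈɾunt͡ʃʰ]"))  # phonetic
print(transcription_type("/bɑːk/"))          # phonemic
```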
PS: I didn't realize you could put > foo before a flag and have it still work. I would have written wikipron --phonetic arm > hye1.tsv and wouldn't have guessed that what you did would work, but I guess the shell really does allow redirections anywhere in a command, huh.
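For what it's worth, that behavior is the shell's doing, not Python's: a POSIX shell strips redirections from anywhere in a simple command before the program ever sees its arguments. A tiny illustration (the file names are throwaway examples):

```shell
cd "$(mktemp -d)"
# These two commands are equivalent: the shell removes the `> out.txt`
# redirection wherever it appears in the command line.
echo hello > out1.txt
echo > out2.txt hello
cat out1.txt   # hello
cat out2.txt   # hello
```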
> But Armenian has enough phonemic transcriptions (for both dialects) to make the cut.
But your dataset only has phonetic and phonetic filtered (and it's unclear what phonetic filtered means).
PS: Yeah that surprised me too :D
Oh no you're right, I misread this earlier.
Side-note: Is there any documentation on what you mean by "phonetic-filtered" on your scraped datasets?
No, but I think it's obvious enough: it uses the appropriate phonelists in the phones directory to filter entries.
Oh, so at some point you manually convert the original stuff from Wiktionary into the segments in the phone list (data/phones/arm_e_phonetic.phones)? But is there documentation on what the specific graphemes (or transcribed phonemes) are that you are converting from and to?
No, it's a filter: if any phone in an entry isn't present in the phonelist, we leave that pronunciation entry out of the corresponding "filtered" file. This happens during the "big scrape" procedure that generates the static database; it's not part of the command-line tool itself, though one could filter that way if you want to.
For instance, Bach is given two RP pronunciations: /bɑːx, bɑːk/. We throw the former out in the filtered file because /x/ isn't a phoneme of modern English (sorry, SPE). No transformations are applied: we don't have a mechanism for that yet.
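The filtering step just described can be sketched as follows. This is an illustrative reimplementation, not the project's code; it assumes space-separated phone strings as in the WikiPron TSV format, and the phone set here is a made-up fragment:

```python
def filter_prons(entries, phonelist):
    """Keep only (word, pron) pairs whose every phone is in phonelist.

    entries: list of (word, pron) where pron is a space-separated
    phone string; phonelist: the set of licensed phones.
    """
    kept = []
    for word, pron in entries:
        # Drop the entry if any of its phones is unlicensed.
        if all(phone in phonelist for phone in pron.split()):
            kept.append((word, pron))
    return kept

# A toy English phonelist lacking /x/, so the first Bach entry is dropped:
eng_phones = {"b", "ɑː", "k"}
entries = [("Bach", "b ɑː x"), ("Bach", "b ɑː k")]
print(filter_prons(entries, eng_phones))  # only the second entry survives
```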
The only graphemic transformation applied is optional case-folding. For more information see data/src, the README, and the code therein.
K
Linking this to #253 and closing for now.
Ah thanks for the clarification. The README doesn't mention the words 'case' or 'filter' though.
You're welcome to update. It's really only for maintainers though: 3 people have used that code so far. Also the filtering mechanism isn't "released" yet, we'll do that before long.
Well, I found the ACL paper pretty helpful for understanding how to interpret your arm scrape (and how correct it was). The README filled in some holes that the ACL paper didn't cover.
I would update the README myself but then I would fear saying something wrong :/
PS: "before long." is a new construction for me.
Hello
I can use the terminal version of WikiPron to scrape a small language like Amharic [amh], and to scrape a big one like French. But when I try to run it on Armenian [hye or arm], the code just stops running after an hour and outputs nothing; there aren't even any errors thrown. I suspect the code is hitting a read-timeout error and then skipping it.
I suspect there's a read-timeout error because in the past I used other Wiktionary extractors like Wiktextract, and it took 9-12 hours to scrape the Armenian words (just 17k words). I suspect the Armenian entries are so oddly dispersed across Wiktionary that it takes a while for some scrapers to find them. Granted, Wiktextract was using a Wiktionary dump, and that's how it managed to eventually work. Can WikiPron work over a Wiktionary dump, or does it need an active internet connection?