Closed jhdeov closed 1 year ago
Probably, and at least one person has suggested it would be useful to them. (I myself don't have a use yet but I like the idea.)
I wonder, though, whether this would exceed what we can store on GitHub directly (in terms of overall repo size; I think the limit is 5 GB). If so, we would have to do something like download the files locally, upload them to, IDK, S3 or something like that, and generate a link.
On Mon, Jun 20, 2022 at 7:47 PM Hossep Dolatian @.***> wrote:
Do you think there's a reasonable way to make an enhancement that will extract audio file URLs for Wiktionary words? At least for Armenian, the audio files are linked in the Pronunciation https://en.wiktionary.org/wiki/%D5%A3%D6%80%D5%A5%D5%AC section.
A possible cheat is to extract only the URL of the audio file (which I think is hosted on some Wiki domain). Then the README can suggest some type of script where a) the user provides a WikiPron-made text file of word+transcription+URL, and b) the script bulk-downloads the audio files from those URLs.
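For example, the Wiktionary page of this word https://en.m.wiktionary.org/wiki/%D5%A3%D6%80%D5%A5%D5%AC has a link https://en.m.wiktionary.org/wiki/File:Hy-%D5%A3%D6%80%D5%A5%D5%AC.ogg to an audio file.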
PS: this could be useful for someone trying out ASR using Wiktionary :D
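To make the bulk-download idea concrete, here is a minimal sketch of what such a script might look like. It assumes a hypothetical three-column TSV (word, transcription, audio URL); WikiPron does not currently emit a URL column, so that layout, the function names, and the User-Agent string are all illustrative assumptions, not an existing interface.

```python
import urllib.parse
import urllib.request
from pathlib import Path


def read_entries(lines):
    """Yield (word, transcription, url) from tab-separated lines.

    Assumes a hypothetical word<TAB>transcription<TAB>URL layout.
    Rows with a missing or empty URL column are skipped.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or not fields[2].strip():
            continue
        yield fields[0], fields[1], fields[2].strip()


def local_name(url):
    """Use the percent-decoded last path segment of the URL as the filename."""
    path = urllib.parse.urlsplit(url).path
    return urllib.parse.unquote(path.rsplit("/", 1)[-1])


def fetch(url):
    """Download one URL and return its bytes (placeholder User-Agent)."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "audio-fetch-sketch/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


def download_all(lines, out_dir, fetch=fetch):
    """Save every audio file listed in `lines` under `out_dir`.

    Files that already exist are not downloaded again, so the script
    can be re-run after an interruption.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for _word, _pron, url in read_entries(lines):
        target = out_dir / local_name(url)
        if not target.exists():
            target.write_bytes(fetch(url))
        saved.append(target)
    return saved
```

A user would call something like `download_all(open("hye.tsv", encoding="utf-8"), "audio/")`, where `hye.tsv` is a made-up filename. Passing `fetch` as a parameter is only there so the download step can be swapped out, e.g. to add rate limiting.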
That's a good idea.
The person who is probably most in the market for this is Alan Black at CMU.
On Wed, Jun 22, 2022 at 4:16 PM Hossep Dolatian @.***> wrote:
A possible cheat is to extract only the URL of the audio file (which I think is hosted on some Wiki domain). Then the README can suggest some type of script where a) the user provides a WikiPron-made text file of word+transcription+URL, and b) the script bulk-downloads the audio files from those URLs.
For example, the Wiktionary page of this word https://en.m.wiktionary.org/wiki/%D5%A3%D6%80%D5%A5%D5%AC has a link https://en.m.wiktionary.org/wiki/File:Hy-%D5%A3%D6%80%D5%A5%D5%AC.ogg to an audio file.
As far as I know, the way Wiktionary works is that users independently upload their audio recordings to some Wiki site. Wiktionary then links a Wiktionary entry (if it exists) to the audio file (if it exists). That way, a user can upload an audio file, let's say, today and create the Wiktionary entry next week, and Wiktionary will link the entry to the audio file.
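As a rough illustration of going from a File: page link like the one above to something a downloader can fetch, the sketch below rewrites the link into a `Special:FilePath` URL, which MediaWiki redirects to the underlying file. It assumes the recording is hosted on Wikimedia Commons (hence the default host); that is a guess based on the description above and may not hold for every file. The function name is made up for this example.

```python
from urllib.parse import quote, unquote, urlsplit


def direct_audio_url(file_page_url, host="commons.wikimedia.org"):
    """Turn a .../wiki/File:NAME page URL into a Special:FilePath URL.

    `host` is an assumption: it should be whichever wiki actually
    hosts the recording.
    """
    # Last path segment, percent-decoded, e.g. "File:Hy-գրել.ogg".
    title = unquote(urlsplit(file_page_url).path.rsplit("/", 1)[-1])
    if not title.startswith("File:"):
        raise ValueError(f"not a File: page URL: {file_page_url}")
    # MediaWiki titles use underscores in place of spaces.
    filename = title[len("File:"):].replace(" ", "_")
    return f"https://{host}/wiki/Special:FilePath/{quote(filename)}"
```

This only rewrites the URL; it does not check that the file exists.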
at least one person has suggested it would be useful to them
This would be very useful to me as well (the "word+transcription+URL" combo). Anything I could help with here?
There's a paper at LREC that seems to do exactly this: http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.140.pdf
If it meets the stated need, I'd say that WikiPron doesn't have to do it; you can just merge whatever you want from WikiPron with that source.
I think I am going to close this as wontfix because I don't see us doing this in the near future.