CUNY-CL / wikipron

Massively multilingual pronunciation mining
Apache License 2.0
321 stars 71 forks

scraping audio files? #466

Closed jhdeov closed 1 year ago

jhdeov commented 2 years ago

Do you think there's a reasonable way to make an enhancement that will extract audio file URLs for Wiktionary words? At least for Armenian, the audio files are linked in the Pronunciation section.

kylebgorman commented 2 years ago

Probably, and at least one person has suggested it would be useful to them. (I myself don't have a use yet but I like the idea.)

I wonder whether this would exceed what we can store on GitHub directly, though (in terms of overall repo size; I think the limit is 5 GB). If so, we would have to do something like make a local download, upload it to, IDK, S3 or something like that, and generate a link.

On Mon, Jun 20, 2022 at 7:47 PM Hossep Dolatian @.***> wrote:

Do you think there's a reasonable way to make an enhancement that will extract audio file URLs for Wiktionary words? At least for Armenian, the audio files are linked in the Pronunciation https://en.wiktionary.org/wiki/%D5%A3%D6%80%D5%A5%D5%AC section.


jhdeov commented 2 years ago

A possible cheat is to extract only the URL of the audio file (which I think is hosted on some Wiki domain). Then you can suggest in the README some type of script where a) the user provides a WikiPron-made text file of word+transcription+URL, and b) the script bulk-downloads the audio files from those URLs.

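For example, the Wiktionary page of this word (https://en.m.wiktionary.org/wiki/%D5%A3%D6%80%D5%A5%D5%AC) has a link (https://en.m.wiktionary.org/wiki/File:Hy-%D5%A3%D6%80%D5%A5%D5%AC.ogg) to an audio file.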

PS: this could be useful for someone trying out ASR using Wiktionary :D
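The bulk-download step described above could look roughly like the following. This is only a minimal sketch, not something WikiPron provides: it assumes a hypothetical three-column TSV (word, transcription, URL), and the names `read_rows`, `fetch`, and `download_all`, as well as the User-Agent string, are made up for illustration.

```python
import csv
import pathlib
import urllib.request
from typing import Callable, Iterator, Tuple
from urllib.parse import unquote, urlsplit


def read_rows(tsv_path: str) -> Iterator[Tuple[str, str, str]]:
    """Yields (word, transcription, url) from a three-column TSV file.

    The three-column layout is an assumption; rows with any other
    number of columns are skipped.
    """
    with open(tsv_path, encoding="utf-8", newline="") as source:
        reader = csv.reader(source, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) == 3:
                yield row[0], row[1], row[2]


def fetch(url: str) -> bytes:
    """Downloads one URL and returns its bytes."""
    # A descriptive User-Agent is polite; this value is a placeholder.
    request = urllib.request.Request(
        url, headers={"User-Agent": "audio-fetch-sketch/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


def download_all(
    tsv_path: str, out_dir: str, fetcher: Callable[[str], bytes] = fetch
) -> int:
    """Saves each URL's audio into out_dir; returns the number of new files."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = 0
    for _word, _pron, url in read_rows(tsv_path):
        # Name the local file after the last component of the URL path.
        name = pathlib.PurePosixPath(unquote(urlsplit(url).path)).name
        target = out / name
        if not name or target.exists():
            continue  # Nothing to name the file after, or already downloaded.
        target.write_bytes(fetcher(url))
        saved += 1
    return saved
```

A call such as `download_all("hye.tsv", "audio")` (both names hypothetical) would then fill the `audio` directory; passing a different `fetcher` makes it easy to add rate limiting or to try the logic without network access.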

kylebgorman commented 2 years ago

That's a good idea.

The person who is probably most in the market for this is Alan Black at CMU.

On Wed, Jun 22, 2022 at 4:16 PM Hossep Dolatian @.***> wrote:

A possible cheat is that you extract only the URL of the audio file (that I think is hosted on some Wiki domain). Then you can suggest in the README some type of script where a) the user provides a Wikipron-made text file of word+transcription+URL, and b) the script bulk-downloads the audio files of the URL.

For example, the Wiktionary page of this word https://en.m.wiktionary.org/wiki/%D5%A3%D6%80%D5%A5%D5%AC has a link https://en.m.wiktionary.org/wiki/File:Hy-%D5%A3%D6%80%D5%A5%D5%AC.ogg to an audio file.

I know that the way Wiktionary works is that users independently upload their audio recordings into some Wiki site. And then Wiktionary links a Wiktionary entry (if it exists) to the audio file (if it exists). That way, a user can make an audio file let's say today, but then next week they make the Wiktionary entry, and then Wiktionary will link the entry with the audio file.
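Since the link on the entry points at a "File:" description page rather than at the audio itself, a scraper would need one more hop to reach a downloadable URL. One possible way to do that, sketched below, is MediaWiki's Special:FilePath page, which redirects to the underlying file. The sketch assumes the recording lives on Wikimedia Commons, which is not confirmed in this thread; the function name and the `COMMONS_FILEPATH` constant are illustrative only.

```python
from urllib.parse import quote, unquote, urlsplit

# Assumption: the audio file is hosted on Wikimedia Commons.
COMMONS_FILEPATH = "https://commons.wikimedia.org/wiki/Special:FilePath/"


def file_page_to_download_url(file_page_url: str) -> str:
    """Maps a 'File:' description page URL to a Special:FilePath URL."""
    path = unquote(urlsplit(file_page_url).path)
    # The page title is everything after "/wiki/".
    _, separator, title = path.partition("/wiki/")
    namespace, _, filename = title.partition(":")
    if not separator or namespace != "File" or not filename:
        raise ValueError(f"Not a File: page URL: {file_page_url}")
    # Re-encode the file name so the result is a valid ASCII URL.
    return COMMONS_FILEPATH + quote(filename)


url = file_page_to_download_url(
    "https://en.m.wiktionary.org/wiki/File:Hy-%D5%A3%D6%80%D5%A5%D5%AC.ogg"
)
print(url)
```

The resulting URL could then go into the third column of a word+transcription+URL file.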


rovr commented 2 years ago

"at least one person has suggested it would be useful to them"

This would be very useful to me as well (the "word+transcription+URL" combo). Anything I could help with here?

kylebgorman commented 2 years ago

There's a paper at LREC that seems to do exactly this: http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.140.pdf

If it meets the stated need, I'd say that WikiPron doesn't have to do it; you can just merge whatever you want from WikiPron with that source.

kylebgorman commented 1 year ago

I think I am going to close this as wontfix because I don't see us doing this in the near future.