Tatoeba / tatoeba2

Tatoeba is a platform whose purpose is to create a collaborative and open dataset of sentences and their translations.
https://tatoeba.org
GNU Affero General Public License v3.0

Allow researchers to easily download audio recordings #2502

Open jiru opened 3 years ago

jiru commented 3 years ago

Story

Researchers are using audio recordings from Tatoeba in their work, for example to evaluate speech recognition models. Currently, anyone who wants the audio has to download the files manually, write their own script, or rely on outdated datasets.

Recently we received an email from a researcher whose speech recognition model spotted a few mismatches between audio and text. Even if we fix the errors, researchers won’t get the fixes unless they download the updated audio.

Ideas

LBeaudoux commented 3 years ago

Create and maintain a script/tool that can automatically download all the audio from Tatoeba for a given language. The tool should also allow updating after the initial download.

Maybe I could add this feature to my tatoebatools library?

jiru commented 3 years ago

Of course, but I am afraid that a solution based solely on the CSVs and regular HTTP download would have some limitations by design, such as:

LBeaudoux commented 3 years ago

Slow download because of the many small files (could be worked around with parallel downloads, but that might easily trigger the server’s rate limiting)

I assumed that Tatoeba would periodically generate and release monolingual archives of audio files.

How can the script know if a given audio has been updated or removed?

If the created and modified fields of the audios table are added to the sentences_with_audio CSV export files, then the versions of the local audio files could be compared to the versions of the files available online so that only newly created or modified files are downloaded.
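
As a rough illustration of that comparison, here is a minimal sketch. It assumes a tab-separated export with the columns audio_id, sentence_id and modified; that layout, the `plan_sync` name and the `local_versions` dict are all hypothetical and only serve the example. The current export does not contain these timestamps.

```python
import csv
import io
from datetime import datetime


def plan_sync(remote_csv_text, local_versions):
    """Return (to_download, to_delete) for a delta update.

    remote_csv_text: tab-separated rows of audio_id, sentence_id, modified
                     (hypothetical column layout, see the note above).
    local_versions:  dict mapping audio_id -> modified timestamp of the
                     copy that was downloaded previously.
    """
    remote = {}
    reader = csv.reader(io.StringIO(remote_csv_text), delimiter="\t")
    for row in reader:
        if not row:
            continue  # tolerate blank lines in the export
        audio_id, _sentence_id, modified = row
        remote[audio_id] = datetime.fromisoformat(modified)

    # New files, plus files whose remote timestamp is newer than the local one.
    to_download = sorted(
        audio_id
        for audio_id, modified in remote.items()
        if audio_id not in local_versions or local_versions[audio_id] < modified
    )
    # Files we hold locally that no longer appear in the export.
    to_delete = sorted(set(local_versions) - set(remote))
    return to_download, to_delete
```

Only the ids returned in `to_download` would need to be fetched, and the ids in `to_delete` could be removed locally, which also covers audio that has been removed from the site.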

jiru commented 3 years ago

I assumed that Tatoeba would periodically generate and release monolingual archives of audio files.

If we are to generate such archives, then there is little need for a dedicated download script, because getting the audio becomes as simple as downloading the current CSVs. Besides, as a drawback, archives make it very hard to allow delta updates. I think we should implement either static archives or a dedicated script, not both.

If the created and modified fields of the audios table are added to the sentences_with_audio CSV export files

Yes. There is one minor drawback, however: a change in those fields is no guarantee that the audio file actually changed. For example, if an admin temporarily disables audio on the sentence page for some reason and then re-enables it, the created field changes but the audio file doesn’t need to be downloaded again.

LBeaudoux commented 3 years ago

Maybe Tatoeba should opt for an object storage service to manage its media files as many other websites do.

Those services offer sync tools. For example, you can mirror files from an AWS S3 bucket to a local directory with a command like this one: aws s3 sync s3://mybucket .

jiru commented 2 years ago

I had a look at https://archive.org/ and I think it could be a good place to host the audio that is under Creative Commons licenses. They provide a CLI called ia to automate uploads, and the underlying architecture is basically S3.

I considered publishing the audio under the community audio collection, however it looks like it’s more for audio people would normally listen to, such as music, movies, podcasts etc. The dataset collection looks more appropriate. They ask not to upload more than 10k files per set, so we’d have to create zip archives. Since the license and language can only be specified on the item level (not file level), we’d rather create one item containing just one zip, for each voice/language pair.
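
To make the "one item per voice/language pair" idea concrete, here is a small sketch of the grouping step. The record shape (audio_id, language, username) and the zip naming scheme are assumptions made up for this example, not something Tatoeba or the Internet Archive prescribes.

```python
from collections import defaultdict


def group_by_voice_and_language(records):
    """Group audio records into one bucket per (language, username) pair.

    records: iterable of (audio_id, language, username) tuples
             (a hypothetical shape chosen for this example).
    """
    items = defaultdict(list)
    for audio_id, language, username in records:
        items[(language, username)].append(audio_id)
    return dict(items)


def zip_name(language, username):
    # Hypothetical naming scheme for the single zip held by each item.
    return f"tatoeba_audio_{language}_{username}.zip"
```

Each key of the returned dict would then correspond to one item holding one zip, so the license and language metadata can be set once per item.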

Zipping could be tricky: CK’s dataset is already about 15GB, and we don’t have enough space to write such a big temporary zip file before upload. ia also does not allow "streamed" upload, because it fails without a Content-Length header.

Properly setting the language metadata would require some conversion because they use MARC codes, which are more or less ISO 639-2. (We use ISO 639-3 and there are subtle differences such as fre instead of fra.)

LBeaudoux commented 2 years ago

we don’t have that much space to write such a big temporary zip file before upload

Moving all audio files used in production to an object storage service would free up a lot of storage on the Tatoeba server. Then, you could also zip and upload to the Internet Archive from the cloud.

they use MARC codes, which is more or less ISO 639-2

As long as you stick to the codes and don't use language names, it seems that MARC codes match ISO 639-2 bibliographic codes. If you choose Python for your script, I recommend this library I created to handle language code conversion.
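
For illustration only (this is not the library mentioned above), the conversion can be sketched with a plain lookup table. ISO 639-2 has bibliographic (B) and terminology (T) forms that differ for 20 languages, and ISO 639-3 uses the T form for those languages. The `to_bibliographic` name is hypothetical, and the sketch assumes the input code has an ISO 639-2 counterpart: codes that exist only in ISO 639-3 are returned unchanged and would need separate handling.

```python
# The 20 languages whose ISO 639-2 terminology (T) code, shared with
# ISO 639-3, differs from the bibliographic (B) code.
T_TO_B = {
    "sqi": "alb", "hye": "arm", "eus": "baq", "mya": "bur", "zho": "chi",
    "ces": "cze", "nld": "dut", "fra": "fre", "kat": "geo", "deu": "ger",
    "ell": "gre", "isl": "ice", "mkd": "mac", "mri": "mao", "msa": "may",
    "fas": "per", "ron": "rum", "slk": "slo", "bod": "tib", "cym": "wel",
}


def to_bibliographic(iso639_3_code):
    """Map an ISO 639-3 code to its ISO 639-2/B form where the two differ.

    Codes not in the table are returned as-is, which is only correct when
    the same code also exists in ISO 639-2 (an assumption of this sketch).
    """
    return T_TO_B.get(iso639_3_code, iso639_3_code)
```

For example, `to_bibliographic("fra")` gives `"fre"`, while `to_bibliographic("eng")` stays `"eng"`.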