audeering / audbcards

Data cards for audio datasets
https://audeering.github.io/audbcards/

Adjust example on Datacard based on actual mime-type of media file #89

Open hagenw opened 2 months ago

hagenw commented 2 months ago

At the moment, we display all example media files as audio on a datacard, e.g.

[Image: datacard example shown as an audio player]

This also works for video files, but only the audio is played. Furthermore, we select the example file to show based on its duration.

I would propose the following improvement:

For both points to work, we need to check the MIME type of the corresponding media file.

Another question is how to handle examples for datasets that contain a mixture of different media types. At the moment, we use the dependency table of a dataset to select a meaningful example, but the dependency table stores no information about the MIME type of the included files.
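
As a rough illustration of the MIME type check discussed above, the sketch below guesses the media type from the file extension with Python's standard `mimetypes` module, so no file has to be downloaded. The helper names `media_type` and `count_media_types` and the file names are hypothetical and not part of audbcards or audb. Guessing from the extension is an assumption that may fail for unusual or missing extensions.

```python
import mimetypes
from collections import Counter


def media_type(path: str) -> str:
    """Return the top-level MIME type ("audio", "video", "text", ...) of a file.

    The type is guessed from the file extension only.
    """
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        return "unknown"
    return mime.split("/")[0]


def count_media_types(files: list[str]) -> Counter:
    """Count how many files of each top-level media type a dataset contains."""
    return Counter(media_type(f) for f in files)


# Hypothetical file names, as they might be listed in a dependency table
files = ["audio/001.wav", "audio/002.wav", "video/001.mp4", "text/001.txt"]
counts = count_media_types(files)
```

For a dataset with mixed media types, such counts could be one way to decide whether to show a single example or one example per media type.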

maxschmitt commented 2 months ago

Having the number of characters/words as a database property would also be a benefit.

hagenw commented 2 months ago

I agree, but I'm afraid that will not be easy to achieve after a dataset has been published. We can only easily access information on sampling rate, duration, and other media-related properties, as we currently store them in the dependency table when publishing a dataset. Otherwise, we would need to download every single media file to get those statistics. The number of characters/words falls into the same category: if we wanted to extract it inside audbcards, we would need to download the complete dataset first.

If we really think we need that information (and maybe others about text media files), we will have to extend the dependency table in audb.

maxschmitt commented 2 months ago

I see. Downloading all files might be cumbersome for the larger text datasets, so it makes sense to skip this for the moment, unless we have a suitable text container format other than plain text that supports metadata.

hagenw commented 2 months ago

> unless we have a suitable text container format other than plain text, which supports metadata.

But even then you would have to download all files to collect the metadata over all of them. In principle, what we need is something that gathers the information when we publish the dataset, as we have to visit every file then anyway to calculate the MD5 sum. As I said, for audio/video files we extract information on sampling rate, channels, bit depth, etc. during that phase and simply write it to the dependency table, which also tracks the versioning of the files (as we didn't have any better solution). You could also envision a central database that stores such metadata, but our goal was to be decentralized with audb. @ChristianGeng any thoughts on this?
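
Collecting metadata in the same pass that computes the MD5 sum could look roughly like the following sketch. It is not how `audb.publish()` is implemented. The function `file_stats` and the fields `characters` and `words` are hypothetical additions for illustration, and the file is assumed to be UTF-8 encoded plain text.

```python
import hashlib
import os
import tempfile


def file_stats(path: str) -> dict:
    """Compute the MD5 sum and simple text statistics with a single read."""
    with open(path, "rb") as fp:
        data = fp.read()
    text = data.decode("utf-8")
    return {
        "checksum": hashlib.md5(data).hexdigest(),
        "characters": len(text),
        "words": len(text.split()),
    }


# Small self-contained example with a temporary text file
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")
    with open(path, "w", encoding="utf-8") as fp:
        fp.write("hello audb world")
    stats = file_stats(path)
```

Since the file content is already in memory for the checksum, the extra statistics add little cost at publication time.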

ChristianGeng commented 2 months ago

> You could also envision a central database that stores such metadata, but our goal was to be decentralized with audb.
>
> @ChristianGeng any thoughts on this?

I would have nothing against a central database, but I think it should not become mandatory to use, only for a specific backend deployment. That is probably hard to implement, though. The first thing that comes to mind is a kind of hook mechanism.

Artifactory has a webhook mechanism too, but webhooks fire too late in the process chain and require you to implement a REST service that handles them. So that would be overkill.

On the client side there are other problems. Such things are often implemented as decorators, so it would not be too involved to implement, say, @onpublish decorators. The tricky bit would be how to make sure that every call to audb.publish on the audeering-internal servers really throws when the deployment-specific @onpublish is not called. I know that there is .audb.yaml, but this is a user-specific, not a deployment-specific, setting. So in short, we probably cannot configure audb for specific deployments, can we?
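
To make the decorator idea more concrete, here is a minimal sketch of such a hook mechanism. Everything in it is hypothetical: `onpublish`, the `_HOOKS` registry, and the `required_hooks` argument do not exist in audb, and `publish` is only a stand-in for `audb.publish`.

```python
from typing import Callable

# Hypothetical registry of functions to run when a dataset is published
_HOOKS: dict[str, Callable[[str], None]] = {}


def onpublish(func: Callable[[str], None]) -> Callable[[str], None]:
    """Register a function as a publish hook under its own name."""
    _HOOKS[func.__name__] = func
    return func


def publish(name: str, required_hooks: tuple[str, ...] = ()) -> list[str]:
    """Stand-in for a publish function (not audb.publish).

    Raises if a hook required by the deployment was never registered,
    otherwise runs all registered hooks and returns their names.
    """
    missing = [hook for hook in required_hooks if hook not in _HOOKS]
    if missing:
        raise RuntimeError(f"Missing publish hooks: {missing}")
    for hook in _HOOKS.values():
        hook(name)
    return list(_HOOKS)


collected = []


@onpublish
def collect_metadata(name: str) -> None:
    # A real hook might write statistics to a database
    collected.append(name)


ran = publish("mydb", required_hooks=("collect_metadata",))
```

The sketch leaves the actual open question untouched: where `required_hooks` would come from for a given deployment, since a user-level configuration file cannot enforce it.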

hagenw commented 2 months ago

Good point. If we want to store additional information during audb.publish(), the easiest solution seems to be extending what is stored in the dependency table.

For the approach with the database, I could indeed envision that we have one internally that is used when creating the HTML overview pages, and that can maybe also be accessed by single users to request entries. But I would not fill such a database directly during publication. Instead, I would have a cron job running on a compute server that fills the database. The only downside would be that it would fill up the shared cache on the compute servers with all datasets. But maybe we can also call this a feature?
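
A periodic job of that kind could be sketched as follows. The table layout, the function `update_database`, and the hard-coded dataset entries are assumptions for illustration. An in-memory SQLite database stands in for the internal database, and a real job would read the entries from the published datasets instead.

```python
import sqlite3


def update_database(conn: sqlite3.Connection, datasets: list[dict]) -> None:
    """Insert or update one row of metadata per dataset and version."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS datasets ("
        "name TEXT, version TEXT, files INTEGER, "
        "PRIMARY KEY (name, version))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO datasets (name, version, files) "
        "VALUES (:name, :version, :files)",
        datasets,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
# Running the job twice updates the existing row instead of duplicating it
update_database(conn, [{"name": "mydb", "version": "1.0.0", "files": 10}])
update_database(conn, [{"name": "mydb", "version": "1.0.0", "files": 12}])
rows = conn.execute("SELECT name, version, files FROM datasets").fetchall()
```

Because the update is idempotent, the job can simply be rerun on a schedule without tracking which datasets it has already seen.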