Add support for TXT files as media files

hagenw commented 5 months ago

Motivated by https://github.com/audeering/audformat/issues/376#issuecomment-2076772215

We want to support publishing databases, that do contain media files, that don't support audio/video related metadata (e.g. sampling rate). In audformat we can store such data anyway, the only problem was so far the dependency table in audb. Because when publishing a database, audb.core.publish._media_values() was trying to run audiofile.sampling_rate(file) on all media files, which obviously fails for non audio/video files.

There are two ways to solve this:

Catch the RuntimeError thrown by audiofile.sampling_rate(file) and treat the file automatically as non audio/video file
Set explicitly which non audio/video files should be supported

I have opted for the first solution at the moment, as in audformat we also support every file extension at the moment. An implementation of the second approach can be inspected at d6e9c7f.

In audb.load() we now raise a RuntimeError when a database is loaded with a requested flavor (e.g. sampling_rate=16000), but contains text files. From the docstring of audb.load():

Here the question is if we should change this, and instead of raising an error, only convert files to the requested flavor that we can convert. On the other hand, a user might expect that when using sampling_rate=16000 the loaded database will contain only files, that can be passed on to a model working with audio files provided in a sampling rate of 16000 Hz. For that reason, it felt saver to me, to raise an error.

maxschmitt commented 5 months ago

For that reason, it felt saver to me, to raise an error.

I agree that an error should be raised when loading a text dataset with audio-specific non-default parameters.

hagenw commented 5 months ago

I am just wondering if it is possible at the moment to have datasets with mixed audio and text media files (or if it should be supported).

This is possible. To make it more explicit, I extended the test to also publish a mixed database. The only problem is, that when loading such a database and requesting a flavor (e.g. sampling_rate=16000), you will get a RuntimeError as discussed above.

maxschmitt commented 5 months ago

This is possible. To make it more explicit, I extended the test to also publish a mixed database. The only problem is, that when loading such a database and requesting a flavor (e.g. sampling_rate=16000), you will get a RuntimeError as discussed above.

I see, we might not want the warning for text files in mixed databased, but there is no simple fix for that as the files are checked independently, I guess.

hagenw commented 5 months ago

This is possible. To make it more explicit, I extended the test to also publish a mixed database. The only problem is, that when loading such a database and requesting a flavor (e.g. sampling_rate=16000), you will get a RuntimeError as discussed above.

I see, we might not want the warning for text files in mixed databased, but there is no simple fix for that as the files are checked independently, I guess.

Yes, the files are checked independently, and a RuntimeError is raised as soon as one file is a text file. The alternative would be to silently skip all text files, or maybe to raise an error if only text files are present.

maxschmitt commented 5 months ago

The alternative would be to silently skip all text files, or maybe to raise an error if only text files are present.

Both makes sense if this is easy to fix.

I am just thinking of the case where a user loads a mixed dataset and thinks that there is only audio and then runs into an error when iterating over the text media files.

ChristianGeng commented 5 months ago

Yes, the files are checked independently, and a RuntimeError is raised as soon as one file is a text file. The alternative would be to silently skip all text files, or maybe to raise an error if only text files are present.

I have not studied the implementation in detail yet, so apologies in case this is an unmotivated comment. My understanding is that this is the situation:

	Content	expected behavior
(a)	contains media only	as before flavor ok
(b)	contains text only	errors when flavored
(c)	contains media and text	Alternatives in MR: error out
(d)	contains neither media or text	can this happen?

I believe that (c) is the problematic case.

From an api perspective, would not an implementation that is akin to what pandas often does, - e.g. here - be a good way to proceed:

errors{‘ignore’, ‘raise’, ‘coerce’}, default ‘raise’

Of course we cannot coerce, but apart from that, would be having default raise as implemented, and the possibility to ignore , i.e. proceed with conversion where possible be extremely hard to implement?

maxschmitt commented 5 months ago

(d) contains neither media or text can this happen?

This is what we currently do in the our text databases (as publishing non-audio/video files does not work). All samples/text is handled by misc tables.

hagenw commented 5 months ago

For case (d) of an empty database, we do indeed not raise an error, e.g.

import audb
import audeer
import audformat

build_dir = "./build"
audeer.rmdir(build_dir)
audeer.mkdir(build_dir)

db = audformat.Database("mydb")
db.save(build_dir)

host = audeer.path("./host")
repo = "repo"
audeer.rmdir(host)
audeer.mkdir(host, repo)
repository = audb.Repository(repo, host, "file-system")
audb.publish(build_dir, "1.0.0", repository)

audb.config.REPOSITORIES = [repository]
db = audb.load("mydb", version="1.0.0", sampling_rate=16000)

hagenw commented 5 months ago

From an api perspective, would not an implementation that is akin to what pandas often does, - e.g. here - be a good way to proceed:
errors{‘ignore’, ‘raise’, ‘coerce’}, default ‘raise’
Of course we cannot coerce, but apart from that, would be having default raise as implemented, and the possibility to ignore , i.e. proceed with conversion where possible be extremely hard to implement?

This sounds like a good solution. But we might think about a different name than errors, as when you get an error during conversion from mp4 to wav, because you don't have installed ffmpeg it should never continue without the conversion.

I will update the code tomorrow accordingly.

hagenw commented 5 months ago

Unfortunately, requesting a flavor influences also where the files are cached, and if the files in the index are renamed, etc. So it is not straightforward to change it.

For now I would propose the following behavior (as currently implemented):

	Content	expected behavior
(a)	contains media only	as before flavor ok
(b)	contains text only	errors when flavored
(c)	contains media and text	errors when flavored
(d)	contains neither media or text	as before flavor ok

If we want to have an option to request a flavor for a mixed dataset, we should open an issue for it, and work on it in a follow up pull request.

hagenw commented 5 months ago

Before merging this here, I would like to discuss one additional change we might want to consider here. At the moment we store in the dependency table the type of file, where type refers to

https://github.com/audeering/audb/blob/069cc042341ef244df6068651b518c439680b7b8/audb/core/define.py#L64-L69

We could add another value here for media files that are not video or audio, which might also help with how we might pick a meaningful example for a dataset (https://github.com/audeering/audbcards/issues/89).

I opted against this solution, as for me all files, that are not attachments and tables, in a database are media files (e.g. all are handled by the media argument in audb.load()). And I think that we should not add another type, but maybe you have another opinion?

hagenw commented 5 months ago

I thought about my last comment, and I would indeed not want to introduce another define.DependType for text files, as MEDIA refers in general to all files that are stored as data, that is listed in an index of some table. This is generic, and should be independent of the actual file format of the data.

@ChristianGeng I think it is fine to add this as "experimental" feature to the new audb release, we are going to do on Friday. Would this be fine with you as well?

ChristianGeng commented 5 months ago

I thought about my last comment, and I would indeed not want to introduce another define.DependType for text files, as MEDIA refers in general to all files that are stored as data, that is listed in an index of some table. This is generic, and should be independent of the actual file format of the data.

@ChristianGeng I think it is fine to add this as "experimental" feature to the new audb release, we are going to do on Friday. Would this be fine with you as well?

Yes I am fine with it. The same holds for the handling of "mixed datasets" as suggested above. Both, but in particular the second point seem to be quite far off what can be seen the scope of this MR. My understanding is that both require more conceptualization on how to get them going as this is a diffiult topic that might even involve discussing the Flavor class.

hagenw commented 5 months ago

My understanding is that both require more conceptualization on how to get them going as this is a diffiult topic that might even involve discussing the Flavor class.

Yes exactly, and for that reason I would continue with the simple solution introduced here, so we can publish (toy) example datasets with text as media files to make it easier to discuss what else we need to change at a later stage.

ChristianGeng commented 5 months ago

My understanding is that both require more conceptualization on how to get them going as this is a diffiult topic that might even involve discussing the Flavor class.

Yes exactly, and for that reason I would continue with the simple solution introduced here, so we can publish (toy) example datasets with text as media files to make it easier to discuss what else we need to change at a later stage.

Therefore it's approved now :-)

audeering / audb

Add support for TXT files as media files #392