audeering / audbcards

Data cards for audio datasets
https://audeering.github.io/audbcards/
Other
0 stars 0 forks source link

Use system-wide audb cache #55

Closed ChristianGeng closed 7 months ago

ChristianGeng commented 7 months ago

Closes #53

ChristianGeng commented 7 months ago

Currently the tests are failing here. The reason is that the CI user probably does not have a home directory. One obvious fix would be to set the cache variables in the github action as a environment variables.

I would have questions on how cache variables are handled in audb. For me audb.config.CACHE_ROOT is not overriden by the environment variable, but the audb.default_cache_root function uses the env variables. Is this the desired behavior?

"""

Assumption: env variables exported

export AUDB_CACHE_ROOT=/new/local/cache/audb
export AUDB_SHARED_CACHE_ROOT=/new/shared/cache/audb

"""
import audb
import os

cache_root_from_env = os.getenv('AUDB_CACHE_ROOT')
print("cache root env variable: ", cache_root_from_env)
shared_cache_root_from_env = os.getenv('AUDB_SHARED_CACHE_ROOT')
print("shared cache env variable value: ", shared_cache_root_from_env)

# use definitions do not reflect env variables
print("audb config cache root:", audb.config.CACHE_ROOT)
print("audb config shared cache root: ", audb.config.SHARED_CACHE_ROOT)

# when using cache methods it consumes from environment variables:
print("audb use method: " , audb.default_cache_root(shared=False))
print("audb use method: " , audb.default_cache_root(shared=True))

This results in

cache root env variable:  None
shared cache env variable value:  None
audb config cache root: /data/audb
audb config shared cache root:  /data/audb
audb use method:  /home/audeering.local/cgeng/Data/Data_HDD/audb
audb use method:  /home/audeering.local/cgeng/Data/Data_HDD/audb

One could of course create an audb.yaml in the github action, but I thought it would make sense to ask first.

hagenw commented 7 months ago

I would have questions on how cache variables are handled in audb. For me audb.config.CACHE_ROOT is not overriden by the environment variable, but the audb.default_cache_root function uses the env variables. Is this the desired behavior?

Yes, this is desired behavior.

Currently, we have a fixture in tests/conftest.py that sets the audb cache for the tests:

@pytest.fixture
def audb_cache(tmpdir, scope="session", autouse=True):
    """Local audb cache folder."""
    cache = audeer.mkdir(audeer.path(tmpdir, "audb-cache"))
    audb.config.CACHE_ROOT = cache
    audb.config.SHARED_CACHE = cache

So, there shouldn't be any reason why it tries to use the home folder (which should also not fail in the CI pipeline).

hagenw commented 7 months ago

The reported error seems also a real error to me:

audbcards/core/datacard.py:199: in player
    shutil.copy(
../../../hostedtoolcache/Python/3.10.13/x64/lib/python3.10/shutil.py:417: in copy
    copyfile(src, dst, follow_symlinks=follow_symlinks)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

src = '~/audb/medium_db/1.0.0/d3b62a9b/data/f0.wav'

as copyfile cannot handle ~. We need to use audeer.path() and wrap it around src.

hagenw commented 7 months ago

I guess the error can be fixed when replacing

        media_src_dir = ( 
            f"{self.dataset.cache_root}/"
            f"{audb.flavor_path(self.dataset.name, self.dataset.version)}"
        )   

with

        media_src_dir = ( 
            f"{audb.default_cache_root()}/"
            f"{audb.flavor_path(self.dataset.name, self.dataset.version)}"
        )   
ChristianGeng commented 7 months ago

Replacing audb.config.CACHE_ROOT with with audb.default_cache_root() was sufficient to get back on track. Thanks a lot. I was unaware and it was also counterintuitive for me that audb.config.CACHE_ROOT does not relate to the set environment variable and that audb.default_cache_root() must be used instead.

hagenw commented 7 months ago

The idea is that an environment variable can overwrite a config entry. But this does not mean that it can change the config entry, which would feel strange to me. Currently we have the following hierarchy when looking for the cache root (see https://audeering.github.io/audb/caching.html#changing-cache-locations):

  1. cache_root argument during a function call
  2. system-wide by setting environment variables
  3. program-wide by changing audb.config.CACHE_ROOT
  4. system-wide by using the configuration file ~/.audb.yaml (which indeed overwrites the default values of audb.config.CACHE_ROOT
hagenw commented 7 months ago

But I guess, it differs what feels natural to a user. That's the reason why we also have audb.default_cache_root() which always returns the cache root as set by the config file, variable, or environment variable.

ChristianGeng commented 7 months ago

But I guess, it differs what feels natural to a user. That's the reason why we also have audb.default_cache_root() which always returns the cache root as set by the config file, variable, or environment variable.

The order that you mention does not seem awkward. There is the programmatic manipulation that is not included here. The bit that confused me initially that the env variable does not change the config entry.

But now that I know it, I am of course fine with it. I hadn't heard about audb.default_cache_root() before.

So I would go ahead and merge this.