Caching of existing files

dhaitz commented 8 years ago

When lumi/pileup files are already present in the repository, the files are still queried as they are not in the local cache. Is this behaviour on purpose? IMHO if someone else has checked in the files it would be convenient to simply use them

maxfischer2781 commented 8 years ago

This is unintended. The intention is for files to be considered regardless where they come from, unless they are outdated. The cache metadata is stored in a git-friendly way on purpose.

I'm guessing this is caused by storing dependencies as absolute paths. Note that the cache will report why it needs regeneration if its log level is set low enough. Running excalibur.py with --log-level cache:info core:info conf:info should allow debugging this.

dhaitz commented 8 years ago

Hi Max, thanks for the quick reply! I've found that the cache does not check the presence of the actually needed (lumi/pileup) files, but the cache (.pkl) file. Also, because of https://github.com/artus-analysis/Excalibur/blob/master/cfg/python/configutils.py#L199 the .pkl lumi files are stored in git in data/json/, while the pileup .pkl files are in cache/. I'm not sure whether the .pkl files are supposed to be in git ... Personally I'd put them in cache/ and have the caching mechanism look for the actual pu/lumi files

maxfischer2781 commented 8 years ago

The .pkl files must be in git as well. It includes the metadata of the cache. When invoked, the cache must check the current state of dependencies against the old state of dependencies. This old state is stored in the .pkl. Implicitly, the cache does check the JSON - both original and derived JSONs are dependencies (e.g. here and here). If the .pkl does not exist, the cache will break at this point already, before checking the JSON.

I agree that having some things in cache/ and others in data/ is not clean. Especially figuring out what to commit and what not. I'll write up how this could be resolved soon.

maxfischer2781 commented 8 years ago

If somebody has the time (*hint*hint) there's an alternative: have a global and local cache mode. The main problem is that putting cache/generated files into git is very messy. Both git and cache do a kind of VCS, but the first is persistent while the later is volatile. What is actually needed is a way to share caches.

Local: all .pkl and automatically generated files go to $(pwd)/cache/<subdir>, e.g. /usr/users/mfischer/Excalibur/cache/data/lumi. This is the default, working as now. Everybody has his own cache. One can copy (or commit) it between users if needed.
Global: all .pkl and automatically generated files go to $(EXCALIBURCACHE)/<subdir>, e.g. /storage/a/dhaitz/excalibur_cache/data/lumi. This is activated by defining EXCALIBURCACHE. This acts as a shared cached: someone populates it (w/ CERN accounts), then everybody can use it (w/o CERN accounts). Once can copy it between hosts if needed.

The functionality for this is already there, but it needs cleanup and consolidation. Required changes:

cached_query expects and cache_dir as a relative path. E.g., the default path of RunJSON would be just os.path.join("data", "json").
- It might suffice to just do cache_dir = os.path.relpath(cache_dir) at the start of cached_query. I strongly advice against it, to keep cache magic to a minimum.
cached_query expects dependency_files and dependency_folders as absolute paths, relative paths, or beginning with "%(cache_dir)s". The later must be used when files are generated. E.g. generated JSONs would now go to os.path.join("%(cache_dir)s", self._store_path, os.path.basename(json_name))
- Using relative paths with a global cache or absolute paths inside cwd with a local cache may be ill-defined (aka useless). See below.
cached_query performs a switch between global and local cache at the beginning.
- If global, convert relative paths to absolute. If local, convert absolute paths to relative if a subpath of os.getcwd(). Paths using cache_dir should be resolved dynamically. I guess it should look something like this:

def cached_query(*args, **kwargs):
    try:
        cache_dir = os.path.join(os.environ['EXCALIBURCACHE'], cache_dir)
    except KeyError:  # thrown by os.environ
        # local cache
        cache_dir = os.path.join(getPath(), "cache", cache_dir)
        # convert *true* relative paths
        dependency_files = [get_relsubpath(dpath) for dpath in dependency_files]
        dependency_folders = [get_relsubpath(dpath) for dpath in dependency_folders]
    else:  # this triggers if no exception is raised
        # global cache
        # purge relative paths
        dependency_files = [os.path.abspath(dpath) for dpath in dependency_files]
        dependency_folders = [os.path.abspath(dpath) for dpath in dependency_folders]
    # define this once cache_dir is set
    def stat_file(file_path):
        """Get a comparable representation of file validity"""
        try:
            file_stat = os.stat(file_path % {'cache_dir': cache_dir})  # resolve cache_dir dynamically
            return file_stat.st_size, file_stat.st_mtime
        except OSError:
            return -1, -1
    # run the rest of cached_query
    # ...

I guess that should do the trick, but there's only so much coffee in one day...

dhaitz commented 8 years ago

For now, I stuck to a simple fix :) https://github.com/artus-analysis/Excalibur/commit/6b88c754 more advanced solutions maybe after holiday

maxfischer2781 commented 8 years ago

Since both Dominik and me are no longer at EKP, please let me know if there is help required on this. I don't have a working test environment anymore.

KIT-CMS / Excalibur

Caching of existing files #49