Open dhaitz opened 8 years ago
This is unintended. The intention is for files to be considered regardless where they come from, unless they are outdated. The cache metadata is stored in a git-friendly way on purpose.
I'm guessing this is caused by storing dependencies as absolute paths. Note that the cache will report why it needs regeneration if its log level is set low enough. Running excalibur.py with --log-level cache:info core:info conf:info should allow debugging this.
Hi Max, thanks for the quick reply! I've found that the cache does not check for the presence of the actually needed (lumi/pileup) files, but only for the cache (.pkl) file. Also, because of https://github.com/artus-analysis/Excalibur/blob/master/cfg/python/configutils.py#L199 the .pkl lumi files are stored in git in data/json/, while the pileup .pkl files are in cache/. I'm not sure whether the .pkl files are supposed to be in git ... Personally I'd put them in cache/ and have the caching mechanism look for the actual pu/lumi files.
The .pkl files must be in git as well: they contain the metadata of the cache. When invoked, the cache must check the current state of its dependencies against their old state; this old state is stored in the .pkl. Implicitly, the cache does check the JSON - both original and derived JSONs are dependencies (e.g. here and here). If the .pkl does not exist, the cache will break at this point already, before checking the JSON.

I agree that having some things in cache/ and others in data/ is not clean, especially when it comes to figuring out what to commit and what not. I'll write up how this could be resolved soon.
If somebody has the time (*hint* *hint*) there's an alternative: have a global and a local cache mode. The main problem is that putting cached/generated files into git is very messy. Both git and the cache do a kind of version control, but the former is persistent while the latter is volatile. What is actually needed is a way to share caches.
- Local cache: .pkl and automatically generated files go to $(pwd)/cache/<subdir>, e.g. /usr/users/mfischer/Excalibur/cache/data/lumi. This is the default, working as now. Everybody has their own cache. One can copy (or commit) it between users if needed.
- Global cache: .pkl and automatically generated files go to $(EXCALIBURCACHE)/<subdir>, e.g. /storage/a/dhaitz/excalibur_cache/data/lumi. This is activated by defining EXCALIBURCACHE. It acts as a shared cache: someone populates it (w/ CERN accounts), then everybody can use it (w/o CERN accounts). One can copy it between hosts if needed.

The functionality for this is already there, but it needs cleanup and consolidation. Required changes:
- Specify cache_dir as a relative path. E.g., the default path of RunJSON would be just os.path.join("data", "json").
- One could instead do cache_dir = os.path.relpath(cache_dir) at the start of cached_query, but I strongly advise against it, to keep cache magic to a minimum.
- Allow dependency_files and dependency_folders as absolute paths, relative paths, or paths beginning with "%(cache_dir)s". The latter must be used when files are generated. E.g. generated JSONs would now go to os.path.join("%(cache_dir)s", self._store_path, os.path.basename(json_name)).
- The cwd with a local cache may be ill-defined (aka useless), so paths should not be resolved via os.getcwd(); paths using cache_dir should be resolved dynamically instead.

I guess it should look something like this:

```python
def cached_query(cache_dir, dependency_files, dependency_folders, *args, **kwargs):
    try:
        # global cache: prefix with the shared EXCALIBURCACHE directory
        cache_dir = os.path.join(os.environ['EXCALIBURCACHE'], cache_dir)
    except KeyError:  # thrown by os.environ if EXCALIBURCACHE is not defined
        # local cache
        cache_dir = os.path.join(getPath(), "cache", cache_dir)
        # convert *true* relative paths
        dependency_files = [get_relsubpath(dpath) for dpath in dependency_files]
        dependency_folders = [get_relsubpath(dpath) for dpath in dependency_folders]
    else:  # this triggers if no exception is raised
        # global cache: purge relative paths
        dependency_files = [os.path.abspath(dpath) for dpath in dependency_files]
        dependency_folders = [os.path.abspath(dpath) for dpath in dependency_folders]

    # define this once cache_dir is set
    def stat_file(file_path):
        """Get a comparable representation of file validity"""
        try:
            file_stat = os.stat(file_path % {'cache_dir': cache_dir})  # resolve cache_dir dynamically
            return file_stat.st_size, file_stat.st_mtime
        except OSError:
            return -1, -1

    # run the rest of cached_query
    # ...
```
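The dynamic "%(cache_dir)s" resolution used in stat_file above is plain Python %-formatting with a dict. A tiny standalone illustration (the paths are made up):

```python
def resolve(path, cache_dir):
    """Substitute the cache_dir placeholder at lookup time, so the same
    stored dependency path works for both a local and a shared cache."""
    return path % {'cache_dir': cache_dir}


# A "%(cache_dir)s" prefix marks generated files; plain paths pass through
# unchanged because they contain no placeholder.
resolve("%(cache_dir)s/data/json/derived.json", "/storage/a/user/excalibur_cache")
# -> '/storage/a/user/excalibur_cache/data/json/derived.json'
resolve("/abs/path/Cert_13TeV.json", "/storage/a/user/excalibur_cache")
# -> '/abs/path/Cert_13TeV.json'
```

Because the substitution happens inside stat_file rather than when the metadata is written, the same .pkl stays valid no matter which cache location is active.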
I guess that should do the trick, but there's only so much coffee in one day...
For now, I stuck to a simple fix :) https://github.com/artus-analysis/Excalibur/commit/6b88c754. More advanced solutions maybe after the holidays.
Since both Dominik and I are no longer at EKP, please let me know if help is required on this. I don't have a working test environment anymore.
When lumi/pileup files are already present in the repository, the files are still queried because they are not in the local cache. Is this behaviour on purpose? IMHO, if someone else has checked in the files, it would be convenient to simply use them.