Prefer newer JSON cache files to older

takluyver commented 5 months ago

The 'run map' JSON cache is used to speed up opening a run when possible, by listing the sources & trains to be found in each file. There are a couple of possible locations for this: it can be inside the run folder, where only privileged processes can write, or in a hidden world-writable folder under proposal scratch. In master, a cache in the run folder is preferred both for reading & writing.

In some cases, the calibration pipeline creates a partial cache file in the run folder, by opening a subset of files with the include= parameter. When someone later opens the whole run folder, a complete cache file is written in scratch, but then this is never used, because the one in the run folder takes priority.

This change looks for the newest available run map file when reading.
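As a minimal sketch of the idea (not the code from this branch), choosing the newest candidate could look like the following. The function name `newest_cache_file` and the file names are hypothetical, and using the modification time to decide which file is "newest" is an assumption made for illustration.

```python
import os
import tempfile
import time


def newest_cache_file(candidates):
    """Return the existing candidate path with the latest mtime, or None.

    Hypothetical helper for illustration; not part of EXtra-data's API.
    """
    best_path, best_mtime = None, None
    for path in candidates:
        try:
            mtime = os.stat(path).st_mtime
        except OSError:
            # Missing or unreadable candidates are simply skipped
            continue
        if best_mtime is None or mtime > best_mtime:
            best_path, best_mtime = path, mtime
    return best_path


# Demo with two stand-in cache files in a temporary directory
with tempfile.TemporaryDirectory() as td:
    run_dir_cache = os.path.join(td, "in_run_folder.json")
    scratch_cache = os.path.join(td, "in_scratch.json")
    for p in (run_dir_cache, scratch_cache):
        with open(p, "w") as f:
            f.write("{}")
    now = time.time()
    os.utime(run_dir_cache, (now - 3600, now - 3600))  # older (e.g. partial)
    os.utime(scratch_cache, (now, now))                # newer (e.g. complete)
    chosen = newest_cache_file([run_dir_cache, scratch_cache])
    print(os.path.basename(chosen))
```

With this rule, an older partial file in the run folder no longer shadows a newer complete one in scratch.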

Testing: I was pointed to p5733 r120 as an example. Using the script below, I get a minimum of about 0.4 s to open the run on master, and around 0.1 s on this branch.

Using strace, I can also see that it opens 320 .h5 files on master, and 0 on this branch, because the information is all cached.

import time

from extra_data import open_run

t0 = time.perf_counter()
run = open_run(5733, 120, data='proc')
t1 = time.perf_counter()

print(f"{len(run.train_ids)} trains")
print(f"{t1 - t0:.3f} s to open")

philsmt commented 5 months ago

I can imagine this was a tricky one. LGTM for the implementation, apart from an aesthetic question.

Should we avoid creating the partial cache in calibration, since it hardly serves a purpose?

takluyver commented 5 months ago

> Should we avoid creating the partial cache in calibration, since it hardly serves a purpose?

I think it still speeds up the first full open, because it should use the cache for the files in it. But it would arguably be tidier not to create an incomplete cache file.
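A rough sketch of why a partial cache can still help, under simplifying assumptions: the cache is modelled as a plain dict keyed by file name, and `split_by_cache` is a hypothetical helper, not a function in EXtra-data. A real cache would also need to check that its entries are still valid for the files on disk, which is omitted here.

```python
def split_by_cache(file_names, cached_entries):
    """Separate files covered by the cache from files that must be opened.

    Hypothetical illustration: `cached_entries` maps file name -> cached info.
    """
    cached = {n: cached_entries[n] for n in file_names if n in cached_entries}
    to_open = [n for n in file_names if n not in cached_entries]
    return cached, to_open


# Partial cache written while only a subset of the files was opened
partial_cache = {"a.h5": {"train_ids": [1, 2, 3]}}
all_files = ["a.h5", "b.h5", "c.h5"]

cached, to_open = split_by_cache(all_files, partial_cache)
print(f"{len(cached)} from cache, {len(to_open)} to open")
```

Only the files missing from the cache need to be opened, so the first full open does less work than it would with no cache at all.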

European-XFEL / EXtra-data

Prefer newer JSON cache files to older #524