Caching file contents for runs

takluyver commented 5 years ago

Opening a run is slow: we read a little bit of data from each file to know what sources and trains are in that file. This appears to be especially bad if the files are not already cached on that node: it seems like gpfs moves big chunks of data around. But it's slow even when the data is cached locally.

Testing with lsxfel /gpfs/exfel/exp/SCS/201901/p002212/raw/r0070 (~400 files, 1.7 TB) , I got roughly:

4 minutes worst case when gpfs has to transfer data onto the node
21 seconds reading data from each file when they've already been cached locally
1.6 seconds with the cache file implemented in the first commit here
1.4 seconds with the cache file plus lazy opening of HDF5 files (second commit)

I was hoping that lazy opening would give a further substantial speedup. It doesn't seem to, so I might still revert it. But I want to test more if it makes a difference when files are not cached locally.

The worst case is unchanged: if the cache file isn't there it still reads metadata from each file. But hopefully that only needs to happen once. Multiple users can share the cache file - it just needs someone with write access to the proposal's scratch dir to create it.

tmichela commented 5 years ago

I haven't checked the code yet, but do you think it would make sense to have a service scanning all new runs and writing the cache somewhere readable by everybody, like software?

takluyver commented 5 years ago

Yeah, once we're happy with how it operates, it would be good to run it automatically on every new run. I'd say the ideal would be to write the map into the run directory itself, because then it doesn't matter which path you access it through, e.g. /gpfs/exfel/exp/... vs /gpfs/exfel/d/raw/....

I'm also still thinking about different formats - e.g. can we read an HDF5 file faster than JSON?

takluyver commented 5 years ago

HDF5 is not faster. I got load times of roughly:

JSON: 40 ms (including converting train IDs to a numpy array, and source lists to frozensets)
HDF5: 400 ms (also tried with the 'core' driver which reads the file into memory first - about the same as using the defaults)
Pickle: 2 ms

Pickle also produces the smallest files (example 1.9 MB, vs 2.7 MB HDF5 or 3.8 MB JSON). But of course the files are not as easily inspectable as JSON.

For lsxfel, any performance difference is eclipsed by import time - I'm preparing another PR to improve this, but it will still be on the order of 300 ms.

tmichela commented 5 years ago

Yeah, once we're happy with how it operates, it would be good to run it automatically on every new run. I'd say the ideal would be to write the map into the run directory itself, because then it doesn't matter which path you access it through, e.g. /gpfs/exfel/exp/... vs /gpfs/exfel/d/raw/....

Yes, that would be best to have it in the run directory itself, but this has additional complications... :) But if we can do this in the end, maybe it makes more sens to have it in HDF5 format, even if it is not the best format for this. I also remember that IT wanted to have such a file containing only run metadata, do you have any new on this?

takluyver commented 5 years ago

I emailed them a couple of weeks ago to ask about that, and suggest that if we write the code they could set it up to run on each run. Haven't heard anything back.

I think the performance difference is big enough that I'd prefer to avoid HDF5 for this purpose unless someone makes it a hard requirement. Unless you can see a more efficient way to use it - I've put the code I've been playing with in a gist, and there are sample files in /gpfs/exfel/exp/SCS/201901/p002212/scratch/.karabo_data_maps.

tmichela commented 5 years ago

LGTM

takluyver commented 5 years ago

Not merging this yet, because:

I'm still considering putting the detector info into the cache: on the plus side, it saves opening and reading data from even one HDF5 file, but it does introduce an extra level of complexity into the cache for a specific purpose (lsxfel, not every time a run is opened).
I want to ensure @fangohr has time to properly consider different format options before we merge this.

takluyver commented 5 years ago

From discussion with @fangohr this morning:

[x] Extend karabo-data-validate to check files map
[x] Some way to show timing information while using karabo_data (logging?)
[x] Document location & contents of map files in case they go wrong.

takluyver commented 5 years ago

It's working with getting an fd from HDF5.

However, while testing it I ran into a limitation that I thought wouldn't really matter. If you make a symlink to access data directories more conveniently, it doesn't recognise that it's a standard data directory, so it doesn't use the cache. I have two ideas at the moment to solve this:

Assume any path which ends like /raw/r0123 or /proc/r1234 is part of a proposal directory, and use (prop_dir)/scratch/.karabo_data_maps relative to those.
Resolve any symlinks with os.path.realpath() and check against the real data locations on Maxwell. This is unfortunate because the real paths change (from /gpfs/exfel/d/... to /pnfs/xfel.eu/...), and they're meant to be an internal implementation detail.

tmichela commented 5 years ago

If we resolve the simlink, the path should always be .../(raw|proc)/INSTRUMENT/CYCLE/PROPOSAL/RUN, right?

If that's the case, solution 2. has my favor. We don't really care if the location change as long as the end of the resolved path does not change.

tmichela commented 5 years ago

e.g.

/pnfs/xfel.eu/exfel/archive/XFEL/proc/SPB/201701/p002012/r0001

/gpfs/exfel/d/raw/SPB/201901/p002316/r0001

takluyver commented 5 years ago

That's an improvement, but I think it still requires hardcoding an absolute path to find the scratch directory where we store the cache - because scratch stays on gpfs even when the data moves to /pnfs.

takluyver commented 5 years ago

This seems to be working.

The p002212 raw files we were testing with have been migrated to dCache today. It seems like moving data to dCache changes the file mtimes, so the cache has to be regenerated after that. That's unfortunate, but we can live with it.

If we can get the cache file generated whenever a run is written, we'll have to ensure it's regenerated when the data is moved to dcache, otherwise the cache file will get copied across with the wrong mtimes and then we'll never use it. :-(

tmichela commented 5 years ago

If we can get the cache file generated whenever a run is written, we'll have to ensure it's regenerated when the data is moved to dcache, otherwise the cache file will get copied across with the wrong mtimes and then we'll never use it. :-(

That's something we should clarify with IT, but if the cache files are part of the run directory, then it cannot be modified or deleted and there is no need for checking mtime anymore, right? (this only applies for raw data of course)

takluyver commented 5 years ago

In principle, yes. I'm wary of writing any cache implementation that assumes the data files could never chnage without the cache being invalidated. But maybe we can make a pragmatic compromise.

tmichela commented 5 years ago

LGTM!

European-XFEL / karabo_data

Caching file contents for runs #206