I haven't checked the code yet, but do you think it would make sense to have a service scanning all new runs and writing the cache somewhere readable by everybody, like software?
Yeah, once we're happy with how it operates, it would be good to run it automatically on every new run. I'd say the ideal would be to write the map into the run directory itself, because then it doesn't matter which path you access it through, e.g. /gpfs/exfel/exp/... vs /gpfs/exfel/d/raw/....
I'm also still thinking about different formats - e.g. can we read an HDF5 file faster than JSON?
HDF5 is not faster. I got load times of roughly:
Pickle also produces the smallest files (example 1.9 MB, vs 2.7 MB HDF5 or 3.8 MB JSON). But of course the files are not as easily inspectable as JSON.
For lsxfel, any performance difference is eclipsed by import time - I'm preparing another PR to improve this, but it will still be on the order of 300 ms.
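For what it's worth, the timing was along these lines (a rough sketch; the file names are placeholders, not the actual sample files):

```python
# Rough load-time comparison; run_map.json / run_map.pkl are placeholder names.
import json
import pickle
import time

def timed(label, loader):
    t0 = time.perf_counter()
    loader()
    print(f"{label}: {time.perf_counter() - t0:.3f} s")

with open("run_map.json") as f:
    timed("JSON", lambda: json.load(f))
with open("run_map.pkl", "rb") as f:
    timed("Pickle", lambda: pickle.load(f))
```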
Yeah, once we're happy with how it operates, it would be good to run it automatically on every new run. I'd say the ideal would be to write the map into the run directory itself, because then it doesn't matter which path you access it through, e.g. /gpfs/exfel/exp/... vs /gpfs/exfel/d/raw/....
Yes, it would be best to have it in the run directory itself, but that has additional complications... :) But if we can do it in the end, maybe it makes more sense to have it in HDF5 format, even if it is not the best format for this. I also remember that IT wanted to have such a file containing only run metadata, do you have any news on this?
I emailed them a couple of weeks ago to ask about that, and suggest that if we write the code they could set it up to run on each run. Haven't heard anything back.
I think the performance difference is big enough that I'd prefer to avoid HDF5 for this purpose unless someone makes it a hard requirement. Unless you can see a more efficient way to use it - I've put the code I've been playing with in a gist, and there are sample files in /gpfs/exfel/exp/SCS/201901/p002212/scratch/.karabo_data_maps.
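To give a feel for what the map contains, here's an illustrative entry (the field names are made up for this sketch; the real structure is whatever the code in the gist writes):

```python
# Illustrative only: the real structure is defined by the code in the gist.
import json

example_map = [
    {
        "filename": "RAW-R0070-DA01-S00000.h5",          # hypothetical file name
        "mtime": 1556000000.0,                           # to detect stale entries
        "trains": [100000001, 100000002],                # train IDs in the file
        "sources": ["SCS_DET_DSSC1M-1/DET/0CH0:xtdf"],   # hypothetical source name
    },
]

with open("run_map.json", "w") as f:
    json.dump(example_map, f, indent=2)
```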
LGTM
Not merging this yet, because:
From discussion with @fangohr this morning:
It's working with getting an fd from HDF5.
However, while testing it I ran into a limitation that I thought wouldn't really matter. If you make a symlink to access data directories more conveniently, it doesn't recognise that it's a standard data directory, so it doesn't use the cache. I have two ideas at the moment to solve this:
1. Check whether the /raw/r0123 or /proc/r1234 directory is part of a proposal directory, and use (prop_dir)/scratch/.karabo_data_maps relative to those.
2. Use os.path.realpath() and check against the real data locations on Maxwell. This is unfortunate because the real paths change (from /gpfs/exfel/d/... to /pnfs/xfel.eu/...), and they're meant to be an internal implementation detail.

If we resolve the symlink, the path should always be .../(raw|proc)/INSTRUMENT/CYCLE/PROPOSAL/RUN, right?
If that's the case, solution 2 has my favor. We don't really care if the location changes as long as the end of the resolved path does not change.
e.g.
/pnfs/xfel.eu/exfel/archive/XFEL/proc/SPB/201701/p002012/r0001
/gpfs/exfel/d/raw/SPB/201901/p002316/r0001
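Something like this sketch is what I'm picturing (the regex is my guess at the convention and would need checking against edge cases):

```python
import os
import re

# Matches .../(raw|proc)/INSTRUMENT/CYCLE/PROPOSAL/RUN at the end of a path.
RUN_DIR_RE = re.compile(r"/(raw|proc)/([^/]+)/([^/]+)/(p\d+)/(r\d+)/?$")

def identify_run(path):
    real = os.path.realpath(path)   # resolve any symlinks
    m = RUN_DIR_RE.search(real)
    if m is None:
        return None                 # not a standard run directory
    kind, instrument, cycle, proposal, run = m.groups()
    return kind, instrument, cycle, proposal, run

# e.g. identify_run("/pnfs/xfel.eu/exfel/archive/XFEL/proc/SPB/201701/p002012/r0001")
#  -> ('proc', 'SPB', '201701', 'p002012', 'r0001')
```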
That's an improvement, but I think it still requires hardcoding an absolute path to find the scratch directory where we store the cache - because scratch stays on gpfs even when the data moves to /pnfs.
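i.e. roughly this, with the /gpfs/exfel/exp prefix baked in (an assumed layout, based on the scratch path above):

```python
import os

# scratch stays on GPFS even when raw/proc data moves to /pnfs, so this
# prefix has to be hardcoded (assumed layout of proposal directories).
EXP_ROOT = "/gpfs/exfel/exp"

def map_cache_dir(instrument, cycle, proposal):
    # e.g. /gpfs/exfel/exp/SCS/201901/p002212/scratch/.karabo_data_maps
    return os.path.join(EXP_ROOT, instrument, cycle, proposal,
                        "scratch", ".karabo_data_maps")
```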
This seems to be working.
The p002212 raw files we were testing with have been migrated to dCache today. It seems like moving data to dCache changes the file mtimes, so the cache has to be regenerated after that. That's unfortunate, but we can live with it.
If we can get the cache file generated whenever a run is written, we'll have to ensure it's regenerated when the data is moved to dcache, otherwise the cache file will get copied across with the wrong mtimes and then we'll never use it. :-(
If we can get the cache file generated whenever a run is written, we'll have to ensure it's regenerated when the data is moved to dcache, otherwise the cache file will get copied across with the wrong mtimes and then we'll never use it. :-(
That's something we should clarify with IT, but if the cache files are part of the run directory, then they cannot be modified or deleted and there is no need to check the mtime anymore, right? (This only applies to raw data, of course.)
In principle, yes. I'm wary of writing any cache implementation that assumes the data files could never change without the cache being invalidated. But maybe we can make a pragmatic compromise.
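The pragmatic compromise could be a per-file mtime check, roughly like this sketch (reusing the illustrative map fields from above):

```python
import os

def cache_entry_valid(run_dir, entry):
    """Trust a cached entry only while the file keeps the recorded mtime."""
    try:
        st = os.stat(os.path.join(run_dir, entry["filename"]))
    except OSError:
        return False      # file missing or unreadable: don't trust the cache
    # Migration to dCache rewrites mtimes, so this check fails and we fall
    # back to reading the file's metadata again.
    return st.st_mtime == entry["mtime"]
```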
LGTM!
Opening a run is slow: we read a little bit of data from each file to know what sources and trains are in that file. This appears to be especially bad if the files are not already cached on that node: it seems like gpfs moves big chunks of data around. But it's slow even when the data is cached locally.
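Roughly, the per-file read is something like this sketch (the dataset paths are my shorthand for the file layout and may not match every format version):

```python
import h5py

def file_info(path):
    """Read just enough metadata to know which trains and sources a file holds."""
    with h5py.File(path, "r") as f:
        train_ids = f["INDEX/trainId"][:]                              # assumed location
        sources = [s.decode() for s in f["METADATA/dataSourceId"][:]]  # assumed location
    return train_ids, sources
```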
Testing with lsxfel /gpfs/exfel/exp/SCS/201901/p002212/raw/r0070 (~400 files, 1.7 TB), I got roughly:

I was hoping that lazy opening would give a further substantial speedup. It doesn't seem to, so I might still revert it. But I want to test more whether it makes a difference when files are not cached locally.
The worst case is unchanged: if the cache file isn't there it still reads metadata from each file. But hopefully that only needs to happen once. Multiple users can share the cache file - it just needs someone with write access to the proposal's scratch dir to create it.
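So the lookup logic ends up roughly like this sketch (cache_path would point into the proposal's scratch dir; build_map stands in for the slow per-file metadata scan):

```python
import json
import os

def load_or_build_map(run_dir, cache_path, build_map):
    """Use the shared cache when it is valid; otherwise rebuild by scanning."""
    if os.path.isfile(cache_path):
        with open(cache_path) as f:
            cached = json.load(f)
        try:
            # Trust the cache only if no file changed since it was written.
            unchanged = all(
                os.stat(os.path.join(run_dir, e["filename"])).st_mtime == e["mtime"]
                for e in cached
            )
        except OSError:
            unchanged = False
        if unchanged:
            return cached

    entries = build_map(run_dir)       # worst case: read metadata from every file
    try:
        with open(cache_path, "w") as f:
            json.dump(entries, f)      # needs write access to the scratch dir
    except OSError:
        pass                           # read-only users just skip writing
    return entries
```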