Thanks!
We're also working on speeding up opening a run by caching the information about what data is in each file (#206), so you only actually open the files when you try to get data from them. This should provide significant speed-ups, but of course the cache needs to be created first. We're hoping to get it created automatically when a run is written, but running anything automatically means getting a separate department involved.
I'd like to explore how easy that will be first, because if that all works, we can get the same performance benefits with no extra options needed - you could quickly open a run and then select the sources of interest with the existing APIs. But we should keep this in mind as a stopgap solution if it looks like it will take a long time to get the caches pre-populated.
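Roughly, the idea looks like this - a sketch only, not the real cache code or on-disk format (#206 has the actual implementation), and the metadata path inside the HDF5 files is an assumption:

```python
# Illustrative sketch only - NOT karabo_data's real cache code or format.
# Idea: record what each HDF5 file contains in a small JSON sidecar, so that
# opening a run later reads one small file instead of touching every .h5 file.
import json
import os

import h5py


def index_file(path):
    # 'METADATA/dataSourceId' is the metadata listing assumed here; real
    # files may differ between format versions.
    with h5py.File(path, 'r') as f:
        sources = [s.decode() for s in f['METADATA/dataSourceId'][:] if s]
    return {'file': os.path.basename(path), 'sources': sources}


def load_or_build_map(run_dir, cache_path):
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            return json.load(f)  # fast path: no HDF5 files opened
    run_map = [index_file(os.path.join(run_dir, name))
               for name in sorted(os.listdir(run_dir)) if name.endswith('.h5')]
    with open(cache_path, 'w') as f:
        json.dump(run_map, f)
    return run_map
```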
On further thought: let's do this. Getting the caches pre-populated will probably take a while, and in the meantime this is a substantial improvement for people who do know about the filenames and only want to access some of them. I think this is a case where "practicality beats purity".
I would recommend a couple of changes, though:

- Glob-style patterns (e.g. `include='*DSSC00*'`) rather than substring matches. We're already using them elsewhere in karabo_data, and they're a natural fit for selecting files, because that's what we usually use them for. Search the code for `fnmatch` to see how to check them efficiently (a sketch follows below this list). (Although it looks like a perfect fit, avoid the Python `glob` module in this case - we did use it in the past, but it silences permission errors.)
- `include`, not `exclude`. `exclude='*DSSC*'` should be equivalent to `include='*DA*'` - all data apart from the big detectors is saved through a Data Aggregator. I can't immediately see a realistic use case for `exclude` that can't be replaced by `include`.
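To make the `fnmatch` suggestion concrete, something like this - the function name and layout here are mine, not the pull request's actual code:

```python
# Sketch of glob-style matching with fnmatch (names are illustrative).
# fnmatch.translate compiles the pattern to a regex once, so testing many
# filenames stays cheap - and unlike the glob module, listing the directory
# ourselves means permission errors are not silently swallowed.
import fnmatch
import os
import re


def select_files(run_dir, include):
    pattern = re.compile(fnmatch.translate(include))
    return [os.path.join(run_dir, name)
            for name in sorted(os.listdir(run_dir))
            if pattern.match(name)]


select_files('/path/to/run', include='*DSSC00*')
```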
Feel free to push back on either of these if you think I'm overlooking something. :-)
Hi Thomas,
thanks for the feedback! I totally agree with your suggestions and updated the pull request accordingly. I'm very much looking forward to automatically cached run info, though :)
Cheers,
Michael
Thanks! One more nitpick: I think the matching should apply only to the filenames, not their paths. I can't think of a case where it would make a difference with the canonical paths on Maxwell, but it might if someone has copied some data to another location.
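For illustration, matching just the basename could look like this - the path and file name below are made up, though they follow the usual raw-file naming scheme:

```python
# Matching only the filename keeps patterns independent of where data lives
# (the path and file name here are invented for illustration):
import fnmatch
import os


def filename_matches(path, pattern):
    return fnmatch.fnmatch(os.path.basename(path), pattern)


filename_matches('/scratch/copy/RAW-R0027-DSSC00-S00000.h5', '*DSSC00*')  # True
```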
Finally, can I ask for a test? Look in `karabo_data/tests/test_reader_mockdata.py`, at `test_read_fxe_raw_run` for example. We just need to open the RunDirectory with an include pattern, and check one source that should be included and one that shouldn't.
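Something along these lines, perhaps - the fixture and source names below are guesses, to be adapted to the mock run used in the existing tests:

```python
# Hedged sketch of the requested test; fixture and source names are
# assumptions to match against the mock run in test_reader_mockdata.py.
from karabo_data import RunDirectory


def test_open_run_with_include(mock_fxe_raw_run):  # hypothetical fixture
    run = RunDirectory(mock_fxe_raw_run, include='*DA*')
    # A source saved through a Data Aggregator should be present...
    assert 'SA1_XTD2_XGM/DOOCS/MAIN' in run.all_sources
    # ...while big-detector sources, stored in separate files, should not.
    assert not any('LPD' in s for s in run.all_sources)
```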
Thanks!
karabo_data 0.7 is out now with this change and the run map caching. I've got a batch job populating the caches for proposal 2212, so this is now fast (well, a lot faster, anyway):
```
lsxfel /gpfs/exfel/exp/SCS/201901/p002212/raw/r0027
```
Opening a run in Python will populate the cache if it's not already populated, as will running the `lsxfel` command on the run directory.
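For example, from Python (same run path as above):

```python
# Opening the run from Python builds the cache on first use:
from karabo_data import RunDirectory

run = RunDirectory('/gpfs/exfel/exp/SCS/201901/p002212/raw/r0027')
run.info()  # prints a summary much like lsxfel, served from the cached run map
```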
Hi all,
opening the RunDirectory can take a significant amount of time for long runs with many sources. I often find that I only need a (known) subset of the DAQ files, in which case much of that time can be saved.
Examples:
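For instance, a hedged sketch of the intended usage - the run path is illustrative, and the keyword shown is the `include` form the thread settled on:

```python
from karabo_data import RunDirectory

# Only open the DSSC detector files, skipping everything else in the run:
dssc_run = RunDirectory('/gpfs/exfel/exp/SCS/201901/p002212/raw/r0027',
                        include='*DSSC*')

# Only open the data-aggregator files, skipping the big detector modules:
small_data = RunDirectory('/gpfs/exfel/exp/SCS/201901/p002212/raw/r0027',
                          include='*DA*')
```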
The proposed change gives this functionality. Of course this could be generalized by passing a filter function, but I actually think this very simple implementation is sufficient and easier to use (though I'd be happy with the filter function approach as well).
What do you think?
Cheers,
Michael