European-XFEL / karabo_data

Python tools to read and analyse data from European XFEL
https://karabo-data.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Add a simple filter interface to RunDirectory #221

Closed · dscran closed this 5 years ago

dscran commented 5 years ago

Hi all,

Opening a RunDirectory can take a significant amount of time for long runs with many sources. I often find that I only need a (known) subset of the DAQ files, in which case much of that time can be saved.

Examples:
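For instance, opening only the files for one detector module could look something like this (the `include=` keyword and pattern here are illustrative, matching the glob-style interface discussed below; the run path is just an example):

```python
from karabo_data import RunDirectory

# Only files whose names match the pattern are opened; other DAQ files are skipped.
run = RunDirectory('/gpfs/exfel/exp/SCS/201901/p002212/raw/r0027',
                   include='*DSSC00*')
```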

The proposed change gives this functionality. Of course this could be generalized by passing a filter function, but I actually think this very simple implementation is sufficient and easier to use (though I'd be happy with the filter function approach as well).

What do you think?

Cheers,

Michael

takluyver commented 5 years ago

Thanks!

We're also working on speeding up opening a run by caching the information about what data is in each file (#206), so you only actually open the files when you try to get data from them. This should provide significant speed ups, but of course the cache needs to be created first. We're hoping to get it created automatically when a run is written, but running anything automatically means getting a separate department involved.

I'd like to explore how easy that will be first, because if that all works, we can get the same performance benefits with no extra options needed - you could quickly open a run and then select the sources of interest with the existing APIs. But we should keep this in mind as a stopgap solution if it looks like it will take a long time to get the caches pre-populated.

takluyver commented 5 years ago

On further thought: let's do this. Getting the caches pre-populated will probably take a while, and in the meantime this is a substantial improvement for people who do know about the filenames and only want to access some of them. I think this is a case where "practicality beats purity".

I would recommend a couple of changes, though:

  1. Let's use glob patterns (include='*DSSC00*') rather than substring matches. We're already using them elsewhere in karabo_data, and they're a natural fit for selecting files, because that's what we usually use them for. Search the code for fnmatch to see how to check them efficiently - there's a brief sketch after this list. (Although it looks like a perfect fit, avoid the Python glob module in this case - we did use it in the past, but it silences permission errors.)
  2. In the interests of adding the minimum necessary API, I'd do only include, not exclude. exclude='*DSSC*' should be equivalent to include='*DA*' - all data apart from the big detectors is saved through a Data Aggregator. I can't immediately see a realistic use case for exclude that can't be replaced by include.
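As a rough illustration of the fnmatch-based check (the file names here are made up):

```python
import fnmatch

# Glob-style matching on names we already have, without touching the
# filesystem - so nothing gets silently skipped on permission errors,
# unlike the glob module.
names = ['RAW-R0027-DSSC00-S00000.h5', 'RAW-R0027-DA01-S00000.h5']
selected = [n for n in names if fnmatch.fnmatch(n, '*DSSC00*')]
# selected == ['RAW-R0027-DSSC00-S00000.h5']
```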

Feel free to push back on either of these if you think I'm overlooking something. :-)

dscran commented 5 years ago

Hi Thomas,

Thanks for the feedback! I completely agree with your suggestions and have updated the pull request accordingly. I'm very much looking forward to automatically cached run info, though :)

Cheers,

Michael

takluyver commented 5 years ago

Thanks! One more nitpick: I think the matching should apply only to the filenames, not their paths. I can't think of a case where it would make a difference with the canonical paths on Maxwell, but it might if someone has copied some data to another location.

Finally, can I ask for a test? Look in karabo_data/tests/test_reader_mockdata.py, at test_read_fxe_raw_run for example. We just need to open the RunDirectory with an include pattern, and check one source that should be included and one that shouldn't.
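Roughly something like this (the fixture and source names are assumptions modelled on the existing mock-data tests, not necessarily the exact ones to use):

```python
from karabo_data import RunDirectory

def test_read_fxe_raw_run_include(mock_fxe_raw_run):
    # Keep only files from the data aggregators (fixture/source names assumed).
    run = RunDirectory(mock_fxe_raw_run, include='*DA*')
    # A source saved through a data aggregator should still be present...
    assert 'SA1_XTD2_XGM/DOOCS/MAIN' in run.all_sources
    # ...while the detector modules, written to their own files, should not.
    assert 'FXE_DET_LPD1M-1/DET/0CH0:xtdf' not in run.all_sources
```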

takluyver commented 5 years ago

Thanks!

takluyver commented 5 years ago

karabo_data 0.7 is out now with this change and the run map caching. I've got a batch job populating the caches for proposal 2212, so this is now fast (well, a lot faster, anyway):

```bash
lsxfel /gpfs/exfel/exp/SCS/201901/p002212/raw/r0027
```

Opening a run in Python will populate the cache if it's not already populated, as will running the lsxfel command on the run directory.
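For instance (a minimal sketch; the path is the run from above):

```python
from karabo_data import RunDirectory

# Opening the run builds the map cache if it doesn't exist yet;
# subsequent opens of the same run reuse it and are much faster.
run = RunDirectory('/gpfs/exfel/exp/SCS/201901/p002212/raw/r0027')
```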

Batch script to populate caches for all runs in a proposal:

```bash
#!/usr/bin/bash
#SBATCH -p exfel
#SBATCH -t 4:00:00
#SBATCH --nice=10000  # Lower priority

set -euxo pipefail

source /usr/share/Modules/init/bash
module load exfel exfel_anaconda3
which lsxfel

runs_path="/gpfs/exfel/exp/SCS/201901/p002212/raw"
for run_dir in $runs_path/r*
do
    lsxfel "$run_dir"
done
```