European-XFEL / EXtra-data

Access saved EuXFEL data
https://extra-data.rtfd.io
BSD 3-Clause "New" or "Revised" License

Avoid giant memory allocations when reading xtdf data with pulse selection #576

Closed: takluyver closed this 4 days ago

takluyver commented 5 days ago

Fixes #573

To work around an HDF5 performance issue, we read pulse-selected data into a temporary array which remains mostly empty. This was being allocated per file, but then we introduced virtual overview files, which meant it was effectively allocated per run. Even though we don't use most of the space, there's still a limit on how much you can allocate.
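As a rough sketch of the pattern described above (the helper name and the exact selection calls are my assumptions, not EXtra-data's actual internals): a scratch array spans the whole frame range, and only the selected frames are read into their matching positions, so most of the array is never filled. Its size is what blows up once the range covers a whole run instead of a single file.

import h5py
import numpy as np

def read_selected_frames(dataset, sel_mask):
    # sel_mask: boolean array, one entry per frame in the dataset.
    # The scratch array spans every frame, so it grows with the file
    # (or with the whole run, via a virtual overview file), even though
    # only the selected rows are ever written to.
    scratch = np.empty(dataset.shape, dtype=dataset.dtype)
    idxs = np.nonzero(sel_mask)[0]
    # Matching source and destination selections keep the read layout
    # simple; unselected rows of the scratch array remain untouched.
    dataset.read_direct(scratch, source_sel=np.s_[idxs], dest_sel=np.s_[idxs])
    return scratch[sel_mask]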

The fix is in two parts:

With these changes, I can load a 10 GB single pulse selection from data that's 1.6 TB in total (the example in #573).

I don't think there's an easy way to test this; it's not meant to produce any visible change in behaviour except when memory allocation would fail.

philsmt commented 4 days ago

Looks good to me.

The word 'chunk' is a bit unfortunate when working with HDF5, since chunking already has a specific meaning there (the dataset storage layout).

Do you have some rough performance numbers for different values of length_limit? If there's a significant impact, it might be worth exposing it as an optional argument, though I doubt it will matter.

takluyver commented 4 days ago

I tried loading 1 pulse per train for 1000 trains, before and after the second commit (which adds length_limit); the limit I set would break this read into 13 chunks. As usual when reading data, the timings are all over the place, but in both versions the best time was a bit under 400 ms. It was similar with _use_voview=False, which (before length_limit) would split this data into 2-3 pieces (500 trains per file).

So it looks like the limit I've set doesn't make a big difference, and intuitively that makes sense - 5 GB should be big enough that the overhead of Python & HDF5 calls is swamped by reading (even if we're only actually reading a small fraction of that data).
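To make the splitting concrete, here's a minimal sketch of what a length limit implies (split_for_limit is a hypothetical helper; the real parameter's units and default aren't reproduced here): cut a frame range into pieces so that each temporary array stays under a fixed size, however large the (virtual) file is.

def split_for_limit(start, stop, frame_nbytes, byte_limit):
    # Yield (piece_start, piece_stop) sub-ranges of start:stop whose
    # scratch arrays take at most byte_limit bytes each.
    frames_per_piece = max(1, byte_limit // frame_nbytes)
    for piece_start in range(start, stop, frames_per_piece):
        yield piece_start, min(piece_start + frames_per_piece, stop)

Each piece then goes through the temporary-array read described above, so the per-call overhead only has to be amortised over a few GB at a time.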

I guess I should also try skipping the temporary array and reading frames one by one, to see how that compares.

philsmt commented 4 days ago

I'd recommend not spending too much time tuning the parameters of the current serial loop. I've recently observed again that almost any form of process parallelization trumps even the most carefully crafted sequential loop. So we can go with this PR as is, and rather spend the time parallelizing it, e.g. via pasha.

Context: I made a branch to parallelize AdqRawChannel.pulse_data(), but just trivially throwing pasha at it and working train by train was hardly, if at all, slower than cleverly trying to distribute the chunked loop over workers.
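For reference, a by-train pasha sketch looks roughly like this (the proposal/run numbers and the source/key are made up, and the kernel signature follows pasha's documented pattern for mapping over an EXtra-data run as best I recall, so treat it as an illustration rather than working analysis code):

import numpy as np
import pasha as psh
from extra_data import open_run

psh.set_default_context('processes', num_workers=8)

source, key = 'SPB_DET_AGIPD1M-1/DET/0CH0:xtdf', 'image.data'  # illustrative
run = open_run(proposal=700000, run=1).select(source, key)     # made-up run

# Shared-memory output, one value per train.
out = psh.alloc(shape=(len(run.train_ids),), dtype=np.float64)

def kernel(worker_id, index, train_id, data):
    # Each worker handles whole trains; no manual chunk distribution needed.
    out[index] = data[source][key].sum()

psh.map(kernel, run)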

takluyver commented 4 days ago

OK, I just tested reading frame by frame with code like this (inside _read_chunk):

# Read a subset of pulses from the chunk:
dataset_idxs = np.nonzero(inc_pulses_chunk)[0] + chunk_slice.start
# Where does this data go in the target array?
tgt_start_ix = self._sel_frames[:tgt_slice.start].sum()
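# Pair each dataset row index with its position in the output array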
for tgt_ix, read_ix in enumerate(dataset_idxs, start=tgt_start_ix):
    chunk.dataset.read_direct(
        mod_out, source_sel=(read_ix,) + roi, dest_sel=(tgt_ix,) + roi
    )

This is still several times slower: the best time was about 1.9 seconds, compared to 0.4 s. And I think my test case, reading only 1 pulse out of every 510, is about the most favourable one for this strategy. So I think the temporary arrays can stay.

takluyver commented 4 days ago

Thanks for the review. :smiley: