Closed · cfuselli closed this pull request 4 months ago
Keeps failing with:
=========================== short test summary info ============================
FAILED tests/test_FullChain.py::TestChunkedFullChain::test_PMTResponseAndDAQ -
numpy.exceptions.DTypePromotionError: field titles of field 'pulse_id' mismatch
I checked the outputs and they seem to be the same as before the PR; empty chunks also give the same result. Not sure what is causing the issue here. Giving up for today.
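For context, this class of error comes from numpy's structured-dtype promotion: two dtypes that share a field name but carry different field *titles* (the documentation strings strax attaches to each field) cannot be promoted. A minimal sketch of the failure mode, not taken from the fuse code itself:

```python
import numpy as np

# Two structured dtypes whose 'pulse_id' fields carry different titles
# (the first element of the name tuple); the field names themselves match.
a = np.dtype([(("Pulse id", "pulse_id"), np.int64)])
b = np.dtype([(("pulse id", "pulse_id"), np.int64)])

# Promotion (triggered e.g. when combining arrays of both dtypes) fails with
# numpy.exceptions.DTypePromotionError: field titles of field 'pulse_id' mismatch
np.result_type(a, b)
```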
| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| fuse/plugins/micro_physics/input.py | 24 | 26 | 92.31% |
| Total: | 24 | 26 | 92.31% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| fuse/plugins/micro_physics/input.py | 2 | 74.58% |
| Total: | 2 | |

| Totals | |
|---|---|
| Change from base Build 9451990213: | 0.05% |
| Covered Lines: | 2337 |
| Relevant Lines: | 2982 |
Fixed some mistakes and tested again. Based on my tests, there is no difference between the chunks produced before and after the PR. I also re-tested the speed and memory usage and updated the plot in the description. Results are the same: ~2x memory efficiency, ~2x processing speed.
What does the code in this PR do / what does it improve?
Changes to the file_loader class of the input plugin to make it more memory efficient. In some cases this plugin was the one using the most RAM during processing. With the new implementation, the RAM requirements are reduced by at least a factor of 2, and the speed improves by a similar factor. This is especially needed in view of the future PR #167, when we will need to keep all interactions, including those with zero energy deposit (ed).
Can you briefly describe how it works?
Instead of immediately transforming the full awkward array of interactions into a numpy array, we use only the times to calculate the chunk boundaries, and only transform into a numpy array after the chunks are defined.
So instead of converting everything up front, we first compute only the chunk boundaries: we calculate the dynamic_chunking based on interaction_time (instead of inter_reshaped['times'], but that is exactly the same array passed to the function). Once the boundaries are set, we make a "preselection" of interactions and transform only the selected interactions to numpy. Once the selected events are in the numpy/strax format, we again select the parts of those events that fall within the chunk boundaries.
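To make the flow concrete, here is a rough, self-contained sketch of the idea, with made-up names and a much simplified dynamic_chunking; it is not the actual fuse implementation:

```python
import numpy as np

def dynamic_chunking(times, gap):
    """Assign a chunk index to each sorted time; start a new chunk
    whenever the gap to the previous interaction exceeds `gap` [ns]."""
    chunk_idx = np.zeros(len(times), dtype=np.int64)
    chunk_idx[1:] = np.cumsum(np.diff(times) > gap)
    return chunk_idx

# Toy stand-in for the awkward array of interactions: flat arrays of
# times and energy deposits.
rng = np.random.default_rng(0)
times = np.sort(rng.uniform(0, 1e9, 10_000))
ed = rng.exponential(5.0, 10_000)

# Step 1: chunk boundaries from the times alone, without building the
# full (memory-hungry) record array yet.
chunk_idx = dynamic_chunking(times, gap=1e6)

# Step 2: convert only the interactions of one chunk at a time into the
# numpy/strax structured format.
dtype = [("time", np.int64), ("ed", np.float32)]
for i in np.unique(chunk_idx):
    sel = chunk_idx == i
    chunk = np.zeros(sel.sum(), dtype=dtype)
    chunk["time"] = times[sel].astype(np.int64)
    chunk["ed"] = ed[sel]
    # ... hand `chunk` to strax as one output chunk
```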
Can you give a minimal working example (or illustrate with a figure)?
This image shows the memory usage of calling next on file_reader.output_chunk. The four options are: the current and the proposed implementation, each with and without zero-ed interactions. As you can see in the image, the proposed chunking logic reduces the memory usage by a factor of ~2 and also improves the speed by a factor of ~2.