Use chunked loading in `ctapipe-train-*` tools

maxnoe commented 10 months ago

Please describe the use case that requires this feature.

At the moment, the `ctapipe-train-*` tools use `TableLoader.read_telescope_events` to load all telescope events for a given telescope type in one go.

This can use far more memory than necessary, because after loading we:

- apply quality criteria that throw away a significant percentage of the events,
- only use a subset of the available columns, and
- sub-sample events if `n_events` or `n_signal` / `n_background` are configured.

Describe the solution you'd like

Load the data in smaller chunks, apply the event selection and column selection to each chunk, and then merge the reduced chunks into the full training table. This lowers peak memory usage, because the discarded events and columns are never all held in memory at once.
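
To make the idea concrete, here is a minimal sketch of the chunk, filter, merge pattern using plain NumPy. It is not ctapipe code: `iter_chunks`, `load_training_table`, the `intensity` cut and the dict-of-arrays layout are all hypothetical stand-ins for whatever the `TableLoader` and the configured quality criteria would provide.

```python
import numpy as np


def iter_chunks(columns, chunk_size):
    """Yield consecutive row slices of a dict of equal-length column arrays.

    Stand-in for reading a file chunk by chunk; a real loader would read
    each slice from disk instead of slicing arrays already in memory.
    """
    n_rows = len(next(iter(columns.values())))
    for start in range(0, n_rows, chunk_size):
        stop = start + chunk_size
        yield {name: values[start:stop] for name, values in columns.items()}


def load_training_table(chunks, keep_columns, min_intensity, n_events=None, seed=0):
    """Filter and trim each chunk, then merge the survivors into one table.

    Assumes at least one chunk is provided.
    """
    selected = []
    for chunk in chunks:
        # Hypothetical quality criterion: keep only sufficiently bright events.
        mask = chunk["intensity"] > min_intensity
        # Boolean indexing copies, so only the surviving rows of the
        # requested columns are kept once the chunk goes out of scope.
        selected.append({name: chunk[name][mask] for name in keep_columns})

    # Merge the reduced chunks column by column.
    table = {
        name: np.concatenate([part[name] for part in selected])
        for name in keep_columns
    }

    # Optional sub-sampling, applied after the selection.
    n_available = len(table[keep_columns[0]])
    if n_events is not None and n_events < n_available:
        rng = np.random.default_rng(seed)
        idx = np.sort(rng.choice(n_available, size=n_events, replace=False))
        table = {name: values[idx] for name, values in table.items()}
    return table
```

With this layout, peak memory is roughly one full chunk plus the already reduced rows, rather than the full unfiltered table. Whether sub-sampling should happen per chunk or after merging is a design choice this sketch does not settle.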

kosack commented 10 months ago

For the quality criteria: PyTables has efficient filtering (`table.where()`) that could also be used to filter events before creating the astropy tables, and even before chunking. That would require some lower-level changes to how data are read, though, and I'm not sure the added complexity is worth it.

maxnoe commented 10 months ago

We already support that in `read_table`: https://github.com/cta-observatory/ctapipe/blob/7d32c650ffeb580b5923b6a5de708a25af92f27c/ctapipe/io/astropy_helpers.py#L89-L94

It is used to filter the telescope trigger table by `tel_id` in the `TableLoader`.
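
For readers unfamiliar with the PyTables filtering discussed above, the following is a small self-contained sketch using PyTables directly, not ctapipe's `read_table`. The table layout, column names and cut values are made up for illustration. It uses `Table.read_where`, the variant of `where()` that returns the matching rows as a NumPy structured array.

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical event table; the column names are illustrative only.
events = np.zeros(6, dtype=[("tel_id", "i4"), ("intensity", "f8")])
events["tel_id"] = [1, 2, 1, 3, 2, 1]
events["intensity"] = [10.0, 80.0, 120.0, 40.0, 200.0, 55.0]

condition = "(tel_id == 1) & (intensity > 50)"

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "events.h5")

    # Write the example table to a temporary HDF5 file.
    with tables.open_file(path, mode="w") as h5file:
        h5file.create_table("/", "events", obj=events)

    with tables.open_file(path, mode="r") as h5file:
        table = h5file.root.events
        # The condition is evaluated by PyTables while reading, so only
        # matching rows end up in the returned array.
        selected = table.read_where(condition)
        # start/stop restrict the query to a row range, which is how the
        # filter could be combined with chunked reading.
        first_half = table.read_where(condition, start=0, stop=3)
```

Here `selected` holds the two rows with `tel_id == 1` and `intensity > 50`, and `first_half` only the one of them that lies in the first three rows.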

cta-observatory / ctapipe

Use chunked loading in `ctapipe-train-*` tools #2413