cta-observatory / ctapipe

Low-level data processing pipeline software for CTAO or similar arrays of Imaging Atmospheric Cherenkov Telescopes
https://ctapipe.readthedocs.org
BSD 3-Clause "New" or "Revised" License

Use chunked loading in `ctapipe-train-*` tools #2413

Closed: maxnoe closed this issue 9 months ago

maxnoe commented 10 months ago

Please describe the use case that requires this feature.

At the moment, the `ctapipe-train-*` tools use `TableLoader.read_telescope_events` to load all telescope events for a given telescope type in one go.

This potentially uses large amounts of memory given that we

Describe the solution you'd like

Load the data in smaller chunks, apply the event selection and column selection to each chunk, and then merge the reduced chunks into the single large training table. Only the selected rows and columns are then held in memory at once, which reduces overall memory usage.
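A minimal sketch of the idea, using plain NumPy structured arrays rather than the actual `TableLoader` API. The function names, the `intensity` column, and the threshold are hypothetical and only illustrate the chunk, select, merge pattern:

```python
import numpy as np


def iter_chunks(table, chunk_size):
    """Yield consecutive row slices of `table` (stand-in for chunked reads from disk)."""
    for start in range(0, len(table), chunk_size):
        yield table[start:start + chunk_size]


def load_selected(table, chunk_size, columns, min_intensity):
    """Apply row and column selection per chunk, then merge the reduced chunks."""
    selected = []
    for chunk in iter_chunks(table, chunk_size):
        # Hypothetical quality criterion; a real tool would use its configured query.
        mask = chunk["intensity"] > min_intensity
        # Keep only the needed columns of the rows that pass the selection.
        selected.append({name: chunk[name][mask] for name in columns})
    # Merge the small per-chunk results into one table-like dict of arrays.
    # Assumes at least one chunk was read (np.concatenate fails on an empty list).
    return {name: np.concatenate([c[name] for c in selected]) for name in columns}


# Hypothetical demo data: 10 events with three columns.
events = np.zeros(10, dtype=[("event_id", "i8"), ("intensity", "f8"), ("width", "f8")])
events["event_id"] = np.arange(10)
events["intensity"] = 10.0 * np.arange(10)

training = load_selected(events, chunk_size=3, columns=["event_id", "intensity"], min_intensity=45.0)
```

In this sketch, peak memory is roughly one unfiltered chunk plus the already-reduced chunks, instead of the full unfiltered table.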

kosack commented 10 months ago

For the quality criteria: PyTables has efficient filtering (`table.where()`) that could also be used to filter events before creating the astropy tables, and even before chunking. That would require some lower-level changes to how the data are read, though, and I'm not sure the added complexity is worth it.
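To illustrate what I mean, here is a rough sketch of filtering inside PyTables using `Table.read_where`, the array-returning counterpart of `table.where()`. This is not how ctapipe reads its files; the file layout, the `read_filtered` helper, and the `intensity` column are made up for the example:

```python
import os
import tempfile

import numpy as np
import tables


def read_filtered(path, node, condition, condvars=None):
    """Return only the rows of an HDF5 table that satisfy `condition`."""
    with tables.open_file(path, mode="r") as h5:
        table = h5.get_node(node)
        # The condition string is evaluated by PyTables while reading,
        # so only matching rows end up in the returned structured array.
        return table.read_where(condition, condvars=condvars)


# Hypothetical demo data written to a temporary HDF5 file.
events = np.zeros(10, dtype=[("event_id", "i8"), ("intensity", "f8")])
events["event_id"] = np.arange(10)
events["intensity"] = 10.0 * np.arange(10)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "events.h5")
    with tables.open_file(path, mode="w") as h5:
        h5.create_table("/", "events", obj=events)
    bright = read_filtered(
        path, "/events", "intensity > min_intensity", {"min_intensity": 45.0}
    )
```

Passing the threshold through `condvars` keeps the condition string fixed and avoids formatting values into it.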

maxnoe commented 10 months ago

We already support that in `read_table`: https://github.com/cta-observatory/ctapipe/blob/7d32c650ffeb580b5923b6a5de708a25af92f27c/ctapipe/io/astropy_helpers.py#L89-L94

and it is used to filter the telescope trigger table by `tel_id` in the `TableLoader`.