cta-observatory / ctapipe

Low-level data processing pipeline software for CTAO or similar arrays of Imaging Atmospheric Cherenkov Telescopes
https://ctapipe.readthedocs.org
BSD 3-Clause "New" or "Revised" License

Add possibility to append to existing file for DataWriter / ctapipe-process #2663

Open maxnoe opened 5 days ago

maxnoe commented 5 days ago

Please describe the use case that requires this feature.

The current processing on the GRID runs multiple ctapipe-process processes for multiple input files and then merges the resulting small files. This is done because the jobs would otherwise be too short to be efficient with the GRID job submission system.

It would be more efficient to not process and then merge but to directly append to a single output file.

Describe the solution you'd like

Add the possibility for DataWriter and ctapipe-process to append to an existing output file.

Alternatives considered

Create a tool / modify ctapipe-process to directly run on multiple input files.

kosack commented 5 days ago

To be clear: I assume you don't mean simultaneous writing from many jobs to one, as that is what we do on the grid, but rather within one grid job allowing multiple input files.

Really this has nothing to do with the grid processing; it is more about allowing multiple EventSources to be chained (like itertools.chain) and having DataWriter correctly write the configuration data when the input changes. That mainly means re-running DataWriter._setup_outputfile() when the obs_id changes (which I think is the minimal way to detect a new EventSource in the stream). DataWriter on its own doesn't know about input files, only the "event" structure in memory, but it currently assumes that no "header" info changes event-to-event.
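To illustrate the idea, here is a minimal sketch of chaining several event streams and re-running the setup step whenever the obs_id changes. It does not use ctapipe itself: `Event`, `SketchWriter`, and `fake_source` are hypothetical stand-ins, and only the method name `_setup_outputfile` is borrowed from the comment above. The real DataWriter may need more than this (for example, avoiding duplicate header entries).

```python
from dataclasses import dataclass, field
from itertools import chain
from typing import List, Optional


@dataclass
class Event:
    """Hypothetical stand-in for the in-memory event structure."""
    obs_id: int
    event_id: int


@dataclass
class SketchWriter:
    """Hypothetical stand-in for DataWriter; records calls instead of writing."""
    current_obs_id: Optional[int] = None
    header_writes: List[int] = field(default_factory=list)
    events_written: int = 0

    def _setup_outputfile(self, event: Event) -> None:
        # Placeholder for writing the per-observation "header" data.
        self.header_writes.append(event.obs_id)

    def write(self, event: Event) -> None:
        # Re-run the setup whenever obs_id differs from the previous event's.
        if event.obs_id != self.current_obs_id:
            self._setup_outputfile(event)
            self.current_obs_id = event.obs_id
        self.events_written += 1


def fake_source(obs_id: int, n_events: int):
    """Hypothetical event source yielding n_events events for one obs_id."""
    return (Event(obs_id, i) for i in range(n_events))


# Chain two sources into one stream, as itertools.chain would.
writer = SketchWriter()
for event in chain(fake_source(1, 3), fake_source(2, 2)):
    writer.write(event)
```

In this sketch the header step runs once per obs_id, while every event in the chained stream is written to the same writer.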

maxnoe commented 5 days ago

Really this has nothing to do with the grid processing,

Technically no, but this is where the motivation comes from.

DataWriter on its own doesn't know about input files, only the "event" structure in memory, but it currently assumes that no "header" info changes event-to-event.

Yes, which is why it might be easier to just support running ctapipe-process multiple times with the same output file, and to introduce an --append option that neither overwrites the output file nor requires that it not exist.
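The semantics of such an option could look roughly like the sketch below. This is not ctapipe code: `choose_file_mode` and its `overwrite` / `append` parameters are hypothetical names used only to show the three cases (append, overwrite, refuse), under the assumption that the two options are mutually exclusive.

```python
import tempfile
from pathlib import Path


def choose_file_mode(path, overwrite=False, append=False):
    """Return "a" or "w" for the output file (hypothetical helper).

    - append: open for appending, whether or not the file already exists
    - overwrite: replace an existing file
    - neither: refuse to touch an existing file
    """
    path = Path(path)
    if overwrite and append:
        raise ValueError("overwrite and append are mutually exclusive")
    if append:
        return "a"
    if path.exists() and not overwrite:
        raise FileExistsError(f"{path} exists; use overwrite or append")
    return "w"


# Usage: two consecutive runs targeting the same output file.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "events.h5"
    first_mode = choose_file_mode(out, append=True)   # file does not exist yet
    out.touch()                                       # pretend the first run wrote it
    second_mode = choose_file_mode(out, append=True)  # file exists, still appends
    try:
        choose_file_mode(out)                         # no option given: refuse
        refused = False
    except FileExistsError:
        refused = True
```

Choosing the mode is only part of the problem: the header data discussed above would still have to be handled correctly when a second run appends to the file.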