kosack opened 5 years ago
Karl,
Thanks for the suggestions! Our goal is definitely to bring the code fully in line with ctapipe and use existing functionality as much as possible to avoid duplicating the large amount of work that the ctapipe devs have done.
Regarding your points 2, 3, and 4, all of these sound like they would simplify things considerably and make them more consistent with the ctapipe codebase. I think it should be no problem making these changes.
Regarding point 1, I think this might relate more to a larger issue with the ctapipe containers and the PyTables HDF5 format. From what I understand, the ctapipe container classes are all attribute-based/map-based, i.e. DL1Container.tel contains a variable number of keys which map to DL1CameraContainers, which contain the DL1 data for each camera. And the DL1CameraContainer contains an image attribute which maps to a numpy array of variable size, depending on the number of pixels in that particular camera.
As far as I can tell, it isn't trivial to represent this sort of structure in a database-like format like PyTables, where by definition the lengths/sizes of the columns are specified in the schema. To a certain degree, the tree-like structure of the ctapipe containers can be imitated with nested columns (we didn't use these in DL1DataDumper, but it might be worth considering), but if you have fields of variable length/size like DL1Container.tel or DL1CameraContainer.image, I think you still run into the problem of how to represent them. The obvious way seems to be to always pad columns to their maximum length, which can result in a lot of wasted storage space. Alternatively, you can place every telescope type in a separate table, which is what we did. However, this still leaves the issue of associating each "DL1CameraContainer-equivalent" row with the corresponding "DL1Container-equivalent" row.

Trying to address this issue was the motivation behind designing DL1DataDumper with custom table definitions rather than a 1-to-1 correspondence between ctapipe container fields and PyTables columns. We tried to avoid the need to pad variable-length columns by introducing a key-like field in the "DL1Container-equivalent" table which maps to rows in "DL1CameraContainer-equivalent" tables (one for each telescope type). This way, within each table the field sizes are consistent and the wasted space is minimized.
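To make the key-based layout concrete, here is a minimal stand-alone sketch in plain Python (field names, pixel counts, and the dump_event helper are illustrative only, not DL1DataDumper's actual schema):

```python
# Sketch of the two-table layout: one "event" table holds per-telescope-type
# index columns, and one image table per telescope type holds fixed-size
# pixel arrays. All names and sizes here are invented for illustration.

# Image tables: one per telescope type; every row within a table has the
# same (type-specific) pixel count, so no padding is needed.
image_tables = {
    "LST": [],
    "MST": [],
}

# Event table: each row records, per telescope type, the row indices of the
# images belonging to that event (empty if that type did not trigger).
event_table = []

def dump_event(event_id, images_by_type):
    """Append an event's images and record their row indices in the event table."""
    row = {"event_id": event_id}
    for tel_type, table in image_tables.items():
        indices = []
        for image in images_by_type.get(tel_type, []):
            indices.append(len(table))  # key-like field pointing into the image table
            table.append(image)
        row[tel_type + "_indices"] = indices
    event_table.append(row)

# Two events: the first triggers one LST and one MST, the second two LSTs.
dump_event(0, {"LST": [[0.1] * 4], "MST": [[0.2] * 3]})
dump_event(1, {"LST": [[0.3] * 4, [0.4] * 4]})

# Reading back event 1's LST images via the index column:
lst_images = [image_tables["LST"][i] for i in event_table[1]["LST_indices"]]
```

The point of the design is visible in the last line: within each image table all rows have a consistent size, and the variable-multiplicity association lives entirely in the index columns of the event table.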
I haven't taken a look at ctapipe.io.HDF5TableWriter for over a month, but from what I can gather, although it's possible to create almost the same structure as DL1DataDumper, it doesn't explicitly address this issue of associating DL1CameraContainers with DL1Containers while also handling variable column sizes.
Let me know if I'm misunderstanding how ctapipe.io.HDF5TableWriter actually works, or if there is some alternative way of avoiding this problem. I think it would be great if DL1DataHandler could be part of a discussion with ctapipe about fully formalizing/specifying an approach to HDF5 dumping.
I will also tag @aribrill and @nietootein here in case they have any thoughts to add as well.
This issue has evolved into migrating the DL1DH and {CT, Gamma}Learn modules to ctapipe components, but the title still fits. I've assigned myself to this task.
I haven't been following the development of this code much, but looking quickly at it, it seems to re-implement a lot of what is available in ctapipe, specifically the functionality of the HDF5TableWriter (which turns a ctapipe Container class into a row of an HDF5 table automatically). The ctapipe io system was designed exactly for the use case of writing DL1 data, so it shouldn't be forgotten. We've been busy making it more useful recently as well, and there are still a few open issues for increased functionality.
I suspect much of this code could be simplified by using it (probably by >50%), and it gives you the advantage of keeping physical unit info, metadata, etc, which will be required in CTA output files.
I'd suggest:

1. Using ctapipe.io.HDF5TableWriter to simplify most of the tables you create, and getting rid of your custom table definitions (the Container classes should be sufficient, and if not we can define new ones). You can basically throw away all of the code in dump_event this way.
2. Using the Provenance() system, which contains all of the input files, output files, and the machine info, Python version, etc., instead of using a custom solution. It gives you all this info automatically as a JSON object or dict.
3. Using the event_source() helper function, which will construct the correct EventSource based on the file contents (e.g. if you open a simtel file, it checks the magic number and returns a SimTelEventSource), so the user just needs to specify a filename.
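The dispatch-by-file-contents pattern behind point 3 can be illustrated with a simplified stand-alone sketch (the source classes and detection table below are invented for the example and are not ctapipe's actual logic; only the HDF5 signature bytes are real):

```python
import os
import tempfile

# Simplified illustration of a factory that picks a reader class by
# inspecting a file's magic number, as event_source() does conceptually.
# SimTelLikeSource/HDF5LikeSource are hypothetical stand-ins.

class SimTelLikeSource:
    name = "simtel-like"

class HDF5LikeSource:
    name = "hdf5-like"

# Map of leading bytes to source classes. Real HDF5 files start with
# the signature \x89HDF\r\n\x1a\n; we match on its first four bytes.
MAGIC_TO_SOURCE = {
    b"\x89HDF": HDF5LikeSource,
}

def open_source(path):
    """Return a source instance chosen by the file's leading bytes."""
    with open(path, "rb") as f:
        magic = f.read(4)
    for prefix, cls in MAGIC_TO_SOURCE.items():
        if magic.startswith(prefix):
            return cls()
    return SimTelLikeSource()  # fallback, purely for this example

# Demo: a file beginning with the HDF5 signature is routed to HDF5LikeSource.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x89HDF\r\n\x1a\nrest-of-file")
    path = tmp.name
source = open_source(path)
os.remove(path)
```

The benefit for the user is the same as described above: they pass a filename, and the correct reader is selected without them having to know or specify the file format.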