cta-observatory / dl1-data-handler

DL1 HDF5 format IACT data writer + reader + processor
MIT License
11 stars 18 forks source link

use more ctapipe functionality #31

Open kosack opened 5 years ago

kosack commented 5 years ago

I haven't been following the development of this code much, but looking quickly at the code, it seems to re-implement a lot of what is available in ctapipe, specifically the functionality of the HDF5TableWriter (which turns a ctapipe Container class into a row of a HDF5 table automatically). The ctapipe io system was designed exactly for the use case of writing DL1 data, so shouldn't be forgotten. We've been busy making it more useful recently as well, and there are still a few issues out for increased functionality.

I suspect much of this code could be simplified by using it (probably by >50%), and it gives you the advantage of keeping physical unit info, metadata, etc, which will be required in CTA output files.

I'd suggest:

bryankim96 commented 5 years ago

Karl,

Thanks for the suggestions! Our goal is definitely to bring the code fully in line with ctapipe and use existing functionality as much as possible to avoid duplicating the large amount of work that the ctapipe devs have done.

Regarding your points 2, 3, and 4, all of these sound like they would simplify things considerably and make them more consistent with the ctapipe codebase. I think it should be no problem making these changes.

Regarding point 1, I think this might relate more to a larger issue with the ctapipe containers and the PyTables HDF5 format. From what I understand, the ctapipe container classes are all attribute-based/map-based, i.e. DL1Container.tel contains a variable number of keys which map to DL1CameraContainers, which contain the DL1 data for each camera. And the DL1CameraContainer contains an image attribute which maps to a numpy array of variable size, depending on the number of pixels in that particular camera.

As far as I can tell, it isn't trivial to represent this sort of structure in a database-like format like PyTables, where by definition the lengths/sizes of the columns are specified in the schema. To a certain degree, the tree-like structure of the ctapipe containers can be imitated with nested columns (we didn't use these in DL1DataDumper, but it might be worth considering), but if you have fields of variable length/size like DL1Container.tel or DL1CameraContainer.image, I think you still run into a problem of how to represent this. The obvious way seems to be to always pad columns to their maximum length, which can result in a lot of wasted storage space. Alternatively, you can place every telescope type in a separate table, which is what we did. However, this still leaves the issue of associating each "DL1CameraContainer-equivalent" row with the corresponding "DL1Container-equivalent" row. Trying to address this issue was the motivation behind designing DL1DataDumper with custom table definitions rather than a 1-to-1 correspondence between ctapipe container fields and PyTables columns. We tried to avoid the need to pad variable-length columns by introducing a key-like field in the "DL1Container-equivalent" table which maps to rows in "DL1CameraContainer-equivalent" tables (one for each telescope type). This way, within each table the field sizes are consistent and the wasted space is minimized.

I haven't taken a look at ctapipe.io.HDF5TableWriter for over a month, but from what I can gather, although it's possible to create almost the same structure as DL1DataDumper, it doesn't explicitly address this issue of associating DL1CameraContainers with DL1Containers while also handling variable column sizes.

Let me know if I'm misunderstanding how ctapipe.io.HDF5TableWriter actually works, or if there is some alternative way of avoiding this problem. I think it would be great if DL1DataHandler could be part of a discussion with ctapipe about fully formalizing/specifying an approach to HDF5 dumping.

I will also tag @aribrill and @nietootein here in case they have any thoughts to add as well.

TjarkMiener commented 2 years ago

This issue has evolved in migrating the DL1DH, {CT, Gamma}Learn modules to ctapipe components, but the title still suits well. I've assigned myself to this task.