cta-observatory / ctapipe

Low-level data processing pipeline software for CTAO or similar arrays of Imaging Atmospheric Cherenkov Telescopes
https://ctapipe.readthedocs.org
BSD 3-Clause "New" or "Revised" License
62 stars 266 forks source link

Propose CEP-4: module structure #2489

Open maxnoe opened 6 months ago

maxnoe commented 6 months ago

Hi all,

I made a first rough draft of CEP-4, to address the module structure. Please comment on the proposed modules.

codecov[bot] commented 6 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Comparison is base (8d61c03) 92.47% compared to head (46430c8) 92.47%.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #2489 +/- ## ======================================= Coverage 92.47% 92.47% ======================================= Files 234 234 Lines 19965 19965 ======================================= Hits 18462 18462 Misses 1503 1503 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

ccossou commented 5 months ago

I've read the document but my understanding could make my comment not very usefull (take it as what a new developper would see if it looked at the ctapipe structure (old and new). My concern is about the different categories. It's difficult to understand, from the list alone, the hierarchy of the various tools. I know this is not the current plan, but I would have added a layer of sub-module to further categorise the functions based on the datalevel they are supposed to use as input, i.e (waveform functions, image functions, other?). From a quick look at the structure, it's not easy (at least for me) to understand what type of analysis and input data I am expecting in the function inside).

It is fine for me to keep a sub layer, even in core, so as to have as little sub-module in the top level. Tool and Component, in particular, are part of core, so either in a core.config, or in .core.

Does it make sense to rename .processors into just .api or something to clearly identify that inside we'll find the high level functions?

To rephrase my comment, if I take the list of new submodules, I don't really know what to expect in the following submodules .atmosphere: is it processing related to atmosphere, or classes to define or read an atmosphere file? .coordinates: I wouldn't mix a submodule about coordinates with another submodule with processing functions about cleaning,... .data_volume_reduction: this is related to IO to me, and should be, along IO function, in an IO submodule .gain_selection: what level are we talking about, waveform? .instrument: if we define instrument type, I would put that into core .processors: this is a high level API, but Tool are also in the API I think and could be regrouped .reconstruction: I assume it's machine learning but it's a bit generic, what's the idea behind this category, we put every machine learning code in it? Even if they are all reconstruction models, in practice, they require different data level and are not interchangeable, so at the very least I would split further in that submodule to reflect the hierarchy of the different models. .tools: the word tools usually mean it's a submodule to regroup random things compared to the rest in general (like utils), while here it's the top level interface and should be visible in a quick glance .visualization: is it for plot? specific kind of plot, or any plot like visualisation of telescope shape, image, etc...?

My general idea would be to have a very high level organisation for the first level, this would help to immediately visualise what every sub-module of each section will be about. e.g. data_model/muon is a muon datamodel, but io/muon is about reading or writing muons (just a random useless example): .core config .data_model .io data_volume_reduction

.api: everything that can be called from the command line) processors tools .steps dl0: not pushing hard for this list of names... dl1 dl2a dl2b dl2c ...

My view is external to ctapipe still. I haven't worked enough on ctapipe to have a clear understanding of it just by looking at the code structure, and these are the comments that comes to mind when I look at it, judging from other code base I've seen before.

And I wouldn't be afraid of adding a lot of folders, so that each folder can be imported separately as an entity. I would add a folder for each analysis step for instance, and each high level processor so that you can further split into sub files if needed, and not have one gigantic file with multiple things in it.

For instance the pipeline I know is like this: ├── core ├── datamodels │   ├── datamodels │   │   ├── metaschema │   │   ├── schemas │   │   └── tests │   └── tests │   └── schemas ├── lib ├── pipeline │   ├── detector1.cfg │   ├── detector1.py │   ├── init.py └── steps ├── dark_current │   └── tests ├── dq_init │   └── tests ├── gain_scale │   └── tests ├── group_scale │   └── tests ├── ipc ├── jump │   └── tests ├── linearity │   └── tests ├── ramp_fitting │   └── tests │   │   ├── test_ramp_fit_cases.py     │   ├── test_ramp_fit.py    │   └── test_ramp_fit_step.py ├── refpix │   └── tests ├── saturation │   └── tests └── superbias └── tests

the pipeline submodule are the high level interface, they just reuse steps that are defined in the steps submodule. Each step (see the ramp_fitting folder that I did not strip of its internal structure) need a *_step.py that define the main class, the rest is implementation specific and up to the developper. Every abstract class is defined in core (Pipeline, Step, ...) Every step is self contained in its folder. You only need to import pipeline if you want just the interface.

I'm not saying this is the best view, but I figured I could provide another type of input, in the hope it's usefull.

Best, Christophe