cta-observatory / ctapipe

Low-level data processing pipeline software for CTAO or similar arrays of Imaging Atmospheric Cherenkov Telescopes
https://ctapipe.readthedocs.org
BSD 3-Clause "New" or "Revised" License

Discussion on restructuring Event Container structure #1165

Open kosack opened 4 years ago

kosack commented 4 years ago

Since it's come up a few times that things in the current Event container structure are a bit haphazard, let's start a discussion on what should move. For reference, here is what it is now:

kosack commented 4 years ago

Does anyone remember why event.mc.tel[x].photo_electron_image is a Map, not a Field containing an ndarray? (And why is the Map not a map to any specific type, which seems to break things?)

dneise commented 4 years ago

I think the docs of tels_with_data could be updated to "set of tel_ids for telescopes with data"

watsonjj commented 4 years ago

Does anyone remember why event.mc.tel[x].photo_electron_image is a Map, not a Field containing an ndarray? (And why is the Map not a map to any specific type, which seems to break things?)

I think this was just a mistake

watsonjj commented 4 years ago

Can you clarify again what is meant to be the difference between tels_with_data and tel.keys()?

watsonjj commented 4 years ago

Related issues: #1041 #722

watsonjj commented 4 years ago

We are still missing a place to store dl1b (hillas etc.)

maxnoe commented 4 years ago

I would switch hierarchies to make things clearer:

The toplevel should be an ArrayEvent containing meta information and a list or map of telescope events.

Telescope events have all the necessary telescope-wise information and maybe a backref to its array event if available.

This has a few advantages imho:

Telescope-Reconstruction algorithms like calibration and image parameterization can get just the telescope event and do their thing, no strange looping over tel_ids accessing multiple telescope-wise maps.

compare e.g.:

calibrate(data) # calibrate has an internal loop over all telescope events
for tel_id in data.tels_with_data:
    pointing = data.pointing.tel[tel_id]
    r0 = data.r0.tel[tel_id]
    dl0 = data.dl0.tel[tel_id]
    print(pointing, r0, dl0)

with

for telescope_event in array_event.telescope_events:
    calibrate(telescope_event)   # no loop over events, just calibrate this telescope event
    print(telescope_event.r0, telescope_event.pointing, telescope_event.dl0)

stereo_reconstruction(array_event)

Observed data event sources could then produce TelescopeEvents that are merged into ArrayEvents after or before lower-level processing for stereo reconstruction.

I haven't yet found a drawback of this, and everything else seems to get much nicer.

This is also about data locality. It keeps data that belongs together closer together in the code, as opposed to keeping data of similar kind but from different location together.
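For concreteness, the proposed inversion could be sketched with plain dataclasses (all names here, TelescopeEvent, ArrayEvent, and their fields, are illustrative, not existing ctapipe classes):

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


# Purely illustrative sketch, not ctapipe API.
@dataclass
class TelescopeEvent:
    tel_id: int
    r0: Optional[object] = None        # raw waveforms
    dl0: Optional[object] = None       # calibrated waveforms
    pointing: Optional[object] = None  # telescope pointing at event time


@dataclass
class ArrayEvent:
    event_id: int
    tel: Dict[int, TelescopeEvent] = field(default_factory=dict)


def calibrate(tel_event: TelescopeEvent) -> None:
    # Works on a single telescope event; no tel_id map bookkeeping needed.
    tel_event.dl0 = tel_event.r0  # placeholder for the real calibration


event = ArrayEvent(event_id=1, tel={1: TelescopeEvent(tel_id=1, r0="waveform")})
for tel_event in event.tel.values():
    calibrate(tel_event)
```

A calibrator written against TelescopeEvent never has to index several parallel telescope-wise maps.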

kosack commented 4 years ago

@MaxNoe I agree . The old scheme was trying to follow the top-level data model naming scheme, which wasn't really meant to be a storage hierarchy. The only question for me is whether or not to swap the data level and telescope, since separating things by data level is still useful (since you don't need to pass DL0 info on to higher steps, etc). So another option (perhaps a bit more complex) is:

event.dl0.array.telescope (which is not far from what is there now, but perhaps can be simplified)

as opposed to:

event.array.telescope.dl0.

I'm thinking mostly of the use case of parallelization, where you want to be able to send around the least amount of info to a remote process. So if I'm doing the reconstruction step, I only need to "scatter" the dl1 info to multiple processors, and I don't want to also keep the dl0 info. I suppose it depends on where the scatter operation happens, at the telescope event level, or array event level (certainly for reconstruction, it's at the array event level)

However, some thing like where to put data shared between data levels (pointing, MC shower info, etc) works better in the second case (your proposal), since you could have event.array.telescope.pointing. Otherwise, we'd need to have the pointing info in each data level (which isn't really a memory problem, since it's just a reference to the same data, but is perhaps confusing)
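The point that a shared reference costs no extra memory can be shown with plain dicts (illustrative only):

```python
# One pointing record, referenced from two data levels.
pointing = {"alt_deg": 70.0, "az_deg": 180.0}

dl0 = {"pointing": pointing}
dl1 = {"pointing": pointing}

# Both levels see the same object; nothing is copied.
same_object = dl0["pointing"] is dl1["pointing"]
```

The duplication is only in the names, not in the data, which is why it is confusing rather than costly.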

kosack commented 4 years ago

Can you clarify again what is meant to be the difference between tels_with_data and tel.keys()?

I think the rationale was this:

dneise commented 4 years ago

I am also hugely in favor of Max's proposal. In another issue @MaxNoe also mentioned the service approach used e.g. in fact-tools to get information that is not naturally synced with the event timing, such as pointing. I think telescope events should also carry refs to their respective services. I did not follow this discussion closely, but I assume these services would be accessed lazily, right?

kosack commented 4 years ago

fact-tools to get information, that is not naturally synced with the event timing,

There's another discussion about that in the various calibration issues, but no real resolution. What you are talking about is basically the difference between Monitoring data and Event data that has been interpolated from Monitoring data. Monitoring data (a time series of measurements that is not synced to the event stream, but rather taken once a second or so) is so far not treated in a nice way. But the idea (perhaps similar to the FACTTools services) is that you would have a Component that loads up monitoring data, constructs an interpolator, and writes the interpolated result to the event.pointing container (for pointing), or even event.array.tel[x].calib.pedestal (or wherever we put that).

Therefore what is in the event are instantaneous values of those quantities, interpolated to the current event time. Other Components that use them do not need to know how they were interpolated (or even whether they were computed on-the-fly, like pedestals perhaps), just that the value in the Container is the right one for this event. Therefore they don't need to distinguish between data coming from a "service" or from an event stream.

Here's a sketch: [image]
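A minimal version of that idea, interpolating a slow monitoring time series to an event timestamp (names are hypothetical; np.interp stands in for whatever interpolator the Component would actually build):

```python
import numpy as np


class MonitoringInterpolator:
    """Sketch of a Component that interpolates monitoring samples
    (taken e.g. once per second) to an exact event time.
    Illustrative only, not a ctapipe class."""

    def __init__(self, sample_times, sample_values):
        self.sample_times = np.asarray(sample_times)
        self.sample_values = np.asarray(sample_values)

    def at(self, event_time):
        # Instantaneous value for this event; downstream Components
        # just read the filled container, unaware of the interpolation.
        return float(np.interp(event_time, self.sample_times, self.sample_values))


interp = MonitoringInterpolator([0.0, 1.0, 2.0], [10.0, 12.0, 14.0])
altitude = interp.at(0.5)  # -> 11.0
```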

kosack commented 4 years ago

For @MaxNoe's proposal, the question is whether it is better to have a single "event data" as shown above, or something like we have now (but cleaned up), where there is a clear separation of input and output data.

What we have now is like this:

[image]

Which means you could parallelize (or just serialize to disk) one step of the data flow, by data level. If we invert the tree, this is harder (even if nicer conceptually in memory).

What's of course missing in this diagram (and important), is where is the data that is shared between data levels?

dneise commented 4 years ago

where there is a clear separation of input and output data.

I'm sorry but I do not see the clear separation in the image above:

So what, I ask, can prevent me from writing a Component or accidentally misusing a Component, that reads from DL0 and writes back into DL0?

kosack commented 4 years ago

@dneise Yes, both your statements are correct. I'm just showing the current design, where everything within a single Tool reads and writes to one central data structure called "Event". There is nothing stopping a rogue component from modifying another's data (other than regression tests, etc). Are you suggesting another design? In the suggestion from @MaxNoe, it's even more linked, since the data levels cannot be separated into separate objects (and there can only be one central "array event" object shared with each stage of the analysis).

In principle we could implement a "owner" system to prevent writing to places where one shouldn't, but not sure it's worth the effort.

There are definitely good and bad points to this design, but so far it's not been a major problem. Especially if we later separate larger stages of analysis by on-disk files, then the ownership and input/output is clear (only in-memory it needs to be considered)

dneise commented 4 years ago

I was trying to point out that this statement here:

For @MaxNoe's proposal, the question is whether it is better to have a single "event data" as shown above, or something like we have now (but cleaned up), where there is a clear separation of input and output data.

seemed not entirely true to me. From my personal point of view, both Max's proposal and the current solution are equivalent with respect to how linked the data levels are. I had the impression you were pointing out a downside of Max's proposal, and I wanted to point out that this is not a unique downside of Max's proposal but applies to both solutions. I did not mean to say this is an issue that needs to be fixed.

kosack commented 4 years ago

Perhaps it's useful to look at the simplified data flow just from DL0 → DL1. Right now the situation looks (something) like this, where the order of calling each component goes from left to right: [image]

(Of course, right now we don't do the monitoring interpolation or the realtime calibration coefficient production. In #1163, the image processing is 2 components and the writing is not yet split into a component. But this is the design we are working toward so far.)

The question is: does this all still make sense, and is the container structure designed correctly? (clearly not so far).

It gets more complex when you start to think of further stages:

moralejo commented 4 years ago

@MaxNoe I agree . The old scheme was trying to follow the top-level data model naming scheme, which wasn't really meant to be a storage hierarchy. The only question for me is whether or not to swap the data level and telescope, since separating things by data level is still useful (since you don't need to pass DL0 info on to higher steps, etc). So another option (perhaps a bit more complex) is:

event.dl0.array.telescope (which is not far from what is there now, but perhaps can be simplified)

as opposed to:

event.array.telescope.dl0.

I'm thinking mostly of the use case of parallelization, where you want to be able to send around the least amount of info to a remote process. So if I'm doing the reconstruction step, I only need to "scatter" the dl1 info to multiple processors, and I don't want to also keep the dl0 info. I suppose it depends on where the scatter operation happens, at the telescope event level, or array event level (certainly for reconstruction, it's at the array event level)

I like the "event.array..." approach (Max's) for the event container structure, but I am not sure I understand (probably I don't) why it poses a problem when sending around data of specific levels - the structure does not force one to fill it completely, right? Is it just that you expect it to be much less efficient if you have to strip it of all the unneeded data, e.g. DL0?

FrancaCassol commented 4 years ago

I also like @MaxNoe proposal, one could also think to have r1, dl0 and dl1 at the telescope level and the dl2, dl3 at the array level:

event.array.telescope.r1/dl0/dl1
event.array.dl2/dl3

Just one thing: the "array" name is perhaps a bit misleading (given numpy arrays); perhaps better "cta_array" or even only "cta": event.cta.telescope.dl0, event.cta.dl2 ...

maxnoe commented 4 years ago

@FrancaCassol yes exactly. DL2 / DL3 is array event wise

maxnoe commented 4 years ago

Also, I don't mean event.array

I mean array_event as the top level container.

FrancaCassol commented 4 years ago

yes @MaxNoe, I was looking at Karl's naming scheme, which reflects the standard data model categories. The intermediate level is useful if part of the telescopes are pointing to different targets. I don't know how the data streams will be handled then, but in case they will not be totally separated, one could introduce an "observation/target/subarray" level (call it as you like):

event: with everything concerning the whole array (e.g. atmo, (?))
event.obs[n]: with everything concerning a specific observation/target/subarray (e.g. event.obs[n].dl2)
event.obs[n].tel[m]: with everything concerning a specific telescope (e.g. event.obs[n].tel[m].r1)

kosack commented 4 years ago

event.array.dl2/dl3

There are still telescope-wise data at the DL2 level (but not DL3); recall the discussion at the F2F ASWG meeting.

But really, we don't need any DL3 data in this structure; that won't be done event-wise.

kosack commented 4 years ago

event.obs[n]: with everything concerning a specific observation/target/subarray (e.g. event.obs[n].dl2)
event.obs[n].tel[m]: with everything concerning a specific telescope (e.g. event.obs[n].tel[m].r1)

I don't think we should mix observation-level things in here - all processing is done for a single observation at most, with a single sub-array.

kosack commented 4 years ago

I'd like to keep this as close as possible to the top-level CTA data model, rather than invert it. The point of this conversation was more about how to fix places where we are not consistent, rather than fully re-designing a new data model. I see some nice points of going for "telescope → dl0/dl1..., subarray → dl0/dl1", but at the same time it makes writing and reading the data much more complex (since it will be stored closer to how it is organized now)

maxnoe commented 4 years ago

@kosack I don't think the CTA data model is meant or should be treated as something that should be exactly represented in code. It's about the general concepts and entities, not about how to layout storage in memory or in file.

Different use cases might prefer the one or the other hierarchy in memory or in a file. That does not change how they relate to each other.

I discussed with @FrancaCassol again, and I am much in favor now of using for the in memory representation for the preprocessing until DL2 a version that looks like this:

array_event
  mc
  ...
  telescope_events
      trigger_info
      mc
      dl0
      dl1
      dl2
  dl1
  dl2            

I think this has several key advantages:

but at the same time it makes writing and reading the data much more complex (since it will be stored closer to how it is organized now)

You are referring to single-telescope files? How is this easier with the structure you proposed? I see the same challenges for both, not a general difference.

I think we should make splitting and merging full telescope events easy. Currently this is hard, as you have to update lots of telescope Maps. And one cannot easily see if a map is really mapping telescope ids and not something else.
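With the inverted hierarchy, merging separate telescope-event streams into array events reduces to grouping by event id (a hypothetical sketch; real merging would also have to match trigger times):

```python
from collections import defaultdict


def merge_telescope_events(tel_event_stream):
    """Group (event_id, tel_id, payload) records into array events,
    keyed by event_id, with a tel_id -> payload mapping inside.
    Illustrative only; payloads stand in for TelescopeEvent objects."""
    array_events = defaultdict(dict)
    for event_id, tel_id, payload in tel_event_stream:
        array_events[event_id][tel_id] = payload
    return dict(array_events)


stream = [(1, 1, "tel1_evt"), (1, 2, "tel2_evt"), (2, 1, "tel1_evt")]
merged = merge_telescope_events(stream)
# merged == {1: {1: "tel1_evt", 2: "tel2_evt"}, 2: {1: "tel1_evt"}}
```

Each telescope event is moved wholesale; no per-data-level Maps need to be updated.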

kosack commented 4 years ago

Yes, I think it might be clearer this way, and since the DL1 (and DL2) writer components I'm working on don't really care about the in-memory structure, reading and writing shouldn't be too hard to accomplish.

However, can you explain a bit your diagram/structure above? I'm not sure I get it. For example, trigger_info is something right now that is written by the SWAT, so it's array-level information. Or is that the telescope-trigger information? and then where does the array-trigger go?

Perhaps if you add one-level of detail lower, it will be clear to me.

From the standpoint of data acquisition you will have:

So an in-memory structure that doesn't require too much manipulation to support loading these is useful.

Also, where is the separation of monitoring and event data here? I guess this is all event data (including monitoring parameters interpolated to events?). Would be nice to see also where non-interpolated monitoring info goes in this scheme (it wasn't pretty in the old one either).

maxnoe commented 4 years ago

Or is that the telescope-trigger information? and then where does the array-trigger go?

Ah sorry, that was not really meant to be an exhaustive or even clear diagram. I just wanted to make the point: array_event → list (or mapping) of telescope_events plus array-wide info, and telescope_event → data levels.

Both the array event and the telescope_event will have trigger info: the array with central array information, and the telescope with telescope trigger info (e.g. trigger type, time, fired trigger patches...).

kosack commented 4 years ago

OK, yes that makes more sense. So basically you have:

The same could be used for (non-interpolated) monitoring:

So you'd have something like:

event.tel[n].dl0.waveform
event.tel[n].dl1.image
event.tel[n].dl2.shower.energy # mono
event.subarray.trigger.tels_with_data
event.subarray.dl2.shower.origin
event.subarray.dl2.shower.energy

maxnoe commented 4 years ago

Monitoring is more service-like, isn't it?

So it will probably be an api like:

pedestal = monitoring.camera.get_pedestal(
    tel_id=tel_id,
    timestamp=tel_event.trigger.timestamp,
    strategy='closest',   # or earlier to only allow data known previously
)

and this may then do a database lookup, or just access the values stored in the simtel event source, depending on what kind of data you are analyzing.

kosack commented 4 years ago

Yes, that is what I intended for monitoring, in fact. However, you still need to write out some monitoring data (pedestals computed every few seconds, data quality parameters, etc.), so having a hierarchy would be nice for output. For input, you'd just load up an interpolator.

I think somewhere there is an issue (or presentation?) where I basically wanted to do exactly what you show above for reading monitoring data. Internally, it just loads all the data with Table.read() and constructs an interpolator. Then you just give it the event time to get the real data, and fill it into the event if needed.

kosack commented 4 years ago

One more question: how do we deal with the fact that the list of telescopes changes between data levels? At R1/DL0 you have all telescopes that triggered and read out. At DL1 you might only have parameters for a subset that passed cleaning. At DL2 you only have reconstructions for a further subset that had good parameters, etc. Right now, that is easy to deal with, but in this scheme, we would need to add a way to keep track of this, so you don't loop over things that are missing/dropped in the previous step.

maxnoe commented 4 years ago

Yes, this is the thing that is harder in the current scheme. However, with proper missing-value markers or flags, this should be easy, right?

Or just setting the respective dl1/dl2 containers to None if they weren't found in the file.

We probably should implement something to be able to disambiguate defaults from unset fields in containers. The current behaviour of setting the default for things not found in the input file seems dangerous.
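One way to disambiguate "never filled" from "explicitly set to the default" is a dedicated sentinel object (a sketch only, not ctapipe's Field/Container machinery):

```python
class _Unset:
    """Sentinel distinct from None, since None can itself be a valid
    value (e.g. 'cleaning ran but no parameters survived')."""

    def __repr__(self):
        return "UNSET"


UNSET = _Unset()


class TelRecord:
    """Minimal stand-in for a Container that tracks unset fields."""

    def __init__(self):
        self.hillas = UNSET  # not filled yet

    @property
    def has_hillas(self):
        return self.hillas is not UNSET


rec = TelRecord()
was_unset = rec.has_hillas   # False: field was never filled
rec.hillas = None            # explicitly set: no parameters survived
is_set_now = rec.has_hillas  # True: set, even though the value is None
```

Readers can then tell a missing input-file column apart from a legitimately defaulted value.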

kosack commented 4 years ago

perhaps defining a common place to set an algorithm-specific quality flag is one option, or setting the Container to None if it is not included (but that might cause problems for writing). In fact, this might help with the problem that we might apply multiple algorithms (2 cleanings, for example), in which case the list-per-algorithm method got messy (you'd need 2 lists).

I just want to think about how algorithms will look after this change. For example, before we could just pass a list of HillasParameterContainers to a Reconstructor, which didn't need to know about the event structure at all. In this case, I guess you could just use dict comprehensions to basically go back to the way we stored things before:

params = {tel_id: t.dl1.parameters.hillas for tel_id, t in event.tel.items() if t.dl1.parameters.hillas}
shower = reconstruct(params)
event.subarray.dl2.shower = shower

or you force all algorithms to know about the event structure (less ideal, especially if there is more than one algorithm that produces the output):

reconstruct(event.tel)

here a loop over tel events is made inside the Reconstructor, but then it needs to know to get the parameters from tel.dl1.parameters.hillas. If there are 2 sets of hillas parameters from 2 cleanings (say tel.dl1.parameters_wide_cleaning.hillas, for example), it then becomes complicated.

kosack commented 4 years ago

Also, I guess when you say "list of telescopes", it's really a dict by tel_id; otherwise you have trouble connecting the telescope to the instrument description.

kosack commented 4 years ago

Anyhow, I'd still like to see an example of how this makes the loops for cleaning, parameterization, and reconstruction cleaner than before, since I still feel like it requires a lot of bookkeeping, given that the algorithms should not be too tied to the event structure.

maxnoe commented 4 years ago

Yes, I think we have now seen that there are use cases that are easier and simpler in either format.

So without an actual implementation, discussions will probably not be able to solve this.

maxnoe commented 4 years ago

But especially from the single telescope's perspective, it would be nicer if they could just fill their TelescopeEvent container with all the relevant info, and this could be merged later into an ArrayEvent.

This is also probably close to what real data taking will look like, right? First there are single telescope events that are then merged into array wide data structures.

kosack commented 4 years ago

Yes, I do like that the TelescopeEvent can be separated (at least at the R1-DL0-DL1 level)

vuillaut commented 4 years ago

One more question: how do we deal with the fact that the list of telescopes changes between data levels? At R1/DL0 you have all telescopes that triggered and read out. At DL1 you might only have parameters for a subset that passed cleaning. At DL2 you only have reconstructions for a further subset that had good parameters, etc. Right now, that is easy to deal with, but in this scheme, we would need to add a way to keep track of this, so you don't loop over things that are missing/dropped in the previous step.

I think that all events should be kept until actual selection at DL2 level, as you suggest with a reconstruction quality flag.

maxnoe commented 4 years ago

I think that all events should be kept until actual selection at DL2 level, as you suggest with a reconstruction quality flag.

It's not that easy.

"Keeping all events" is something else than "using all events for stereo reconstruction".

You can make all kinds of (telescope) event selections for the different reconstruction algorithms, which will probably be necessary to achieve a good performance.

I agree we should keep all telescope events, but we also need to have a way to do the reconstruction only on a subset of those events (more than N pixels, less than X leakage, etc).

So already at the DL2 level, you will have discarded some telescope events for some stereo reconstruction. (Or even Dl1 telescope event reconstruction in case of no or just a couple pixels after cleaning)
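Keeping all telescope events while selecting a subset per reconstruction could be a simple predicate applied at reconstruction time (cut names and thresholds here are made-up examples):

```python
def passes_quality_cuts(n_pixels, leakage, min_pixels=5, max_leakage=0.2):
    """Illustrative telescope-event selection; thresholds are arbitrary."""
    return n_pixels >= min_pixels and leakage <= max_leakage


# All telescope events stay in the array event ...
tel_events = {
    1: {"n_pixels": 12, "leakage": 0.05},
    2: {"n_pixels": 3, "leakage": 0.01},   # too few pixels after cleaning
    3: {"n_pixels": 30, "leakage": 0.40},  # too much leakage
}

# ... but only the passing subset feeds a given stereo reconstruction.
selected = {
    tel_id: ev
    for tel_id, ev in tel_events.items()
    if passes_quality_cuts(ev["n_pixels"], ev["leakage"])
}
```

Different reconstruction algorithms could use different predicates over the same, fully retained set of telescope events.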

vuillaut commented 4 years ago

I think that all events should be kept until actual selection at DL2 level, as you suggest with a reconstruction quality flag.

It's not that easy.

"Keeping all events" is something else than "using all events for stereo reconstruction".

You can make all kinds of (telescope) event selections for the different reconstruction algorithms, which will probably be necessary to achieve a good performance.

I agree we should keep all telescope events, but we also need to have a way to do the reconstruction only on a subset of those events (more than N pixels, less than X leakage, etc).

So already at the DL2 level, you will have discarded some telescope events for some stereo reconstruction. (Or even Dl1 telescope event reconstruction in case of no or just a couple pixels after cleaning)

Yes, and I meant keeping them, not using all of them. Those not passing cleaning, for example, are obviously not used. Yet they should not disappear between DL0 and DL1/2, but can be flagged as such.

maxnoe commented 3 years ago

After working on the LST EventSource again and all the calibration steps connected to that (e.g. time_shift), I came to the conclusion that the hierarchy

will be much easier to work with.

Right now, LST defines its own LSTArrayEvent, since it needs out-of-tree telescope-event-wise data. I think it will make many things much cleaner, and I don't yet see a real downside.

Especially the usecase of merging the different telescope streams will be much easier.

kosack commented 1 year ago

@maxnoe In the inverted hierarchy above, can you expand the example to show where all subarray info as well as monitoring goes? I think it will work well, but would be nice to describe it in more detail.

Some other nice points: