cta-observatory / ctapipe

Low-level data processing pipeline software for CTAO or similar arrays of Imaging Atmospheric Cherenkov Telescopes
https://ctapipe.readthedocs.org
BSD 3-Clause "New" or "Revised" License

Discussion on separation of Event, Monitoring, and Service/header data #1041

Open kosack opened 5 years ago

kosack commented 5 years ago

The Problem:

The concept of an EventSource has been quite successful and easy to use in ctapipe, since it allows one to write clear loops over events, where the event data is part of an iterable object. However, we now need to think about other data streams that are not tied directly to each Cherenkov event.

When processing data, we need not only the data for each event, but several other pieces of information as well.

This information arrives at different rates than the events (typically much more slowly). However, some form of synchronization with the event data is still needed.

Special design considerations for Monitoring data

The monitoring (MON) data comes at rates of around 1 Hz, but it is not necessarily correct to simply synchronize the "latest" MON data with the current event: normally one would like to do some interpolation, so how the monitoring data is accessed may need some thought. This means there may be two kinds of MON data: the raw MON stream, and event-wise interpolated values.

These do not need to be stored in the same way; e.g. one might first read the raw MON data, construct an interpolator, and then use it to fill the event.mon info (event-wise interpolated MON data).
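As a concrete illustration of the raw-vs-interpolated distinction, here is a minimal sketch (container and attribute names follow the option examples below; the interpolator itself, built with scipy, is hypothetical):

import numpy as np
from scipy.interpolate import interp1d

# hypothetical raw MON stream: pointing azimuth sampled at ~1 Hz
mon_time = np.array([0.0, 1.0, 2.0, 3.0])        # seconds since run start
mon_az = np.array([80.10, 80.15, 80.21, 80.26])  # degrees

az_interp = interp1d(mon_time, mon_az)

for event in source:
    # fill the event-wise interpolated MON value from the raw stream
    event.mon.tel[tel_id].pointing.az = az_interp(event.time)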

Note also that there are two further kinds of MON data: data read from external files (e.g. pointing logs), and data computed on-the-fly during the event loop (e.g. pedestals; see the discussion below).

For Service Data:

In general this is all static information for a given observation, but in the case of Monte-Carlo files it is possible for it to change in the middle of a file (e.g. if multiple runs are merged). The question is whether we support this changing within a single EventSource loop, or assume it is header info that is read once at the start.

Parallelization and distributed processing

Right now, parallelizing the event loop (e.g. some_parallel_system.map(processing_function, source)) means sending around all of the event, MON, and service data for every event, even though the MON and service info rarely changes. This is probably very inefficient, since it is a lot of data to transmit. Separating the streams makes this much cleaner.
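As an illustration of why separated streams help here (a sketch only: some_parallel_system is stood in for by multiprocessing, and reconstruct, mon, svc, and event_source are hypothetical), the slow data can be shipped to each worker once, while only the event travels per call:

import multiprocessing as mp

_state = {}

def init_worker(mon, svc):
    # runs once per worker process: mon and svc are transmitted a single time
    _state["mon"], _state["svc"] = mon, svc

def process_event(event):
    # only the event is sent for each call
    return reconstruct(event, _state["mon"], _state["svc"])

with mp.Pool(initializer=init_worker, initargs=(mon, svc)) as pool:
    results = pool.map(process_event, event_source)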

Implementation Options

Most of these options involve generalizing EventSources, e.g. to have the EventSource also synchronize MON and SVC data.

This could be done in a few ways:

Option 1

a higher-level DataContainer that contains data.event, data.mon, and data.service sections (e.g. see #1040). One issue with this is that semantically we are no longer looping over "Events", but we could stipulate that the event part is the one to which all others are synched, so it still makes some sense, even if it's clunky.

for data in source:
    print(data.event.mc.energy)
    print(data.mon.tel[i].pointing.az)        # just the "latest"?
    print(data.event.mon.tel[i].pointing.az)  # interpolated
    print(data.service.subarray.tel_coords)

Option 2

we keep only the event part in the iterator loop of the EventSource, and have the mon and svc data accessed from the source itself:

for event in source:
    print(event.mc.energy)
    print(source.service.subarray.tel_coords)
    print(source.mon.tel[i].pointing.az(event.time))
    print(event.mon.tel[i].pointing.az) # interpolated to event time

In this case, it's not clear to the user that the monitoring and service data are synced; e.g. the user needs to re-read the service data if it changes.

Option 3

we always return all three data streams separately:

for event, mon, svc in source:
    print(event.mc.energy)
    print(mon.tel[i].pointing.az)
    print(svc.subarray.tel_coords)

Here, it's clear that you can pass each one to further algorithms, and you might get a pointer to the same info or to new info, and it shouldn't matter.

Option 4

we handle MON and SVC data separately (probably a bit of an ugly solution, and it's not clear this would work for MON data generated during the event loop):

mon_source = MonitoringSource.from_url(x)
event_source = EventSource.from_url(x)
svc_source = ServiceSource.from_url(x)

Here, it's not quite clear what the event loop(s) should look like. I guess only the EventSource would provide an iterator, and the others would just be interpolators or static info.
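Under that assumption, the loop might look something like this (a sketch; read() and at() are invented method names):

svc = svc_source.read()                 # static info, read once up front
for event in event_source:              # only the EventSource iterates
    mon = mon_source.at(event.time)     # interpolator look-up, not iteration
    process(event, mon, svc)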

Any other ideas?

watsonjj commented 5 years ago

One option I have is that the EventSource is no longer responsible for creating the DataContainer. Instead, it is one part of a chain responsible for filling the DataContainer, as are other Components. This makes this option somewhat a combination of option 1 and option 4... but it's probably actually close to something you intended with option 4.

data = DataContainer()  # create the container
event_source = EventSource.from_url(x)
mon_source = MonitoringSource.from_url(y)
svc_source = ServiceSource.from_url(z)
for _ in event_source:
    event_source.load_event(data)  # fill the DataContainer with the event
    mon_source.load_event(data)    # fill it with the monitor information corresponding to the event
    svc_source.load_event(data)

In this implementation, the MonitoringSource and ServiceSource require the event information to be loaded into the data container before the corresponding monitor information can be loaded (and interpolated) into the DataContainer. Perhaps moving the DataContainer creation outside of the EventSource then does not matter.

I guess the idea here is that the highest rate at which the DataContainer is updated is defined by the event rate; therefore the iterator should be over the event source, and the other sources act according to the current event.

The advantage of this over option 1 is ONLY that the class responsible for filling the event information is separate from the one responsible for filling the monitor information. Though really the same could be achieved with a higher-level class that contains all three of the data-filling classes.

kosack commented 5 years ago

I guess in that example, it might be nicer to have the "load_event" happen automatically in the event_source iterator, but have the others be explicit (since otherwise, what is the point of the iteration?)
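I.e. something like (a sketch of that variant):

for data in event_source:          # load_event happens implicitly in the iterator
    mon_source.load_event(data)    # explicit, since it depends on the event just loaded
    svc_source.load_event(data)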

watsonjj commented 5 years ago

I am not a fan of option 2: accessing monitor information from a Component becomes difficult unless it's explicitly passed, and it has the potential for ambiguity.

kosack commented 5 years ago

Though I don't like that the user has to do so much just to get the data; it should be made as fool-proof and automatic as possible. However, I do like the idea that filling monitoring info could be just a Component (which could interpolate or otherwise).

kosack commented 5 years ago

I am not a fan of option 2: accessing monitor information from a Component becomes difficult unless it's explicitly passed, and it has the potential for ambiguity.

Agreed - it's also not so easy to allow for changing runs.

watsonjj commented 5 years ago

Though I don't like that the user has to do so much just to get the data; it should be made as fool-proof and automatic as possible. However, I do like the idea that filling monitoring info could be just a Component (which could interpolate or otherwise).

Then it sounds like the way to go with this option should be:

class DataFiller:  # name TBD
    def __init__(self, event_url, mon_url):
        self.event_source = EventSource(event_url)         # responsible only for filling the event container
        self.mon_source = MonitorSource.from_url(mon_url)  # responsible only for filling the monitor container
        # ... other sources

    def __iter__(self):
        data = DataContainer()
        self.event_source.fill_container(data)
        self.mon_source.fill_container(data)  # interpolates according to the current event in the container
        yield data  # (sketch: a full implementation would loop and yield each filled container)

data_filler = DataFiller(event_url, mon_url)
for data in data_filler:
    print(data.event.mc.energy)
    print(data.mon.tel[i].pointing.az)        # just the "latest"?
    print(data.event.mon.tel[i].pointing.az)  # interpolated
    print(data.service.subarray.tel_coords)

Alternatively the DataFiller could be defined as:

class DataFiller:  # name TBD
    def __init__(self, event_url, mon_url):
        self.event_source = EventSource(event_url)         # responsible only for filling the event container
        self.mon_source = MonitorSource.from_url(mon_url)  # responsible only for filling the monitor container
        # ... other sources

    def __iter__(self):
        for data in self.event_source:
            self.mon_source.fill_container(data)  # interpolates according to the current event in the container
            yield data

since it doesn't seem to make sense to skip the event_source step anyway, unless we had simulators we might wish to use....

FrancaCassol commented 5 years ago

Hi, it seems to me that an important element steering this choice is the mechanism used to synchronise the event and mon containers (svc is probably not going to change inside a run). The simplest solution at the moment (where no database is available) is to connect them by timestamps and to perform the connection at the event-generator level: each event must know the timestamp of its correct mon container, and in case the present mon container is outdated, the generator simply reads the correct one (in the absence of a fast random file access mechanism, this should just be the next mon event in the file).
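A minimal sketch of that mechanism (all names hypothetical), where the generator holds the current mon container and reads the next one from the file whenever the event timestamp has passed it:

def synced_events(event_stream, mon_stream):
    mon = next(mon_stream)
    next_mon = next(mon_stream, None)
    for event in event_stream:
        # the present mon container is outdated: read the next mon event in the file
        while next_mon is not None and event.time >= next_mon.time:
            mon, next_mon = next_mon, next(mon_stream, None)
        event.mon = mon
        yield event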

So, given the present situation (where archive tools are still rather absent), I would prefer to give a bigger role to the event source.

So probably this is more option 1?

kosack commented 5 years ago

One thing we can start with is just agreeing on what the monitoring data is: is it per-event (e.g. a slower monitoring stream that has been interpolated to give the right value for each event), i.e. event.mon, or does it change slowly, i.e. data.mon?

I kind of prefer the event-wise monitoring, since we can then read the "slow" monitoring stream any way we like, and just update the event version for each event using a Component. In that case, we only have event.mon (as now), and it really is different per event. If we want to write it out again, of course, some down-sampling in time would be necessary.

I could imagine the loop looking like:

# reads a file and just constructs an interp function for each telescope
pointing = PointingInterpolator.from_file("tracking.hdf5") 

# Pedestals are computed on-the-fly, not from a file, so each time it is called, it might
# measure some data from the event, update the pedestals, and output the latest per-event version
pedestals = PedestalMeasurer()

for service, event in event_source:
    event.mon.pointing = pointing(event.time)
    event.mon.pedestals = pedestals(event.time, event.dl1)

kosack commented 5 years ago

For things like the pedestals, we also have to think about adding an event buffer of some kind, so the calibration can essentially run n events ahead of the reconstruction (e.g. so that it can already gather enough statistics to compute the pedestals for the first few events). Or we always do multi-pass processing, where we first compute pedestals and gains, then in a second pass read them in (as with the pointing example above), and loop a second time to apply the calibration and do the reconstruction.

I think the multi-pass system is what Franca has implemented already, but it wouldn't be too hard to do both (though supporting on-the-fly computation of calibration of course adds complexity).
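A sketch of the buffer idea (hypothetical names): events are delayed by n, so that the pedestal statistics already include an event's neighbourhood by the time it is reconstructed:

from collections import deque
from itertools import islice

def with_lookahead(source, pedestals, n=1000):
    source = iter(source)
    buffer = deque()
    for event in islice(source, n):   # pre-fill: gather statistics first
        pedestals.update(event)
        buffer.append(event)
    for event in source:
        pedestals.update(event)
        buffer.append(event)
        yield buffer.popleft()        # reconstruction runs n events behind
    yield from buffer                 # drain the remaining events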

watsonjj commented 5 years ago

The on-the-fly approach means we do not need to try to match the calibration to the events; we instead rely on the latest calculated values. It also avoids having two loops through the data stream (assuming the calibration events are interleaved among the Cherenkov events).

I believe the workflow Franca has been working on uses the on-the-fly approach... she utilises the concept of interleaved calibration (pedestal & flat-field) events, and updates the calibration coefficients as she progresses through the observation. She uses a flag inside the event container to identify the type of event.

In Franca's design, there is also the possibility of creating a different calibration-calculator subclass that instead reads the correct calibration coefficients from a file (perhaps created with a separate pipeline, i.e. the two-pass approach).

In the case of an on-the-fly calibration, it might make sense that at the start of each observation a bunch of calibration events are taken, to build up the statistics of the calibration. This means the same, single pipeline can be relied upon. Alternatively, the class loads the latest coefficients to begin with. The great thing about Franca's design is that we are not restricted in how the calibration is performed, as long as it obeys the principle of filling the mon container for observation events so that the coefficients can be applied.

For things like the pedestal and flat-field (and other calibration parameters), I don't think we should force them to be written to the event/data container at the start of the pipeline. This would complicate how we handle the calibration. These coefficients depend on factors such as the charge extraction method that is used. Therefore, it would be better for the coefficients to be calculated by following the normal pipeline up to charge extraction, and then using the result of the charge extraction to populate the calibration coefficient calculation classes, ensuring compatibility.

The following is how I imagine an on-the-fly calibration would work:

  1. The calibration events are identifiable with a flag
  2. The calibration events follow the same pipeline as observation events
  3. The last stage of the pipeline for a calibration event is filling the calibration calculator class, after the charge extraction step
  4. If the event is instead an observation event, the calibration coefficients are stored into the data container by the calibration calculation class. The event itself does not contribute to the coefficients.
  5. The observation event continues down the pipeline, where a separate class applies the calculated calibration coefficients stored in the event container.

Similarly, for a two-pass calibration, the calibration events would follow the same pipeline until charge extraction; then the values are extracted and stored.

This approach ensures that the correct calibration coefficients are used for the specified pipeline configuration. I think it also makes the pipeline easy to follow and reduces the chance of mistakes.
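A sketch of steps 1-5 in code (every name here is invented for illustration):

for event in source:
    calibrate_r1(event)                 # shared early pipeline stages
    charge = extract_charge(event)      # shared charge extraction (step 2)

    if event.is_calibration:            # step 1: a flag identifies the type
        calib_calculator.fill(charge)   # step 3: calibration events feed the calculator
    else:
        event.mon.calibration = calib_calculator.current()  # step 4: store coefficients
        apply_calibration(event)        # step 5: a separate class applies them
        reconstruct(event)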

Regarding the decision between event.mon and data.mon, I am more inclined towards the second. It may be that the calibration coefficients are updated at the same rate as the events, but only if the telescope chooses to do something like interpolate or utilise a part of the waveform without signal. A telescope may choose to update the calibration only every second, having deemed this sufficient and seeing nothing gained from the complication of interpolating between calibration coefficients. There is nothing wrong with data.mon being updated at the same rate as data.event, but it is not something we should force: we should stick to the original reason for the separation of event and mon, which is the possibility of them being updated at different rates.

FrancaCassol commented 5 years ago

Thanks for this precise description @watsonjj, indeed this is what we are planning for LST online calibration. In this discussion I think we should remember that calibration will be an iterative process: the on-the-fly calibration is meant to feed the RTA pipeline and produce the first DL0/DL1 data to be written on disk. Most probably, there will afterwards be a finer calibration based on a multi-pass process and interpolation, for optimised quality. Therefore, we should think of a data model that is suitable for both approaches.

My major concern at this point is that for the prototype data we should be able (now!) to easily connect the calibration data to the event data, without the possibility of querying an archive system that is not available yet. Therefore, whatever we choose, we should keep in mind that the connection will be done (at least for a while) at the ctapipe code level. As a first guess I thought to give the event_source the role of "connector", and then option 1 seemed the most natural to use. But the DataFiller proposed by Jason above (the second option?) would also play this role. The message here: together with the data model, we should remember that we need a data-level "connector" in order to build something immediately usable by the prototypes.

FrancaCassol commented 5 years ago

Hi, restarting this discussion on the basis of the example proposed by Jason above (which seems a good option to me):

class DataFiller:  # name TBD
    def __init__(self, event_url, mon_url):
        self.event_source = EventSource(event_url)         # responsible only for filling the event container
        self.mon_source = MonitorSource.from_url(mon_url)  # responsible only for filling the monitor container
        # ... other sources

    def __iter__(self):
        data = DataContainer()
        self.event_source.fill_container(data)
        self.mon_source.fill_container(data)  # interpolates according to the current event in the container
        yield data  # (sketch: a full implementation would loop and yield each filled container)

data_filler = DataFiller(event_url, mon_url)
for data in data_filler:
    print(data.event.mc.energy)
    print(data.mon.tel[i].pointing.az)        # just the "latest"?
    print(data.event.mon.tel[i].pointing.az)  # interpolated
    print(data.service.subarray.tel_coords)

The DataFiller could be a plugin Component, as the EventSource class is now, so that for raw (or telescope-specific) data analysis each telescope can define its own DataFiller and possibly its own DataContainer (inheriting from the general DataContainer), e.g.:

class LSTDataContainer(DataContainer):
    """
    Data container including LST information
    """
    lst = Field(LSTContainer(), "LST specific information")

This would permit immediate use of this approach with the prototype data. For MC data and higher data levels, a more telescope-independent DataFiller should be provided directly in ctapipe. What do you think?

moralejo commented 5 years ago

A couple of comments:

FrancaCassol commented 5 years ago
  • Franca, what is the part you want to move to a telescope-specific plugin component? Just the generation of the "data.mon"? The interpolation (or any other simpler approach to fill event.mon with reasonable values) should be part of ctapipe, right?

@moralejo, I think you misunderstood me. Let me reply to the question by explaining what my problem is right now: I would like to treat MC calibration data (flat-fields and pedestals) in the same chain as the real data. For that, the data.mon containers must not be initialised by the event reader (as is done now in the LST io plugin), because the event reader for MC data is different from the one for real data. Therefore we need a DataFiller (or another solution) that creates and fills the data model, choosing the right sources, which depend on both the data type (MC or real) and the telescope type. This DataFiller could be a Component class in ctapipe (and each telescope would probably create its own subclass), or a plugin in the ctapipe_io of each telescope; I don't know which is better...

The point here is: for LST (and NectarCAM, which will take data in June) it is urgent to converge rapidly in this discussion and start to implement the solution we think is best. This will also permit pushing the monitoring containers into ctapipe in a clean environment and allow their use, testing, and improvement by other telescopes.

moralejo commented 5 years ago

I would like to treat MC calibration data (flat-fields and pedestals) in the same chain as the real data. For that, the data.mon containers must not be initialised by the event reader (as is done now in the LST io plugin), because the event reader for MC data is different from the one for real data. Therefore we need a DataFiller (or another solution) that creates and fills the data model, choosing the right sources, which depend on both the data type (MC or real) and the telescope type.

Ok, thanks for the clarification. If we follow the same approach as for the "event" data, then the creation and (partial) filling of the standard mon containers for the MC case should be done by pyeventio, and not by a telescope-type-specific plugin. Or do you prefer to separate the creation of the mon containers from the io machinery? For the MC that would probably be a complication, since the mon info (including calibration mon info, in case you want to use the one provided by simtelarray instead of calculating it) is read from the simtel files.

This DataFiller could be a Component class in ctapipe (and each telescope would probably create its own subclass), or a plugin in the ctapipe_io of each telescope; I don't know which is better...

For the real data, you say that the mon containers are initialized by the LST io plugin, and that should still be the case. What is missing, I guess, is the code to fill them (with whatever is not to be calculated during the analysis), once we know where we will retrieve e.g. the LST-1 pointing information from.

FrancaCassol commented 5 years ago

For the MC that would probably be a complication, since the mon info (including calibration mon info, in case you want to use the one provided by simtelarray instead of calculating it) is read from the simtel files.

To clarify again: what I want is to test the LST calibration procedure with MC calibration data (flat-field and pedestals); therefore I need to access and use the mon containers with MC calibration data exactly as I do with real data (even if, obviously, in the case of MC data the events will be read by pyeventio).

For the real data, you say that the mon containers are initialized by the LST io plugin, and that should still be the case. What is missing, I guess, is the code to fill them (with whatever is not to be calculated during the analysis), once we know where we will retrieve e.g. the LST-1 pointing information from.

For the moment the mon containers ARE in the LST plugin, so we have no other choice. When they are moved into ctapipe, we can indeed keep the initialisation in each telescope plugin by means of a DataFiller plugin or similar. @kosack, what do you think? Could we perhaps plan a discussion on this point at the next ctapipe meeting (after the next ASWG)?

kosack commented 4 years ago

If we want to follow the current data model I presented in Bologna, we should really separate four things:

  1. metadata headers:
    • simulation (what is in the MCHeaderContainer + other simulation header info, like the MCThrownShowerDistribution, etc.)
    • observation (right now we don't have this in sim data, but we will need it)
    • instrument: at least the SubarrayDescription, other things?
  2. event data
  3. monitoring data (things that change far slower than events, but still change)
  4. service data (things that do not change during an observation, but may be needed for analysis)

As discussed a bit in #1041, at least some of this should be loaded automatically by the EventSource (at least event + the header info, perhaps not the monitoring or service by default since those can be loaded later on demand).

So if you want to implement all that here, let's decide what it should look like, i.e.:

source = EventSource(input_url=x)   

Already here, the input can be a single file (events) or, in the case of most prototypes and probably all real data, a set of files: either we assume there is one "index" file that contains links to the others, like in gammapy; or we accept a directory here; or we assume there is always a single file and use the features of e.g. HDF5 to generate a file that contains linked datasets. How do LSTChain and NectarChain handle this?
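For the linked-datasets variant, HDF5 external links would allow a single entry-point file that refers to the others; a sketch with made-up file and group names:

import h5py

with h5py.File("observation_index.h5", "w") as f:
    f["events"] = h5py.ExternalLink("events.h5", "/dl1/events")
    f["monitoring"] = h5py.ExternalLink("monitoring.h5", "/mon")
    f["service"] = h5py.ExternalLink("service.h5", "/svc")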

Then there is the use of source:

source.instrument.peek()  # would be nice to be able to do this before the event loop
print(source.simulation.mcheader)
print(source.simulation.thrown_shower_distribution) # in SimTel at end of file only, so not easy to do. In the HDF5 DL1, this could work though
print(source.observation) # implement only when we define what is in a Observation

That means SimTelEventSource would perhaps have to do some trickery to make sure this is loaded, but it shouldn't be hard.

Should classes that need subarray info get it at construction time, or at call time?

calibrate = CameraCalibrator(..., subarray=source.instrument)  # ?

The event loop itself:

for event in source:
      event = calibrate(event)
      event = parameterize(event) 
      # or 
      event = parameterize(event, subarray=source.instrument)

Which also brings up: is there anything in instrument other than the subarray?

maxnoe commented 4 years ago

In fact-tools, we have a Service approach to this. These are basically classes that provide information given some key, e.g. a GainService gets a timestamp and returns the gains for each pixel.

This would roughly look like this:

mon_service = MonService(data_files)
calibrator = CameraCalibrator(mon_service=mon_service)

for array_event in source:
    for tel_event in array_event.telescope_events:
        calibrator(tel_event)

And inside __call__, CameraCalibrator would do something like this:

gains = self.mon_service.get_gains(tel_event.telescope_id, tel_event.timestamp)
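A minimal sketch of what such a service might look like internally, assuming a hypothetical read_gain_tables helper that returns per-telescope timestamps and (n_times, n_pixels) gain tables:

import numpy as np

class MonService:
    def __init__(self, data_files):
        # read per-telescope gain tables keyed by telescope_id (hypothetical reader)
        self.times, self.gains = read_gain_tables(data_files)

    def get_gains(self, telescope_id, timestamp):
        t = self.times[telescope_id]   # shape (n_times,)
        g = self.gains[telescope_id]   # shape (n_times, n_pixels)
        # linearly interpolate each pixel's gain to the requested timestamp
        return np.array([np.interp(timestamp, t, g[:, pix])
                         for pix in range(g.shape[1])])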