cta-observatory / ctapipe

Low-level data processing pipeline software for CTAO or similar arrays of Imaging Atmospheric Cherenkov Telescopes
https://ctapipe.readthedocs.org
BSD 3-Clause "New" or "Revised" License
65 stars 269 forks source link

Instrument information for custom file formats #600

Closed watsonjj closed 3 years ago

watsonjj commented 6 years ago

With the inclusion of custom file format readers, the need for a instrument information database is even higher. Information like the pixel positions are currently read from the hessio files, however these custom event file formats do not contain instrument information.

For targetpipe I get around this by storing the CHEC pixel coordinates in a numpy file, and reading that as part of the reader. However, this approach is not appropriate for every custom file format readers as multiple cameras may be using the same custom file format, but require different pixel positions.

This highlights the need to develop the handling of instrument information in ctapipe. I also believe the event files should contain a camera identifier (e.g. "CHEC") inside the file, so the correct instrument information can consequently be loaded.

watsonjj commented 6 years ago

An additional consideration may need to be taken concerning pixel positions.

The ordering of pixel positions can change. The order of pixels in simulations of CHEC are different to that of the real data. The simulations will be updated to reflect the true pixel ordering in the future, but then if the information is read from a database, old MC files will have the incorrect pixel mapping applied.

Additionally, the hardware of the cameras may be updated which could change pixel mapping.

I have thought of two ways this problem can be avoided:

  1. The pixel positions for the run file are stored inside each file (as they are done for MC at the moment). Therefore the correct mapping is always used. However this is a huge duplication of data.
  2. The database has a system of versions for the instruments. But it would seem that the version can depend on date and on data source (MC or real). And each file should contain a keyword that correctly represents the version of the data inside.
calispac commented 6 years ago

I prefer solution 2. It provides hardware info without the need of an event while avoiding duplicates. The Monte-Carlo people could use these files in order to fix the inconsistence in pixel mapping. Indexing by version seems correct. The SST-1M zfits writer is using a config file for the mapping. We plan to have the version of this file written in the event data header.

calispac commented 6 years ago

Concerning the implementation of hardware descriptors. Since there are various cameras in CTA I guess making a unique CameraDescription class is quiet challenging.

  1. So should this class be an abstract class from which we should define what all cameras have such as? :

    • Pixels (id, vertices, coordinates)
    • Modules (id, vertices, coordinates)
    • Patch/ Trigger pixels (id, vertices, coordinates)

    A module could be a hardware grouping of pixels. E.g. FADCModule (group of pixels from the same FADC board) or e.g. PDPModule (group of pixels that share the same pdp parts). Patch/Trigger pixels is a grouping of pixels used for the trigger decision.

  2. Or should we let the camera teams to write there own CameraDescription in which they all should read from the same file format ?

kosack commented 6 years ago

Well, from a pipeline standpoint, we don't really care about camera modules. In the data model that was originally made in the old Data Management group, those things were separated. There was CameraGeometry (with just what is needed for reconstruction) and something like a CameraHardwareDescription that gave the other details. We should have one of those. I'm against having it a class hierarchy though - just a separate class is fine.

I'm not sure that has to be very camera-specific:

Do you really need to know the vertices of a module, if you know which pixels belong to it? Same for patches? You have the pixel positions already.

kosack commented 6 years ago

We could also go to deeper level, if needed (e.g. for the MC Configuration Builder), where we have a set of relationships (not a hierarchy) that defines e.g. a CameraTriggerDescription and a CameraModuleDescription, with keys being the pixel_id, module_id, and patch_id that link all quantities between them and the CameraGeometry, and then they can contain other info like thresholds, etc. That was the original idea in the ObsConfig model.

If that becomes useful, in ctapipe, we can change the subarray description hierarchy (again, not a class hierarchy, but composition) to be:

That way you can start from any point and access other info: from a trigger module, access which pixels in the CameraGeometry are in them, or go from CameraGeometry to which module is associated with each pixel, etc...

Here's the relevant part of ObsConfig in UML form:

image

calispac commented 6 years ago

Indeed vertices is not really needed except if you want to display something else than cherenkov images. I don't know if this belongs here but for instance one could :

  1. display the trigger rate by trigger patch (to detect noisy components for instance).
  2. display some slow control data that is a single value by group of pixels. (e.g. temperature over a PDP module)

There is some functions in scipy that for a set of vertices gives the outer vertices. https://docs.scipy.org/doc/scipy-0.18.1/reference/generated/scipy.spatial.ConvexHull.html

Modules to my opinion are just groups of pixels that share a common hardware component. In that sense there could many kind of modules per camera. I can only give an SST-1M example here :

  1. PDP module (12 pixels on same front-end board)
  2. FADC module (48 pixels on same FADC board)
  3. Crate module (432 pixes on same Trigger board)
calispac commented 6 years ago

That way you can start from any point and access other info: from a trigger module, access which pixels in the CameraGeometry are in them, or go from CameraGeometry to which module is associated with each pixel, etc...

This is a feature that would be helpful !

kosack commented 6 years ago

Ok, so it seems like some kind of general groupings of pixels are needed, and that part may be very camera-dependent (except for the common ones like module and trigger).

kosack commented 6 years ago

By the way, this is not only useful in ctapipe, but it must also somehow end up in the "Central Configuration Database", since you also need those mappings when operating the telescope (the operator should be able to overlay these in the GUI), and when doing quality checks, to see where a fault is, and also when doing maintenence.

Therefore these concepts need to be translated also into the overall CTA configuration DB data model. No idea who is in charge of that though, or how to go about it! Probably closer related to the OES efforts, so I'll check with them.

dneise commented 6 years ago

Could you remind me what OES stands for?

kosack commented 6 years ago

Could you remind me what OES stands for?

Sorry, CTA has too many abbreviations :-) It's the Observation Execution System, which is the top-level system that includes things like the array control and scheduler.

dneise commented 6 years ago

There is some functions in scipy that for a set of vertices gives the outer vertices. https://docs.scipy.org/doc/scipy-0.18.1/reference/generated/scipy.spatial.ConvexHull.html

I think for hexagonal pixels the convex hull of some groups of pixels is not really what one wants. Take these 5 pixels (red outlines) for example. The convex hull is the blue outlined area. path4138-3-6-5

dneise commented 6 years ago

Jason wrote:

I also believe the event files should contain a camera identifier (e.g. "CHEC") inside the file, so the correct instrument information can consequently be loaded.

and

Additionally, the hardware of the cameras may be updated which could change pixel mapping. I have thought of two ways this problem can be avoided:

  1. The pixel positions for the run file are stored inside each file (as they are done for MC at the moment). Therefore the correct mapping is always used. However this is a huge duplication of data.
  2. The database has a system of versions for the instruments. But it would seem that the version can depend on date and on data source (MC or real). And each file should contain a keyword that correctly represents the version of the data inside.

I believe this hits a very important point for analysis and simulation software packages. In my opinion it is paramount to clearly identify the instrument the events in a file refer to, for both simulation software packages and real telescope DAQ software packages (I do not know the exact nomenclature in CTA). By "clearly identify an instrument" I mean of course the entirety of the instrument. A string like "LST" is clearly not enough, since during the design phase of an experiment like CTA there are obviously many different versions of LSTs being simulated. Finding a simulated file 3 months after it was created and not clearly knowing which instrument it is simulating is equal to deleting the file right after creating, maybe it is worse, because it wastes time of a scientist trying to make sense of it.

I have seen this being a problem in other projects. For CTA I am utterly ignorant of the simulation packages but from what Jason says, I deduce that at least some of the files do not clearly say what instrument they are simulating. This is a major fundamental problem, which in my opinion cannot be solved by ctapipe.

The same is of course true for events taken with real instruments e.g. during commissioning. There is an equivalent notion of "versions" of instruments, sometimes also called "epochs". For example a file taken with the SST1M prototype 3 month ago and today might be taken with different versions. As Jason already said: The order of pixels might have changed. Consequently it is absolutely important to clearly identify the version of the instrument the file was taken with, and since the order of pixels may be altered by the DAQ software before actually storing the events to disk, the software is a part of this instrument version.

Ah here is the part I meant:

The order of pixels in simulations of CHEC are different to that of the real data. The simulations will be updated to reflect the true pixel ordering in the future, but then if the information is read from a database, old MC files will have the incorrect pixel mapping applied.

[high lighting by me]

From this statement I conclude that the files created by the simulation packages do not contain a clear identifier, which enables the reader to deduce the what pixel mapping should be applied.

Do I understand that correctly?

dneise commented 6 years ago

I think the question if the file contains a "complete description" of the instrument it was created with or if it only contains a "pointer" to this complete description is an implementation detail.

I modify Jason's way of saying it a bit, but generally I understand him this way:

  1. The [instrument description is] stored inside each file. Therefore the correct mapping is always used. However this is a huge duplication of data.
  2. The database has [all instrument descriptions ...,] each file should contain a [pointer to its instrument description].

As I said, I think these two options are implementation details. The former has a larger footprint, the latter needs more infrastructure. ctapipe has to my understanding no influence on how the event files are created, this is in the hands of those who develop and maintain the simulation software and/or the data acquisition software.

However, I would like to add one comment regarding the database. In my understanding databases are good for multi-user-write & multi-user-read. Web-shops and Web-forums are applications, which would be very hard to implement without the use of databases. Looking up an "instrument description" using some "unique pointer" does not need a database. This is what python dicts do already pretty good. I am trying to say here, that talking about a database sounds to me like deciding for a very specific implementation to solve a certain problem or provide a certain feature, while other less costly implementations are not looked at.

In particular: If those people who write the files agree, that storing the complete description into each event file is consting too much disk-space, and if no infrastructure for a CTA instrument database is available, then we seem to be stuck. But we are not stuck. All we need is some form of "file" with all known instrument descriptions have been produced so far. All camera teams can provide their own files, if they like. All simulation teams can provide their own files if they like.

cta pipe, can either come with the latest copy of these files, and we ctapipe devs can make bi-weekly releases of ctapipe with updated versions of the instrument description files. Or readers can download these files when they need them and cache them.

I have the impression that launching CTA-wide infrastructure can be quite slow. With these instrument-description-library-files, everybody can just start.

watsonjj commented 6 years ago

I think would say that your understanding is mostly correct.

One correction I have is that the current situation in ctapipe is that the pixel mapping is read from the hessio file, as it is stored inside by the simulation software. Therefore any changes to the pixel mapping in the simulation will be correctly used in ctapipe for now.

However the pixel mapping is not necessarily stored inside the file for other formats, which is another reason that calls for the need of pixel mapping to be stored in a database which is usable in any situation, independent to the file format.

But once we transition to use a database, it should include versioning in case a parameter such as pixel mapping is changed in the future.

kosack commented 6 years ago

ONe suggestion: we already have the ability to write out the pixels and read them back in using CameraGeometry.to_table().write("output.fits") which also works for ECSV (text) files and HDF5 files. Why not just write the camera definition in one of those formats and include it with your raw data files? It's not as ideal as having it inside the file, but should still work fine. If TargetIO files are FITS-based (I think I recall that they are), you can just include it as a separate HDU even, and have it inside the file.

kosack commented 6 years ago

This somewhat relatest to the Container cleanup in #722

maxnoe commented 3 years ago

This issue is solved since we now require that EventSource has the correct subarray including all the camera descrptions.

So it is up to the individual event source implementations how this is filled.