LDMX-Software / ldmx-sw

The Light Dark Matter eXperiment simulation and reconstruction framework.
https://ldmx-software.github.io
GNU General Public License v3.0

Unpacking Raw Data #998

Closed tomeichlersmith closed 3 years ago

tomeichlersmith commented 3 years ago

The notes below are stale and should not be used as reference. Poke Tom for the most up-to-date implementation; things are moving fast, and we will try to catch up on documentation after the test beam actually occurs.

I have recently started acquiring raw data files from an HGC ROC and I'm interested in developing an interface between them and ldmx-sw. This will (probably) be required for test beam work coming up within a few months.

Design

Within a new module (name TBD, maybe Raw, but that seems lame), I will have a single producer called Unpacker which will be given a raw file as if it had been read off the detector. This single file will contain the data from all the subsystems, and the different "blocks" of data will be labeled by a header containing a detector ID that identifies which subsystem each block came from.

Since each block of data is labeled by an ID of some sort, the Unpacker will then call separate subroutines for the different types of blocks that need to be decoded (probably according to the chip that is doing the readout). These subroutines would then individually produce the Digi C++ objects from the block of ones and zeros given to them by Unpacker.
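
For concreteness, one could imagine each labeled block looking something like the sketch below; the field layout is purely an assumption for illustration, not a settled format.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical shape of one labeled block in the raw file: the header's
// detector ID tells Unpacker which decoding subroutine to dispatch to.
// The field layout here is an illustrative assumption, not a real format.
struct RawBlock {
  uint32_t detector_id;          // which subsystem produced this block
  std::vector<uint32_t> words;   // the packed payload for that block
};
```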

Now the question is what is the format of the raw file "as if it had been read off the detector". Going all the way down to raw binary files (non-ROOT files) is "dangerous" because the different subsystems have different streams of data creating their own raw files. Here, I am going to write a simple "merging" python script that translates the separate binary files into branches of an LDMX_Events tree filled with the encoded binary data (e.g. full 32-bit words). The output file from this script can then be provided as an input file to the Unpacker, which will unpack these encoded words into their Digi C++ objects.

Why not have this "merging" python script also decode the words? We could, but it is dangerous. It will be helpful for validation to have each step of the processing chain (optionally) saved so we can check and re-check that things are working as intended.

Why this Design?

I see the Unpacker as a new data "source", so it fills a role similar to SimCore's Simulator. A new module highlights the fact that this "source" of data in ldmx-sw will be used by all subsystems in the same way. Since the Unpacker needs to handle all of the different types of chips that ldmx-sw uses to read out our subsystems, giving it its own module also lets the subroutines that actually do the unpacking be spread out into files (and maybe classes?) for each chip. This centrality also allows the Unpacker to handle the header information that should be put into the EventHeader or RunHeader (depending).

Moreover, further in the future, we may want to compare simulated data to real data at the raw level. These subroutines could have both unpack and pack routines, so we could have both an Unpacker and a Packer.

@jmmans @bryngemark @omar-moreno I want to hear from you. Does this design sound like it could handle LDMX's needs?

jmmans commented 3 years ago

This comment is not a reply to Tom's entry above, exactly, but is instead a new subthread on the topic of compact and complete electronics ids.

As a general concept, 'electronics ids' represent the idea of the pieces of the electronics chain associated with a given cell.

The 'compact' electronics id contains only the information needed to carry out the transformation between readout space and detector id (logical) space. For example, in the ECal case the minimal electronics id information is the optical fiber number the data arrived on, the elink number within the packet from that fiber, and the channel number within the elink.
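
As a sketch of what a compact id could look like in code (the bit-field widths are illustrative assumptions, not the actual format):

```cpp
#include <cstdint>

// Hypothetical compact electronics id for the ECal case: fiber, elink,
// and channel packed into one word. Field widths are assumptions chosen
// for illustration only.
class CompactElectronicsID {
  uint32_t value_;
 public:
  CompactElectronicsID(uint32_t fiber, uint32_t elink, uint32_t channel)
      : value_{(fiber << 16) | (elink << 8) | channel} {}
  uint32_t fiber() const { return (value_ >> 16) & 0xFFFF; }
  uint32_t elink() const { return (value_ >> 8) & 0xFF; }
  uint32_t channel() const { return value_ & 0xFF; }
  uint32_t raw() const { return value_; }  // usable as a dense array index
};
```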

The 'complete' electronics id contains additional information which is not strictly necessary for data unpacking (and may not be available at the time). However, the information may be valuable or even critical for other operations including preparing calibration tables or analyzing detector data for tuning purposes. For example, the HGCROC id (which is not the same as the elink id) is very important when correlating with detector effects.

The 'compact' electronics id and the map in this form are needed only during packing/unpacking and not during other operations. However, access is time-critical as it is needed for the primary reconstruction sequence. Internally, the best format for the map is likely a fully-unrolled O(1) tightly-packed index. However, the map in this form is quite inscrutable and somewhat memory-intensive if a large amount of information is stored.
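
A minimal sketch of such an unrolled index, assuming the compact id packs into 24 bits as in the sketch above (both the width and the DetectorID stand-in are assumptions):

```cpp
#include <cstdint>
#include <vector>

using DetectorID = uint32_t;  // stand-in for the real logical id type

// Illustrative only: the compact map fully unrolled into a flat array
// indexed by the compact id's raw value. Lookup is O(1) on the
// time-critical path, at the cost of memory and readability.
std::vector<DetectorID> build_unrolled_map() {
  std::vector<DetectorID> table(1u << 24, 0);  // 0 == unmapped
  // table[compact.raw()] = detector_id;  // filled from sub-maps
  return table;
}
```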

The complete electronics map information can be constructed best using sub-maps. For example, the connection between cells and HGCROC channels on an Ecal sensor is the same on every module. The elinks *from* these HGCROCs will have different identities depending on where in the detector the module is located, but the mapping is hierarchical, not arbitrary. Therefore, a multi-level mapping (one table for associations on a module, another identifying which modules feed which elinks on the Polarfire) makes sense, which would not necessarily be fully unrolled in memory.
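
A rough sketch of how two such sub-maps could compose (names and types here are hypothetical, and the real composition would involve more than one elink per module):

```cpp
#include <cstdint>
#include <map>

struct ElinkChannel { uint32_t elink; uint32_t channel; };

// shared by every Ecal module: cell index -> HGCROC channel
std::map<uint32_t, uint32_t> cell_to_roc_channel;
// detector-specific: module id -> elink that module feeds on the Polarfire
std::map<uint32_t, uint32_t> module_to_elink;

// compose the two levels at initialization instead of storing one
// monolithic, fully-unrolled ASCII table
ElinkChannel lookup(uint32_t module, uint32_t cell) {
  return {module_to_elink.at(module), cell_to_roc_channel.at(cell)};
}
```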

Based on these thoughts, I propose that the unpacking map (at least for Ecal) should be constructed in-memory from appropriate sub-level maps. The unpacking map should not have a separate (ASCII) identity.

tomeichlersmith commented 3 years ago

After talking with @jmmans about how the (un)packing layer will be handled, I have formulated a more detailed plan on what to do next. My goal is to avoid having to define a new event bus object that must be compiled separately from the rest of ldmx-sw (so it is more easily available to the front-end), but I'd also like the raw event bus object to be "compact" in the sense that there isn't a separate event bus object for each stream of data.

Raw Data Structure

My solution is to have a simple std::map of std::string to std::vector<unsigned char>.

The key would be a unique identifying name for the origin of the data stream, probably containing the subsystem or chip name (e.g. EcalPrecisionReadout or something) and the value would be the buffer containing all the raw/packed data for that stream for a single event including the header of the data stream. I am not overly committed to unsigned char being the content-type of the buffer, so if anyone else has another suggestion, I'd be happy to hear it.
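
In code, the proposed event object would be something like the following (the stream names and buffer contents are examples only):

```cpp
#include <map>
#include <string>
#include <vector>

// one byte buffer of packed data per stream, keyed by a name
// identifying the stream's origin
using RawDataMap = std::map<std::string, std::vector<unsigned char>>;

RawDataMap build_event() {
  RawDataMap raw;
  // each buffer holds the full packed payload (header included)
  // for its stream in a single event; contents here are placeholders
  raw["EcalPrecisionReadout"] = {0xAA, 0xBB, 0xCC, 0xDD};
  raw["TrigScint"] = {0x01, 0x02};
  return raw;
}
```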

1. Merging

As outlined above, the first step would be building this raw data structure from the multiple data streams coming from our different subsystems. Since ROOT has internal support for std::map and std::vector, a "readout node" would only require ROOT and a simple program to merge these data streams (not a fully functional install of ldmx-sw). This merging would simply consist of "popping" the latest event off each of these data streams and constructing this std::map. I see this as the "minimal" amount of knowledge necessary for this light program.

I am going to start by writing a quick-n-dirty python script to do this merging with the simple HGC ROC raw files I already have. In the future, we will probably want a more robust program.
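
The per-event logic is small; here is a hedged sketch (in C++ to match the other examples here, with a stubbed-out pop_event helper that is purely hypothetical):

```cpp
#include <map>
#include <string>
#include <vector>

// hypothetical stand-in: read the next event's packed payload out of
// one stream's raw file (real logic depends on each stream's format)
std::vector<unsigned char> pop_event(const std::string& /*stream_file*/) {
  return {};  // stub
}

// pop the latest event off each stream and key it by the stream's name
std::map<std::string, std::vector<unsigned char>> merge_event(
    const std::map<std::string, std::string>& stream_files) {
  std::map<std::string, std::vector<unsigned char>> raw;
  for (const auto& [name, file] : stream_files) {
    raw[name] = pop_event(file);
  }
  return raw;
}
```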

2. Unpacking

Actual unpacking will be done here in a new Packing module. This module will have a single processor which grabs the raw data map and loops through its entries. The key will help us determine which translator to use, and that translator will receive the buffer of data to decode/unpack. The translators will be dynamically loaded, so we could have them reside in the subsystem modules or in this module, depending on our preference.
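
The core dispatch could look roughly like this; the Translator interface and its member names are assumptions for illustration, not a settled API:

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

// hypothetical translator interface; a matching encode() hook could
// later support the Packer direction mentioned above
class Translator {
 public:
  virtual ~Translator() = default;
  virtual bool canTranslate(const std::string& stream_name) const = 0;
  virtual void decode(const std::vector<unsigned char>& buffer) = 0;
};

// core of the unpacking processor: pick a translator by stream name
// and hand it the packed buffer to turn into Digi objects
void unpack(const std::map<std::string, std::vector<unsigned char>>& raw,
            const std::vector<std::unique_ptr<Translator>>& translators) {
  for (const auto& [name, buffer] : raw) {
    for (const auto& t : translators) {
      if (t->canTranslate(name)) {
        t->decode(buffer);
        break;
      }
    }
  }
}
```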

tomeichlersmith commented 3 years ago

These notes are so old they aren't useful anymore. With the merging of #1031 and #1025, I'm calling this done.