SuperNEMO-DBD / Falaise

Simulation, Reconstruction and Analysis Software for the SuperNEMO Experiment
http://supernemo.org/Falaise
GNU General Public License v3.0

Task: Rationalize Storage and Retrieval of Metadata in Falaise #157

Open drbenmorgan opened 4 years ago

drbenmorgan commented 4 years ago

This task brings together several metadata issues into one for coherence's sake. The directly affected issues, which are merged into this one, are:

Indirectly related issues are #90 on Conditions Data access and the demos of the Art mechanisms in SuperNEMO-DBD/Impressionist#6. This task is also strongly correlated with data cataloguing, as metadata likely forms the basis of that system.

Overview

In Falaise applications, "metadata" is stored in .brio files in a TTree named "GI". Each entry in the tree is an instance of the datatools::properties dictionary class, each of which is intended to form a section of an overall datatools::multi_properties instance. Falaise applications have the option to read/write this multi_properties object to a text file separate from the .brio file holding the events.
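
To make this layout concrete, here is a minimal Python sketch of the assembly described above: each entry of the "GI" tree is treated as one named section, and the sections are collected into a single container. It uses plain dictionaries rather than the Bayeux datatools classes. The function name `assemble_sections`, the `name` key, and the example section names and keys are assumptions for illustration only, not part of Falaise.

```python
def assemble_sections(entries):
    """Collect per-entry property dictionaries into one sectioned container.

    Each entry stands in for one datatools::properties instance read from
    the "GI" tree; the result stands in for the multi_properties object.
    Assumption: every entry carries its section name under the key "name".
    """
    sections = {}
    for entry in entries:
        name = entry["name"]
        if name in sections:
            raise ValueError(f"duplicate metadata section: {name!r}")
        # Store everything except the section name itself.
        sections[name] = {k: v for k, v in entry.items() if k != "name"}
    return sections


# Hypothetical entries: the real section names and keys are undocumented.
entries = [
    {"name": "application", "version": "x.y.z"},
    {"name": "simulation", "number_of_events": 100},
]
metadata = assemble_sections(entries)
```

Rejecting duplicate section names is a design choice of this sketch: it keeps each section traceable to exactly one entry in the store.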

At present, reading and writing the "GI" store is handled by the application code, with no way for plugin modules to read or write it. No external programs are available to query the metadata of a .brio file.

There is no documentation on what is stored in the .brio file.

Proposed Improvements

  1. Provide a falaise-file-dumper application to allow query/dumping of the metadata store(s).
    • Remove the ability of flsimulate/flreconstruct themselves to read/write the metadata to a separate file. This can lead to a loss of coherence between meta/event data.
    • The effect of a "separate" metadata file can be achieved by running falaise-file-dumper on the brio file after it's generated.
    • Depending on the systems outside of Falaise that consume the metadata, we may want JSON output as well as multi/properties.
  2. Improve what data is always stored. For example, #57 outlines some of the data that should be present from a simulation run.
    • From flsimulate, the full settings should be stored.
    • From flreconstruct, the full pipeline script and configuration should be stored, including any custom variant settings. In addition, all settings from the input file should be stored (directly or indirectly via some primary key).
    • The main idea is so that the complete processing "provenance" can be tracked.
    • For example, when reading a raw or simulated data file into flreconstruct, we must reconstitute the same geometry for the detector the data originates from!
  3. Maybe provide a "metadata service" that modules in flreconstruct can access.
    • Not for things like conditions data!
    • "Maybe" as metadata is really run/process level info, not event.
    • Likely implement as "write once, read many" to avoid edit/overwrite.
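
As a rough illustration of the "write once, read many" idea in item 3, combined with the JSON output mentioned in item 1, the following Python sketch shows one possible shape for such a store. It is not a proposal for the actual Falaise interface: the class `MetadataStore`, its methods, and the section name and key used in the example are all hypothetical.

```python
import json


class MetadataStore:
    """Hypothetical write-once, read-many store of metadata sections."""

    def __init__(self):
        self._sections = {}

    def write(self, section, properties):
        # Refuse to overwrite: a section can be written exactly once.
        if section in self._sections:
            raise KeyError(f"section {section!r} already written")
        # Shallow copy so later changes to the caller's dict do not leak in.
        self._sections[section] = dict(properties)

    def read(self, section):
        # Return a shallow copy so readers cannot edit the stored values.
        return dict(self._sections[section])

    def to_json(self):
        # JSON view of all sections for consumers outside Falaise.
        return json.dumps(self._sections, indent=2, sort_keys=True)


store = MetadataStore()
store.write("flsimulate", {"number_of_events": 100})
settings = store.read("flsimulate")
settings["number_of_events"] = 0  # only changes the local copy
```

Handing out copies rather than references is what keeps the stored run-level information stable once written. Note that the copies here are shallow, so nested values would need a deep copy to get the same guarantee.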

Anything else?

Task(s)

The first "file dumper" task is pretty independent of the rest, so it can be its own PR. The second and third tasks need some input from you all (the second task replaces #118).

@bmorgan can work on the first, but the second and third will need volunteers, at least for testing and review, given their overlap with Reconstruction/Analysis/Data Quality.

drbenmorgan commented 4 years ago

See also BxCppDev/Bayeux#53, which requests implementation of an "include" mechanism for the current scripting files. That'll allow much simpler composition and tracking of configuration, e.g. with inclusion, you could write a script:

...
#@include "snemo/reconstruction/default.conf"

[name="ChargedParticleTracker" type="snemo::reconstruction::charged_particle_tracking_module"]
AFD.minimal_delayed_time : real as time = 25 us

i.e. you would be able to override single or multiple parameters without having to copy/paste the entire script. Since it also means the configuration of flsimulate/flreconstruct ends up in a single multi_properties instance, that'll be much easier to store, track, update and reconstitute.
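
The override behaviour can be sketched in Python as a merge of two sectioned configurations: the included defaults and the few parameters set in the script above. This only models the intended outcome, not the Bayeux implementation of the include mechanism. The function `apply_overrides`, the placeholder default value, and the key `AFD.some_other_setting` are assumptions made up for the example.

```python
def apply_overrides(base, overrides):
    """Return a new configuration: base sections updated by overrides."""
    # Copy each section so the included defaults are left untouched.
    merged = {name: dict(props) for name, props in base.items()}
    for name, props in overrides.items():
        # Overridden keys replace defaults; all other keys are kept.
        merged.setdefault(name, {}).update(props)
    return merged


# Stand-in for the included defaults; values and second key are invented.
defaults = {
    "ChargedParticleTracker": {
        "AFD.minimal_delayed_time": "<default>",
        "AFD.some_other_setting": "<default>",
    },
}
# The single parameter overridden in the script above.
overrides = {
    "ChargedParticleTracker": {"AFD.minimal_delayed_time": "25 us"},
}
config = apply_overrides(defaults, overrides)
```

The merged result is a single sectioned structure, which is the property that makes the full configuration straightforward to store alongside the event data.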