LDMX-Software / ldmx-sw

The Light Dark Matter eXperiment simulation and reconstruction framework.
https://ldmx-software.github.io
GNU General Public License v3.0
22 stars 21 forks source link

Framework fails to copy basic-type branches to output file if they are observed #1369

Open tomeichlersmith opened 5 months ago

tomeichlersmith commented 5 months ago

Describe the bug When writing an output file while reading an input file, the branches that are basic-type and being observed by the process are not copied into the output file properly. This is difficult to explain, so it is best to understand by looking through the example in To Reproduce.

To Reproduce I originally observed this with some actual simulation data I'm working with, but I condensed it into a smaller reproducible example here. I reference a few C++ source and python files that I've copied here since they are small. The TLDR on those files is that I produce two branches data and unobs which are both simple float branches. The "analysis" only looks at the data branch while creating a new file where both branches are copied into it. The resulting events file has a data branch that only has corrupted-looking data while the unobs branch has data matching what was originally produced.

Files for Reproducing ### Produce.cxx ```cpp #include "Framework/EventProcessor.h" #include class Produce : public framework::Producer { /// the random number generator, unseeded so it produces the same results each time std::mt19937 rng; /// the distribution of values in the Hits vector std::uniform_real_distribution rand_float; public: Produce(const std::string& name, framework::Process& p) : framework::Producer(name, p), rng{}, // this is where a seed for the RNG would be put rand_float{0.,100.} {} ~Produce() = default; void produce(framework::Event& event) final { float data = rand_float(rng); float unobs = rand_float(rng); event.add("data", data); event.add("unobs", unobs); } }; // Produce DECLARE_PRODUCER(Produce); ``` ### prod.py ```python from LDMX.Framework import ldmxcfg p = ldmxcfg.Process('prod') p.maxEvents = 10 p.outputFiles = [ 'events.root' ] p.sequence = [ ldmxcfg.Producer.from_file('Produce.cxx') ] ``` ### Ana.cxx ```cpp #include "Framework/EventProcessor.h" class Ana : public framework::Analyzer { public: Ana(const std::string& name, framework::Process& p) : framework::Analyzer(name, p) {} ~Ana() = default; void analyze(const framework::Event& event) final { const auto& data = event.getObject("data"); std::cout << data << std::endl; } }; // Ana DECLARE_ANALYZER(Ana); ``` ### ana.py ```python from LDMX.Framework import ldmxcfg p = ldmxcfg.Process('ana') p.inputFiles = [ 'events.root' ] p.outputFiles = [ 'ana-events.root' ] p.sequence = [ ldmxcfg.Analyzer.from_file('Ana.cxx') ] ``` ### check.py ```python import sys import uproot with uproot.open({sys.argv[1]:'LDMX_Events'}) as t: data = t['data_prod'].array(library='np') print(f'{data=}') unobs = t['unobs_prod'].array(library='np') print(f'{unobs=}') ```
  1. denv init ldmx/pro:v4.0.1
  2. denv fire prod.py
  3. denv fire ana.py
  4. Observe Error with check.py
tom@appa:~/code/ldmx/ana-basic-rewrite-bug$ denv python3 check.py events.root
data=array([81.47237 , 90.57919 , 12.698682, 91.337585, 63.235928,  9.75404 ,
       27.849823, 54.68815 , 95.75069 , 96.48885 ], dtype=float32)
unobs=array([13.547701, 83.500854, 96.88678 , 22.103405, 30.816704, 54.722057,
       18.838198, 99.28813 , 99.64613 , 96.76949 ], dtype=float32)
tom@appa:~/code/ldmx/ana-basic-rewrite-bug$ denv python3 check.py ana-events.root
data=array([2.6254541e+38, 1.6092322e-15, 1.6092322e-15, 1.6092322e-15,
       1.6092322e-15, 1.6092322e-15, 1.6092322e-15, 1.6092322e-15,
       1.6092322e-15, 1.6092322e-15], dtype=float32)
unobs=array([13.547701, 83.500854, 96.88678 , 22.103405, 30.816704, 54.722057,
       18.838198, 99.28813 , 99.64613 , 96.76949 ], dtype=float32)

Desired behavior I would really like for the Framework to be able to copy all types of branches whether or not they have been observed. While it sounds funny, being observed does trigger a different method of copying within the Framework since - if no processor needs a branch - we optimize the copy by letting ROOT copy the buffers directly instead of trying to load them into memory. This is not an issue with how the simple data types are being read as can be seen by the printouts within the "analysis" processor.

Environment: Output of denv config:

denv_workspace="/home/tom/code/ldmx/ana-basic-rewrite-bug"
denv_name="ana-basic-rewrite-bug"
denv_image="ldmx/pro:v4.0.1"
denv_mounts=""
denv_shell="/bin/bash -i"
denv_network="true"
Docker version 26.1.3, build b72abbb
tvami commented 5 months ago

But isnt this a feature of uproot? like if you did an event loop in the check.py it should work for the ana-events.root, no?

tomeichlersmith commented 5 months ago

No, I originally found this issue with a subsequent ldmx-sw standalone processor and used uproot as a double check. (proof below).

To be very clear. This is only an issue with basic-type branches that are "root" branches and not the sub-branches that are generated when we create a branch with a more complicated parent class (e.g. std::vector<double> does not have this issue). That is why we haven't observed this before since most data is passed around in packages of these more complicated packages.

tom@nixos:~/code/ldmx/1369-copy-obs-basic-types$ denv fire prod.py 
---- LDMXSW: Loading configuration --------
Processor source file /home/tom/code/ldmx/1369-copy-obs-basic-types/Produce.cxx is newer than its compiled library /home/tom/code/ldmx/1369-copy-obs-basic-types/libProduce.so (or library does not exist), recompiling...
done compiling /home/tom/code/ldmx/1369-copy-obs-basic-types/Produce.cxx
---- LDMXSW: Configuration load complete  --------
---- LDMXSW: Starting event processing --------
---- LDMXSW: Event processing complete  --------
tom@nixos:~/code/ldmx/1369-copy-obs-basic-types$ denv fire ana.py 
---- LDMXSW: Loading configuration --------
Processor source file /home/tom/code/ldmx/1369-copy-obs-basic-types/Ana.cxx is newer than its compiled library /home/tom/code/ldmx/1369-copy-obs-basic-types/libAna.so (or library does not exist), recompiling...
done compiling /home/tom/code/ldmx/1369-copy-obs-basic-types/Ana.cxx
---- LDMXSW: Configuration load complete  --------
---- LDMXSW: Starting event processing --------
81.4724
90.5792
12.6987
91.3376
63.2359
9.75404
27.8498
54.6881
95.7507
96.4889
---- LDMXSW: Event processing complete  --------
tom@nixos:~/code/ldmx/1369-copy-obs-basic-types$ denv fire reana.py 
---- LDMXSW: Loading configuration --------
---- LDMXSW: Configuration load complete  --------
---- LDMXSW: Starting event processing --------
-6.38457e+35
-2.53744e-11
-2.53744e-11
-2.53744e-11
-2.53744e-11
-2.53744e-11
-2.53744e-11
-2.53744e-11
-2.53744e-11
-2.53744e-11
---- LDMXSW: Event processing complete  --------

where prod.py and ana.py (and their referenced C++ files) are the same as in the original description. reana.py is copied below for completely but is just running the same ana processor but over the copied events file.

from LDMX.Framework import ldmxcfg
p = ldmxcfg.Process('ana')
p.inputFiles = [ 'ana-events.root' ]
p.sequence = [ ldmxcfg.Analyzer.from_file('Ana.cxx') ]
Wrapping Basic Types Avoids this Issue If we wrap the basic type in a more complicated class, we can get around this issue easily. ### Additions #### to Produce.cxx:produce ```cpp std::vector wrap{data}; event.add("wrap", wrap); ``` #### to Ana.cxx:analyze ```cpp std::cout << data << " "; // instead of std::endl const auto& wrap = event.getCollection("wrap").at(0); std::cout << wrap << std::endl; ``` ### Printouts Left side is unwrapped, right side is wrapped ``` tom@nixos:~/code/ldmx/1369-copy-obs-basic-types$ denv fire prod.py ---- LDMXSW: Loading configuration -------- Processor source file /home/tom/code/ldmx/1369-copy-obs-basic-types/Produce.cxx is newer than its compiled library /home/tom/code/ldmx/1369-copy-obs-basic-types/libProduce.so (or library does not exist), recompiling... done compiling /home/tom/code/ldmx/1369-copy-obs-basic-types/Produce.cxx ---- LDMXSW: Configuration load complete -------- ---- LDMXSW: Starting event processing -------- ---- LDMXSW: Event processing complete -------- tom@nixos:~/code/ldmx/1369-copy-obs-basic-types$ denv fire ana.py ---- LDMXSW: Loading configuration -------- Processor source file /home/tom/code/ldmx/1369-copy-obs-basic-types/Ana.cxx is newer than its compiled library /home/tom/code/ldmx/1369-copy-obs-basic-types/libAna.so (or library does not exist), recompiling... done compiling /home/tom/code/ldmx/1369-copy-obs-basic-types/Ana.cxx ---- LDMXSW: Configuration load complete -------- ---- LDMXSW: Starting event processing -------- 81.4724 81.4724 90.5792 90.5792 12.6987 12.6987 91.3376 91.3376 63.2359 63.2359 9.75404 9.75404 27.8498 27.8498 54.6881 54.6881 95.7507 95.7507 96.4889 96.4889 ---- LDMXSW: Event processing complete -------- tom@nixos:~/code/ldmx/1369-copy-obs-basic-types$ denv fire reana.py ---- LDMXSW: Loading configuration -------- ---- LDMXSW: Configuration load complete -------- ---- LDMXSW: Starting event processing -------- 3.46836e+17 81.4724 3.46882e+17 90.5792 3.46882e+17 12.6987 3.46882e+17 91.3376 3.46882e+17 63.2359 3.46882e+17 9.75404 3.46882e+17 27.8498 3.46882e+17 54.6881 3.46882e+17 95.7507 3.46882e+17 96.4889 ---- LDMXSW: Event processing complete -------- ```
tomeichlersmith commented 2 months ago

So, for any readers who didn't participate in writing the Bus which holds the event objects (i.e. everyone else), the handling of these in-memory objects is rather opaque. With this in mind, I'm going to elucidate a potential source of the issue.


ROOT does not expose a common interface for setting addresses of branches.[^1] I eventually found a solution of using TBranchElement::SetObject for all "higher level" branches and TBranch::SetAddress for the lower-level (BSILFD[^2]) branches. This is written into the attach method of the Bus::Passenger which uses argument-type-deduction to delegate to attachBasic for BSILFD types and attach for everything else. You'll notice that this is the exact separation we are observing. Once we wrap a float (for example) in a more complicated class the issue goes away.

This makes me want to investigate the attachBasic implementation to see if we need to modify it.

https://github.com/LDMX-Software/ldmx-sw/blob/ebbf27abb5b46398e8539c7ab02cfd9e77adcc69/Framework/include/Framework/Bus.h#L513-L530

Perhaps adding a branch->SetAutoDelete(false) like

https://github.com/LDMX-Software/ldmx-sw/blob/ebbf27abb5b46398e8539c7ab02cfd9e77adcc69/Framework/include/Framework/Bus.h#L475-L477

which is what is done in attach.

[^1]: Caveat: I could revisit this and look at using TBranch::SetAddress everywhere. At the time, I could not get that to work and so here we are.

[^2]: Bool Short Int Long Float Double