LDMX-Software / fire

Event-by-event processing framework using HDF5 and C++17
https://ldmx-software.github.io/fire/
GNU General Public License v3.0

Histogram Helpers #9

Open tomeichlersmith opened 2 years ago

tomeichlersmith commented 2 years ago

I need to determine whether fire should support a HistogramPool. This would significantly affect how a merging program (#4) would operate, and it may not even be beneficial given how efficient h5py and numpy are on the analysis end.

tomeichlersmith commented 2 years ago

I've put a lot of thought into this and I think it is a good idea to have a clear delineation on when the user should use Python-based interaction with the data files and when they should use C++-based interaction. I think the clearest separation is on filling histograms. At this point in analysis, we transition from "heavy-duty" calculations to making plots "pretty" and so I think it is a good idea to intentionally avoid implementing a C++-based histogram filling tool.

Instead, I think a Python module that helps the user fill histograms with numpy and serialize them with h5py is appropriate. This enforces the separation where C++ processors should be used to calculate new event objects while Python is used to fill histograms, merge them, and plot them.
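A minimal sketch of what such a helper might look like, assuming a layout where each histogram is an HDF5 group holding a `counts` and an `edges` dataset. The names `save_hist` and `load_hist` and this layout are hypothetical and not part of fire; the `energy` array stands in for a variable that a C++ processor would have written into the event.

```python
import h5py
import numpy as np


def save_hist(group, name, counts, edges):
    """Store one 1D histogram as a sub-group holding counts and bin edges."""
    h = group.create_group(name)
    h.create_dataset("counts", data=counts)
    h.create_dataset("edges", data=edges)


def load_hist(group, name):
    """Read back the (counts, edges) pair written by save_hist."""
    h = group[name]
    return h["counts"][:], h["edges"][:]


# Stand-in for a per-event variable that a C++ processor put into the event.
rng = np.random.default_rng(seed=0)
energy = rng.normal(loc=50.0, scale=10.0, size=1000)

# The binning decision is made here, in Python, not in the C++ chain.
counts, edges = np.histogram(energy, bins=20, range=(0.0, 100.0))

# An in-memory file keeps the sketch free of side effects; a real helper
# would open a path on disk instead.
with h5py.File("hists.h5", "w", driver="core", backing_store=False) as f:
    save_hist(f, "energy", counts, edges)
    loaded_counts, loaded_edges = load_hist(f, "energy")
```

Storing the bin edges next to the counts means a later merging or plotting step can check the binning without any outside configuration.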

Notice that some calculations would be classified as "analysis", but instead of enforcing a binning decision at the C++ level, we can encourage users to calculate their final analysis variables and put those variables into the event, then fill and plot them later with Python. In the HEP arena, many users call this "ntuplizing", where the hierarchical data is flattened in order to make Python analysis easier. The way fire serializes hierarchical data already "flattens" it, but users can still have C++ processors do analysis tasks like filtering and summing, and create new event objects that can be accessed by a Python plotter.

omar-moreno commented 2 years ago

What type of histograms would be pooled here? Numpy?

(Sent by email in reply to Tom Eichlersmith's comment of Wed, Feb 2, 2022, 7:53 AM.)

tomeichlersmith commented 2 years ago

Sorry, to be clear, this issue was focused on potentially implementing a HistogramPool in the C++ processing chain.

My comments above would shift the focus to having Python helpers for serializing numpy histograms to HDF5 files, reading them back, and merging them. This would handle the use case of filling histograms in parallel over a large data set and then merging the resulting histograms for final plotting.
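A rough sketch of the merging step, assuming every partial histogram was filled with the same bin edges so that merging reduces to summing the counts. The function name `merge_hists` is hypothetical, and the three chunks stand in for three parallel jobs over one data set.

```python
import numpy as np


def merge_hists(hists):
    """Sum the counts of (counts, edges) histograms that share bin edges."""
    counts0, edges = hists[0]
    total = np.zeros_like(counts0)
    for counts, e in hists:
        if not np.array_equal(e, edges):
            raise ValueError("cannot merge histograms with different binning")
        total = total + counts
    return total, edges


rng = np.random.default_rng(seed=1)
data = rng.exponential(scale=2.0, size=3000)
edges = np.linspace(0.0, 10.0, 26)

# Each chunk plays the role of one parallel job filling its own histogram.
partials = [np.histogram(chunk, bins=edges) for chunk in np.array_split(data, 3)]

merged_counts, merged_edges = merge_hists(partials)

# Filling one histogram over the whole data set gives the same counts.
full_counts, _ = np.histogram(data, bins=edges)
```

Refusing to merge mismatched binnings is a design choice in this sketch; rebinning would be another option, but it is outside what this issue describes.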

tomeichlersmith commented 2 years ago

Name: TBD

@AnmolS1Z

Goals:

Features:

Strict Dependencies

Optional Dependencies