LDMX-Software / Framework

Event-by-event processing framework using CERN's ROOT and C++17

Auto Python Bindings #44

Open · tomeichlersmith opened this issue 3 years ago

tomeichlersmith commented 3 years ago

@omar-moreno and I have chatted about this idea on-and-off and I am really excited about it. I've done some surface-level research and just wanted to open an issue to keep track of my notes.

Goal

The long-term goal would be to get rid of ConfigurePython and fire altogether. We would instead call Process::run directly from a script run inside python after doing all the necessary configuration, i.e. instead of having a "configuration script", we would have a "running script" that is really similar:

from LDMX.Framework import Process
p = Process('run')

from LDMX.SimCore import simulator
sim = simulator.simulator('test')
sim.setDetector('ldmx-dev-v12')
# other configurations and calls to Cpp functions

# attach processors same as before? (hopefully)
p.sequence = [ sim ]

# pause to make sure config is correct
p.pause()

# actually run the processing pass
p.run()

# could do some post-processing python nonsense

Another goal, which would be awesome if we can get it to work, is to have a Python parent class for both the Cpp pythonizations and potentially new pure-Python processors, i.e. something like the following:


from LDMX.Framework import Process
p = Process('run')

from LDMX.Framework import Analyzer

class MyAnalyzer(Analyzer):
    def analyze(self, event):
        # event is our event bus
        # do normal python analysis nonsense
        pass

# attach the python analyzer alongside Cpp processors (sim as configured above)
p.sequence = [ sim, MyAnalyzer() ]
p.run()

Both of these would be run through python instead of fire:

ldmx python3 run.py

Tools

Both Boost.Python and cppyy have their pros and cons, based on my research.

Feature                        Boost.Python    cppyy
Changes to C++                 required        optional
CMake Interface                available       available
Inheritance Handling           available       automatic
Pre-Compiled Python Module     default         unavailable?
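
As a quick illustration of the "Inheritance Handling" row, here is a minimal cppyy cross-inheritance sketch; ToyAnalyzer and drive are made-up stand-ins defined inline with cppyy.cppdef, not Framework's actual Analyzer:

import cppyy

# toy C++ base class standing in for Framework's Analyzer; the virtual
# method is what lets C++ code dispatch into a Python override
cppyy.cppdef("""
class ToyAnalyzer {
public:
    virtual ~ToyAnalyzer() = default;
    virtual void analyze() {}
};
void drive(ToyAnalyzer& a) { a.analyze(); }  // a C++ caller
""")

from cppyy.gbl import ToyAnalyzer, drive

class MyToyAnalyzer(ToyAnalyzer):
    def analyze(self):
        print('python override, reached through C++ virtual dispatch')

drive(MyToyAnalyzer())  # prints the python message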

Long story short: it seems like Boost.Python would be the way to go if we were rebuilding from the ground up. That way, the python module would be pre-compiled (i.e. faster) and we would have more control over its behavior. However, I am not interested in rebuilding from the ground up, and therefore I am interested in using cppyy to "attach" our C++ objects to pythonic ones, similar to how ROOT does it (versions > 6.18ish).
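
As a rough sketch of that "attach" idea, here is what a cppyy pythonization could look like; the toy namespace and EventHeader fields are hypothetical, and a real library would be pulled in with cppyy.include / cppyy.load_library rather than cppyy.cppdef:

import cppyy

# stand-in C++ class; real Framework classes would come from a compiled
# library loaded with cppyy.load_library / cppyy.include
cppyy.cppdef("""
namespace toy {
struct EventHeader {
    int run = 1;
    int number = 42;
};
}
""")

# a "pythonization" runs once, lazily, when the class is first used
def pythonize(klass, name):
    if name == 'EventHeader':
        klass.__repr__ = lambda self: f'EventHeader(run={self.run}, number={self.number})'

cppyy.py.add_pythonization(pythonize, 'toy')

print(repr(cppyy.gbl.toy.EventHeader()))  # EventHeader(run=1, number=42)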

Plan

This is a drastic change to the Framework code-base, so I think this would necessarily be a long way off. Since this is also a big change in terms of user-interaction, we would need to be patient with merging anything like this in and potentially make a release separating our current method from this more pythonic one. For now, I am just collecting notes and links and maybe dipping my toe into the coding pool.

wlav commented 3 years ago

Since you mention that it is a "drastic change" and since you mention you're still "collecting notes," I'll take the liberty to provide some more background that is hopefully helpful in the decision making. :)

That way, the python module would be pre-compiled (i.e. faster)

Actually, there is no such thing as pre-compiling python bindings. The only thing that gets compiled is the recipe for constructing the bindings, not the bindings themselves, so there is no (and cannot be any) performance benefit. In fact, it may even be detrimental.

If you care mainly about CPU performance, boost.python is probably also the slowest binder around. The fastest in most cases is swig in "builtin" mode if you are using Python3, as there are optimized paths for it in the CPython interpreter for all simple cases. For most complex cases, cppyy will beat it, assuming at least Python3.8 (which has optimized call paths for closures). The absolute fastest is cppyy on PyPy, but then you have to switch Python interpreters. OTOH, cppyy has the most memory overhead (because of Cling parsing, you have to budget for an extra 100MB of memory over other binders; similarly, PyPy's memory overhead is also higher than CPython's, by about 30MB).
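
For a sense of scale, here is a minimal sketch of the kind of per-call micro-benchmark such comparisons rest on, shown with cppyy only since it needs no separate compile step; add_one is a made-up toy function, and the absolute number varies by machine and Python version:

import timeit
import cppyy

# a trivial bound function, so the timing is dominated by the
# per-call overhead of the binding layer itself
cppyy.cppdef("int add_one(int x) { return x + 1; }")
add_one = cppyy.gbl.add_one

n = 1_000_000
t = timeit.timeit(lambda: add_one(1), number=n)
print(f"~{t / n * 1e9:.0f} ns per call")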

Functionality-wise, I recommend pybind11 over boost.python. In style and use, they're pretty much the same, but pybind11 is a lot more advanced and I suspect at this point far more widely used. Also, if you look in their respective repos, you'll see which receives the most developer cycles these days: boost.python is really just in "keep alive" mode, with any new project choosing pybind11 over it. Also, pybind11 has no run-time component and installs from PyPI and conda, really simplifying life for any cross-platform software stack.

As for ROOT, although cppyy still has some ROOT heritage (and will always use Cling), what ROOT uses internally is a fork of cppyy (and it's quite a bit behind master, e.g. it does not have the optimized paths mentioned above, but it also disables optimizations for the Cling JIT, inserts expensive null-checking, etc.). You can install cppyy directly from the normal channels such as PyPI and conda-forge. If LDMX already uses ROOT for its I/O and/or analysis needs (looks like it), then sure, use the ROOT version of cppyy. But otherwise it can be used independently and it doesn't tie you to ROOT later on. (OTOH, if you are already loading ROOT into the process anyway, cppyy's memory overhead becomes a non-issue.)

tomeichlersmith commented 3 years ago

Thank you for the correction and this extra detail, @wlav! :tada: I appreciate the input :)