HSF / PyHEP.dev-workshops

PyHEP Developer workshops
https://indico.cern.ch/e/PyHEP2023.dev
BSD 3-Clause "New" or "Revised" License

python bindings for rntuple, implementation of "uproot-cpp" #15

Closed lgray closed 7 months ago

lgray commented 1 year ago

Presently, Uproot, the pure-Python implementation of ROOT I/O, is an extremely effective tool connecting the ROOT file format to the data-science and wider scientific Python ecosystems.

However, Uproot performs many heavy, GIL-bound computations that quickly limit its scaling in multithreaded environments where we want multiple data streams feeding downstream processing code. This rules out interesting compute topologies such as large thread-reentrant histogram filling and imposes the small tax of needing to spawn processes, each with its own Python interpreter (as opposed to threads sharing a single interpreter), to achieve parallel data processing.
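To make the threads-versus-processes point concrete, here is a minimal sketch using only the standard library. It is not Uproot code: `decode_basket`, `make_baskets`, and `decode_all` are hypothetical names, and the pure-Python `struct` loop is only a stand-in for GIL-bound decoding work.

```python
import struct
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from typing import List


def decode_basket(raw: bytes) -> List[float]:
    """Hypothetical stand-in for GIL-bound work: unpack big-endian float64 values
    one at a time in pure Python, so the interpreter holds the GIL throughout."""
    return [struct.unpack_from(">d", raw, offset)[0] for offset in range(0, len(raw), 8)]


def make_baskets(n_baskets: int, n_values: int) -> List[bytes]:
    """Build fake 'baskets' of serialized doubles (assumption: not a real ROOT layout)."""
    return [
        struct.pack(f">{n_values}d", *(float(i + j) for j in range(n_values)))
        for i in range(n_baskets)
    ]


def decode_all(baskets: List[bytes], executor_cls, workers: int = 4) -> List[List[float]]:
    """Decode every basket with the given executor class.

    Pass ThreadPoolExecutor (threads share one interpreter and one GIL) or
    ProcessPoolExecutor (one interpreter and one GIL per worker process).
    """
    with executor_cls(max_workers=workers) as pool:
        return list(pool.map(decode_basket, baskets))
```

Timing `decode_all(baskets, ThreadPoolExecutor)` against `decode_all(baskets, ProcessPoolExecutor)` on a few large baskets is one way to see the effect: on a GIL-enabled CPython build the threaded run is expected to gain little from extra workers, while the process-based run pays for worker start-up and for pickling data across process boundaries.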

Looking to the future: with RNTuple, Feather (which already has a Python-bound C++ implementation for this reason), and other similar high-throughput formats, it seems prudent to develop GIL-friendly Python packages for these HEP-specific data sources.

We should find people interested in pursuing and completing these critical tasks.

henryiii commented 1 year ago

FYI, there's also the per-interpreter GIL being introduced in CPython 3.12. That would allow the launching of sub-interpreters each with their own GIL, but without creating separate processes. It doesn't have a Python API in 3.12, but there will be a PyPI package allowing this to be used from Python code. (The current draft of that module is at https://pypi.org/project/interpreters-3-12/)

Don't know if that changes anything here, but something to keep in mind.

lgray commented 1 year ago

Thanks - that's good to know, but we'll be needing to deal with people using the previous interpreters for quite some time (basically until numba supports python 3.12).

jpivarski commented 1 year ago

I've been in favor of a compiled-but-Python-friendly Uproot for some time, but it's always been too large of a task—this will require dedicated effort and coordination (because I'm assuming more than one developer).

Some questions to ask about such a thing:

The main difference between these three options is which people you want to, or are able to, bring together for this. Option 1 pulls Julia developers more into the Coffea world, option 2 is for people who like blank pages, starting from scratch[^1], and option 3 is for pulling it together quickly with the Python + Numba expertise that's already in this area.

[^1]: Unfortunately, I'm one of those people who likes to start things from scratch, and the Rust option appeals to me. But it's more important to pull together things that already have some momentum. If the end result of this is that the Python and Julia HEP tools get more interchangeable, that's probably the best long-term win.

jpivarski commented 1 year ago

Oh, I forgot one (or two) more bullet points:

lgray commented 1 year ago

If UnROOT can drop the GIL then we're mostly good, FWIW.

jpivarski commented 1 year ago

@tamasgal and @Moelf (Jerry will be attending): we should learn more about the scope of UnROOT's reading (and writing?) capabilities—what data types does it cover?—and how easy it would be to use it in Python. Can we, for instance, read NanoAOD-like TTrees into Awkward Arrays, possibly through Arrow, in a process controlled by Python?

lgray commented 1 year ago

I talked this out a bit with @Moelf at CHEP and at zeroth order it seems possible but we both had a lot of questions about GIL-friendliness.

henryiii commented 1 year ago

I still think that by the time something is worked out, 3.12 will be out, a 3.12-compatible Numba will probably be out, and you might be able to solve this with current Uproot + interpreters-3-12, without rewriting much of anything. It might at least be worth testing with interpreters-3-12 and a 3.12 beta now (assuming you could make an interesting test without Numba and maybe NumPy).

Moelf commented 1 year ago

RNTuple is being implemented such that its core functionality can be built independent of root and made into python bindings

From my limited personal experience around people, I don't think this is happening soon, and regardless, a ~librntupleio.so won't come with writing capability, so I think we'd better just roll our own.


Regarding what UnROOT.jl can deliver, technology-wise I am optimistic about covering ~100% of reading (at least for the features that currently exist in the RNTuple spec). [^1]

From the analysis-adjacent user perspective, once we move to RNTuple (which will be ~100% compatible with Arrow, logically speaking), I see little need for writing out to .root files if the output flows downstream. In fact, there is a huge Arrow ecosystem ^2 that people can leverage if they do that.

[^1]: We already handle complex RNTuple schemas and NanoAOD files converted using ROOT.

lgray commented 1 year ago

Yeah - just switching to parquet / feather after reading the files in is perfectly viable IMO.

It's just a familiarity thing (and a convenience thing), people love TBrowser.

tamasgal commented 1 year ago

The traditional (read: pre-RNTuple) ROOT support in UnROOT.jl is mostly limited to primitive types, (multiply nested) std containers, and some extra streamer logic for the usual suspects. I am already planning to rewrite the core parser of UnROOT, since currently everything is a bit too static. Julia has great metaprogramming features which would allow a much better design, so a next development iteration is definitely due. Custom streamers need a bit too much care right now (unless the branch splitting is high enough). If I only had more time... ;)

Just my two cents: while I recognise all the huge benefits of RNTuple, I guess the transition phase will be fairly long (my first rough guess is that it will easily exceed 5 years) and support for TTree-based formats will be mandatory for a very long time. A tiny example from my environment is KM3NeT, which will definitely not change its low-level data format and will stick to ROOT TTrees for the next 20+ years. We have much more freedom in higher-level formats of course, where we also utilise HDF5 and Arrow-based ones ;) That being said, as Jerry emphasised, writing ROOT files will very likely become more and more obsolete downstream.

Back to the original question from Jim: I find the idea of interfacing UnROOT via Python interesting, but I have very little experience with using Julia in a Python context. A couple of years ago I played around with PyCall.jl to reuse some of our Python libraries in Julia, which was a bit cumbersome due to clashes with Numba-JITted functions. As far as I remember, that and a few Cython constructs were the biggest problems. Things have evolved since then for sure. The other way around is of course a different story. Anyway, I'll try to free up some time and play around with Julia from within Python, but I am happy if someone else explores that as well.

On 5. Jul 2023, at 18:09, Jerry Ling @.***> wrote:

RNTuple is being implemented such that its core functionality can be built independent of root and made into python bindings

From my limited personal experience around people, I don't think this is happening soon, and regardless, a ~librntupleio.so won't come with writing capability period, so I think we better just roll our own.

Regarding what UnROOT.jl can deliver, technology-wise I am optimistic about covering ~100% reading (at least for the features currently exist in RNTuple Spec https://github.com/root-project/root/blob/master/tree/ntuple/v7/doc/specifications.md).

uproot read and then hand arrow batch to Julia is viable (requires 1x more allocation, no big deal if computing heavy).

UnROOT.jl doesn't have writing to .root files function and lacks infrastructure (for chunk, TKey allocation etc.) that uproot has.

From the analysis-adjacent user perspective, once we move to RNTuple (which will be ~100% compatible with arrow logically speaking), I see small need for writing out to .root files if the output flows downstream, in fact there are huge amount of Arrow ecosystem [1] [2] that people can leverage if they do that.

Footnotes

1. https://arrow.apache.org/blog/2023/06/26/our-journey-at-f5-with-apache-arrow-part-2/
2. https://arrow.apache.org/docs/python/api/cuda.html

sudo-panda commented 1 year ago

+1

ianna commented 1 year ago

+1

nikoladze commented 1 year ago

+1