JuliaHEP / UnROOT.jl

Native Julia I/O package to work with CERN ROOT file objects (TTree and RNTuple)
https://juliahep.github.io/UnROOT.jl/
MIT License

`nanoAOD_ttbar` latency #280

Closed Moelf closed 1 year ago

Moelf commented 1 year ago
          Thanks for looking at this. You can find a file with 10 events in https://cernbox.cern.ch/s/Vy6Yhz47vDaYiWa

Originally posted by @grasph in https://github.com/JuliaHEP/UnROOT.jl/issues/273#issuecomment-1755720345

Moelf commented 1 year ago

ah ha, @grasph, looks like this is automatically "fixed" on Julia 1.10:

   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.9.3 (2023-08-24)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> using UnROOT
[ Info: Precompiling UnROOT [3cd96dde-e98d-4713-81e9-a4a1b0235ce9]

julia> f = @time ROOTFile("./nanoaod_ttbar.root");
  0.257723 seconds (304.96 k allocations: 20.962 MiB, 4.06% gc time, 97.23% compilation time: 21% of which was recompilation)

julia> tree = @time LazyTree("./nanoaod_ttbar.root", "Events");
 24.559936 seconds (11.57 M allocations: 1.527 GiB, 0.44% gc time, 97.57% compilation time: <1% of which was recompilation)
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.10.0-beta3 (2023-10-03)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> f = @time ROOTFile("./nanoaod_ttbar.root");
  0.144006 seconds (235.61 k allocations: 17.266 MiB, 98.26% compilation time)

julia> tree = @time LazyTree("./nanoaod_ttbar.root", "Events");
  1.704714 seconds (11.09 M allocations: 1.500 GiB, 5.74% gc time, 54.80% compilation time)
Moelf commented 1 year ago

It might be a fun exercise to find out what we can do for v1.9, but note that, unlike https://github.com/JuliaHEP/UnROOT.jl/pull/279#issuecomment-1756069804, this is NOT a case of inference failure causing boxing, because the total number and amount of allocations are almost identical:

11.09 M allocations: 1.500 GiB
11.57 M allocations: 1.527 GiB,
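
For contrast, here is a minimal pure-Julia sketch (hypothetical types, not UnROOT code) of what an inference-failure/boxing problem would look like: an abstractly typed field defeats inference, so the accumulator is boxed on every iteration and allocations grow with the loop length. The nearly identical allocation counts above rule out exactly this pattern.

```julia
struct Unstable
    x::Any   # abstract field type -> inference failure
end

struct Stable
    x::Int   # concrete field type -> fully inferred
end

function sumloop(v, n = 10_000)
    s = 0
    for _ in 1:n
        s += v.x   # with x::Any, each += allocates a boxed Int
    end
    return s
end

sumloop(Stable(1)); sumloop(Unstable(1))      # warm up compilation first
a_stable   = @allocated sumloop(Stable(1))    # zero: no boxing
a_unstable = @allocated sumloop(Unstable(1))  # nonzero, scales with n
```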
oschulz commented 1 year ago

What magic is 1.10 doing here, I wonder?

Moelf commented 1 year ago

Frames said this on Slack:

I know large numbers of keyword arguments (as in thousands) were causing a lot of problems for circuit modelling, because circuit models are often super parameterized. And I believe that kicked off at least a round of work to improve compiler handling of that, maybe 9-11 months ago. So it could be that work, which did make it into 1.10.

oschulz commented 1 year ago

Thanks @Moelf

Moelf commented 1 year ago

Some more comments:

It is still 17 times slower than RDataFrame and it does not scale well with the number of branches.

17x seems large, but I don't think it's problematic when it's 0.1s vs. 1.4s. Considering how long ROOT takes to compile, I'm not surprised. If anything, we can try to add more precompilation. I don't know how much wider a ROOT file gets; NanoAOD has 1.5k branches. I would argue that if a user deals with 5k columns or whatever, at that point they ought to know they should select columns.

Nothing scales infinitely to all cases. The current design has 2 main objectives:

The second point is coupled to the first, because naively you may want to just return a Dict or some lightweight DataFrame, but then your display() is all messed up; we need control over both types to customize display().
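
The point about owning display() can be sketched with a hypothetical wrapper type (MiniTree is made up for illustration, not part of UnROOT): Base.show can only be customized for types we define, not for a bare Dict.

```julia
# Hypothetical sketch: a thin wrapper over column storage lets us own
# display(), which a plain Dict would not.
struct MiniTree
    cols::Dict{Symbol,Vector}
end

function Base.show(io::IO, t::MiniTree)
    n = isempty(t.cols) ? 0 : length(first(values(t.cols)))
    print(io, "MiniTree with ", length(t.cols), " branches, ", n, " events")
end

t = MiniTree(Dict(:nMuon => [1, 2, 0], :run => [1, 1, 1]))
```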


Besides the remaining latency with 1.10.x, asking users to move to the latest beta version gives a bad impression of Julia, while the issue is that we are pushing Julia to its limits by using long tuples.

I'm saying just wait. It's already Julia 1.10-beta3, so not that far from release.

Moelf commented 1 year ago

Finally, I'm happy to have a sink keyword interface or something, taking inspiration from CSV.jl or Arrow.jl.

But what those two packages are NOT good at, and don't care about at all, is the user experience out of the box (the display, the iteration capability). We probably shouldn't do that.

We also should think harder about what function names / APIs are actually going to be called. I think LazyTree is not perfect considering RNTuple will come eventually. You want to convey 2 things:

Alternatively, we can have:

oschulz commented 1 year ago

NanoAOD has 1.5k branches

Just curious, why would anyone make that many branches? Are they used to emulate what should be arrays, or so? I'm not saying we shouldn't be able to handle it well, I'm just surprised at the sheer number.

Moelf commented 1 year ago

They include systematics: there are O(10) different types of objects you reconstruct, each with a few dozen systematics. And then there are all the L1 and HLT triggers. Many branches are already jagged. https://cms-nanoaod-integration.web.cern.ch/integration/master/mc94X_doc.html

oschulz commented 1 year ago

And we can't really do hierarchies - I get it.

Moelf commented 1 year ago

@grasph I honestly think you also just want to tell them to select. Like, it's absurd to just open all the branches unless it's your first day on the job.

I mean, first of all, tutorials should be given with slimmed files, and second, we don't have CMS nanoAOD systematics CP tools anyway, so it's pointless to read all branches.

grasph commented 1 year ago

Related issue: #272

These improvements with 1.10.x are great. Thanks, Jerry, for the investigations.

I think we should still have a low-latency option, especially to address the data-frame-like analysis, for which fast row access is not required. It can be a parameter to the LazyTree constructor to switch to this option.

I made some tests showing that with some precompilation (for common column types) and untyped columns, we can reach a 0.3s setup time. Measurements (done with Julia 1.9.1) are in the comments of the code below.

using UnROOT

file = ROOTFile("nanoaod.root");

tree = file["Events"];
brnames = keys(tree);

# Collect the LazyBranch of every requested branch into an untyped vector
function f1(file, t, brs)
    res = Vector{LazyBranch}(undef, length(brs))
    for (i,x) in enumerate(brs)
        res[i] = LazyBranch(file, t[x])
    end
    res
end

# Same, but as Symbol => LazyBranch pairs, ready for a Dict or DataFrame
function f4(file, t, brs)
    res = Vector{Pair{Symbol, LazyBranch}}(undef, length(brs))
    for (i,x) in enumerate(brs)
        res[i] = (Symbol(x) => LazyBranch(file, t[x]))
    end
    res
end

# This list of branches covers all branch types found in the nanoaod.root test file
brwarmup = ["run","event","BeamSpot_type","BeamSpot_sigmaZ","nboostedTau","boostedTau_idAntiEle2018","boostedTau_jetIdx","boostedTau_charge","boostedTau_chargedIso","Electron_convVeto","TrigObj_id","L1Reco_step"]

# Timing provided in comments were measured with only one group
# of code between two #----... lines  included. UnROOT v0.10.15

#----------------------------------------------------------------------
# 0.78s
@time f1(file, tree, brwarmup)
# 0.27s
@time brs = f1(file, tree, brnames);
#----------------------------------------------------------------------

#----------------------------------------------------------------------
##1.05 s
#@time brs = f1(file, tree, brnames);
##0.27 s
#@time brs = f1(file, tree, brnames);
#----------------------------------------------------------------------

#----------------------------------------------------------------------
#using DataFrames
## 1.04s
#@time "dict warmup" wu = Dict(f4(file, tree, brwarmup));
##0.95 seconds (1.09 M allocations: 73.972 MiB, 2.18% gc time, 99.83% compilation time)
#@time "warmup df wrapping" df = DataFrame(wu, copycols=false);
## 0.27s
#@time "dict" brs = Dict(f4(file, tree, brnames));
## 0.0015s (0.9 seconds without the dataframe-wrapping warmup)
#@time "df wrapping" df = DataFrame(brs, copycols=false);
## 0.15s
#@time "sum" sum(df.nMuon)
#----------------------------------------------------------------------

#----------------------------------------------------------------------
#using DataFrames
## 1.9s
#@time "DataFrame warmup" DataFrame(f4(file, tree, brwarmup), copycols=false);
## 0.28s
#@time "DataFrame" df = DataFrame(f4(file, tree, brnames), copycols=false);
## 0.15s
#@time "sum" sum(df.nMuon)
#----------------------------------------------------------------------

#----------------------------------------------------------------------
##1.34 s
#@time "Dict warmup" brs = Dict(f4(file, tree, brnames));
##0.29s
#@time "Dict" brs = Dict(f4(file, tree, brnames));
##0.21s
#@time "sum" sum(brs[:nMuon])
#----------------------------------------------------------------------
Moelf commented 1 year ago

Thanks a lot! Can this wait? We can discuss with @tamasgal at juliahep workshop maybe

grasph commented 1 year ago

It would be good to have a solution before the workshop and we can discuss about improvements there.

The main contributors to the number of branches of the CMS nanoAOD are actually the trigger line flags: there is one branch per trigger line, and there are many lines.

Although the setup overhead is large compared to the time to read all branches, the main point is not about reading all of them, but about having access to all the branches from the LazyTree object.

The issue with the display also comes from compilation time: it took me 279 s to call show(::LazyTree) with UnROOT v0.10.17 and Julia 1.9.1, and 148 s with UnROOT v0.10.17 and Julia 1.10.0-beta3. So 1.10.x does NOT solve the issue for the display, nor for when the LazyTree is instantiated interactively without a trailing ';'.

The issue is twofold:

  1. DataFrame (or any user choice table) use case
  2. LazyTree for a large number of branches.

The current solution for (1) is to pass a LazyTree to the constructor of the table of choice. It's inefficient if LazyTree instantiation takes time. This can be solved by providing a method that fills any container following the Tables.jl interface, à la CSV.read, or a method like the f4() from my previous comment. It is simple to implement and quite independent of the existing code.

(2) can be addressed by providing two options for LazyTree: with typed (the current NamedTuple option) and untyped columns. The code is more involved, but I already have a prototype.
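
The typed-vs-untyped trade-off can be sketched in plain Julia (toy columns, not the actual prototype): a NamedTuple bakes the branch names and element types into its type, so every new set of branches triggers fresh compilation, while a Dict{Symbol,Any} has one shared type regardless of schema, trading static field access for dynamic lookup.

```julia
# Typed columns: the schema lives in the type itself.
typed_cols = (nMuon = [1, 2, 0], run = [1, 1, 1])

# Untyped columns: one container type for any number of branches.
untyped_cols = Dict{Symbol,Any}(:nMuon => [1, 2, 0], :run => [1, 1, 1])

s1 = sum(typed_cols.nMuon)      # statically resolved field access
s2 = sum(untyped_cols[:nMuon])  # dynamic lookup, no per-schema compilation
```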

tamasgal commented 1 year ago

I definitely tend towards the first approach: something like a generic interface which can be turned into a LazyTree or a DataFrame, similar to what you mentioned with CSV.read.

I have not spent enough time on this problem yet, to be honest. I use UnROOT on a daily basis, but only with a handful of (fairly large and complex) branches, and for those it of course works totally fine.

Moelf commented 1 year ago

ok what about this temporary stopgap solution:

we add a new keyword argument so something like:

LazyTree(file, treename; sink = ...)

If !isnothing(sink), then after forming the Dict{Symbol, LazyBranch}, we call the sink constructor. The (mental) interface here is that the sink should be compatible with a Tables.jl ColumnTable, so sink() must be able to take in our Dict.


This has two advantages:

  1. non-breaking (we can do a breaking change for UnROOT 1.0, up for discussion)
  2. similar to CSV.read
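
A hedged sketch of the proposed sink keyword (all names hypothetical, not UnROOT's actual implementation): build the Symbol => column pairs once, return the plain Dict by default, and otherwise hand it to whatever sink constructor the user supplies.

```julia
# lazytree_sketch stands in for LazyTree; any callable accepting a
# Dict of columns works as a sink.
function lazytree_sketch(col_pairs::Vector; sink = nothing)
    cols = Dict(col_pairs)
    return sink === nothing ? cols : sink(cols)
end

col_pairs = [:nMuon => [1, 2, 0], :run => [1, 1, 1]]
d  = lazytree_sketch(col_pairs)                        # default: plain Dict
nt = lazytree_sketch(col_pairs; sink = c -> (; c...))  # e.g. a NamedTuple sink
```

In the real package the sink would receive Dict{Symbol, LazyBranch} and could be DataFrames.DataFrame, mirroring how CSV.read(file, sink) works.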
grasph commented 1 year ago

You would expect the LazyTree() function to return a LazyTree instance, so I think it's cleaner to have a new function, e.g. getTree(), that takes a sink type. If the sink type is LazyTree, then it will do the same as LazyTree().

You mentioned breaking changes, all the proposals I've made so far preserve backward compatibility.

grasph commented 1 year ago

(f4() is also a good option)

Moelf commented 1 year ago

I agree. But I don't know what we should name the getTree() function yet, which is why I'm trying to defer naming this until 1.0.

I think effectively this is a getTreeOrRNTuple(), but we should give it a good name so it's not too confusing that it does something different from plain getindex().

grasph commented 1 year ago

We can pick a name and write in the documentation that it can change. An alternative name would be readTree(), although the reading is deferred.

Moelf commented 1 year ago

Again, readTree doesn't capture the fact that you read RNTuple with the same function.

If we pick a name and make it public API, that's potentially one more thing to break at 1.0.

grasph commented 1 year ago

same as for LazyTree ;)

grasph commented 1 year ago

Anyway, let's just add a sink option to LazyTree as you proposed. I've updated the code in the sink-option branch of my fork. Let me know if I should make a PR. See changes here (only one file modified).

grasph commented 1 year ago

There is still an issue with master (b59a63). Below are the LazyTree setup times with the sink-opt patch on top of v0.10.15 and on master, for my nanoAOD file.

| Table type | master | v0.10.15 |
| --- | --- | --- |
| LazyTree | 120 s | 21 s |
| DataFrames.DataFrame | 100 s | 2.3 s |
Moelf commented 1 year ago

same as for LazyTree ;)

But we already have LazyTree, so adding a sink option is not breaking; we will just recommend a new interface at 1.0.

But if you add getTree() and remove it at 1.0, that's an additional breaking.

Can you show more complete code for what you're doing with DataFrame? I can investigate soon.

grasph commented 1 year ago

The issue came from the merge. I merged with origin/master instead of origin/main. Now it's fixed. I'll do the PR.

PS: while investigating, I found out why the TDirectory cache, now removed, was not working properly.