JuliaHEP / UnROOT.jl

Native Julia I/O package to work with CERN ROOT files objects (TTree and RNTuple)
https://juliahep.github.io/UnROOT.jl/
MIT License

[RNTuple] accessing nested structs is not lazy enough #314

Open Moelf opened 6 months ago

Moelf commented 6 months ago

Consider the following top-level field (a column, in the table analogy):

├─ Symbol("AntiKt4TruthDressedWZJetsAux:") ⇒ Struct
│                                            ├─ :m ⇒ Vector
│                                            │       ├─ :offset ⇒ Leaf{UnROOT.Index64}(col=23)
│                                            │       └─ :content ⇒ Leaf{Float32}(col=24)
│                                            ├─ :pt ⇒ Vector
│                                            │        ├─ :offset ⇒ Leaf{UnROOT.Index64}(col=17)
│                                            │        └─ :content ⇒ Leaf{Float32}(col=18)
│                                            ├─ :eta ⇒ Vector
│                                            │         ├─ :offset ⇒ Leaf{UnROOT.Index64}(col=19)
│                                            │         └─ :content ⇒ Leaf{Float32}(col=20)
│                                            ├─ :constituentWeights ⇒ Vector
│                                            │                        ├─ :offset ⇒ Leaf{UnROOT.Index64}(col=29)
│                                            │                        └─ :content ⇒ Vector
│                                            │                                      ├─ :offset ⇒ Leaf{UnROOT.Index64}(col=30)
│                                            │                                      └─ :content ⇒ Leaf{Float32}(col=31)

Currently, when we loop over the events, the access is too "eager":

for evt in rntuple
    evt.var"AntiKt4TruthDressedWZJetsAux:".pt
end

In this case, we only want to access the storage related to the pTs (i.e. RNTuple columns 17 and 18), but in reality we read all the columns (17, 18, 19, 20, 23, 24, 29, 30, 31) as soon as we evaluate evt.var"AntiKt4TruthDressedWZJetsAux:".
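To make the difference concrete, here is a minimal sketch of eager versus lazy sub-field access. Everything in it is a hypothetical stand-in (READ_LOG, read_column, JET_COLUMNS, eager_jets, LazyJets); none of it is UnROOT.jl API, and the "column read" just logs the column number and returns dummy data. Only three of the sub-fields from the tree above are included to keep it short.

```julia
# Hypothetical stand-ins for illustration only; not UnROOT.jl API.
const READ_LOG = Int[]                    # records which columns get "read"

# Pretend to read one column from storage and log the access.
function read_column(col::Int)
    push!(READ_LOG, col)
    return Float32[col, col + 1]          # dummy payload
end

# (offset, content) column numbers per sub-field, taken from the tree above.
const JET_COLUMNS = (pt = (17, 18), eta = (19, 20), m = (23, 24))

# Eager: materialize the whole struct, which touches every column.
eager_jets() = map(cols -> map(read_column, cols), JET_COLUMNS)

# Lazy: an empty proxy that reads nothing until a sub-field is requested.
struct LazyJets end

function Base.getproperty(::LazyJets, name::Symbol)
    cols = getfield(JET_COLUMNS, name)    # only the columns backing `name`
    return map(read_column, cols)
end

empty!(READ_LOG); eager_jets().pt
eager_reads = copy(READ_LOG)              # all six columns were read

empty!(READ_LOG); LazyJets().pt
lazy_reads = copy(READ_LOG)               # only columns 17 and 18 were read
```

The design point is that the struct-level access returns a cheap proxy, and the actual reads are deferred to getproperty on that proxy. Whether this fits UnROOT.jl's internals is an open question here.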

One possible way is to switch to AwkwardArray.jl by @jpivarski and represent the whole RNTuple as one big RecordArray. In theory this works for columnar access (i.e. rntuple.var"AntiKt4TruthDressedWZJetsAux:".pt), but it may not solve our event-iteration problem.

Another possible way is to use StructArrays.jl more smartly. @peremato, did you run into anything like this in EDM4hep.jl? If so, did you find anything that works?

peremato commented 6 months ago

> Another possible way is to use StructArrays.jl more smartly. @peremato, did you run into anything like this in EDM4hep.jl? If so, did you find anything that works?

With EDM4hep, I think I do not have this problem, since the top level is a Vector of POD structs rather than a struct of vectors as in this case. It is true that I read all the fields (I guess), because in the end I construct an SoA of the container.
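For readers less familiar with the two layouts, a small sketch of the distinction follows. The Jet type and the values are made up for illustration; this is not an EDM4hep.jl or UnROOT.jl type, and the struct-of-vectors side is a plain NamedTuple rather than a StructArray.

```julia
# Hypothetical POD type for illustration; not an EDM4hep.jl type.
struct Jet
    pt::Float32
    eta::Float32
end

# Vector of POD structs: each element carries all of its fields.
aos = [Jet(10f0, 0.5f0), Jet(20f0, -1.2f0)]

# Struct of vectors: one array per field.
soa = (pt = Float32[10, 20], eta = Float32[0.5, -1.2])

# Getting the pTs from the vector of structs walks whole elements,
# so every field has already been read by the time `aos` exists.
pts_from_aos = [jet.pt for jet in aos]

# Getting the pTs from the struct of vectors selects one column;
# `eta` is never touched.
pts_from_soa = soa.pt
```

With the first layout there is nothing to be lazy about at the field level, which matches the observation that all fields are read. With the second, laziness per sub-field is possible in principle, and that is the case this issue is about.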