JuliaHEP / UnROOT.jl

Native Julia I/O package to work with CERN ROOT files objects (TTree and RNTuple)
https://juliahep.github.io/UnROOT.jl/
MIT License

Do not manage to read a TTree with a structure of arrays of basic types #298

Open peremato opened 10 months ago

peremato commented 10 months ago

EDM4hep ROOT files store, in a tree called podio_metadata, an object of the type

struct podio::CollectionIDTable {
  vector<unsigned int> m_collectionIDs;
  vector<string> m_names;
};

The following is a reproducer:

using UnROOT

struct CollectionIDTable
    collectionIDs::Vector{UInt32}
    names::Vector{String}
end

f = "Output_REC.root"

tfile = ROOTFile(f)
# tfile.customstructs["podio::CollectionIDTable"] = CollectionIDTable
meta = UnROOT.LazyTree(tfile, "podio_metadata", ["events___idTable"])

The test file can be downloaded from https://github.com/peremato/EDM4hep.jl/blob/main/examples/Output_REC.root

Moelf commented 10 months ago

@tamasgal this thing hits fID equals -2, I think we're missing something fundamental here

tamasgal commented 10 months ago

Actually the only missing thing in this case is the leaf type support for vector<unsigned int> (see https://github.com/JuliaHEP/UnROOT.jl/pull/299). I should have added those, so you can blame me ;) The vector<string> stuff is already supported. You don't need a custom streamer.

With https://github.com/JuliaHEP/UnROOT.jl/pull/299 the following works (without it, reading the m_collectionIDs part fails):

julia> using UnROOT

julia> f = ROOTFile("/Users/tamasgal/Downloads/Output_REC.root")
ROOTFile with 3 entries and 51 streamers.
/Users/tamasgal/Downloads/Output_REC.root
├─ runs (TTree)
│  └─ "PARAMETERS"
├─ events (TTree)
│  ├─ "AllCaloHitContributionsCombined"
│  ├─ "_AllCaloHitContributionsCombined_particle"
│  ├─ "BeamCal_Hits"
│  ├─ "⋮"
│  ├─ "YokeEndcapCollection"
│  ├─ "_YokeEndcapCollection_contributions"
│  └─ "PARAMETERS"
└─ podio_metadata (TTree)
   ├─ "events___idTable"
   ├─ "events___CollectionTypeInfo"
   ├─ "runs___idTable"
   ├─ "runs___CollectionTypeInfo"
   ├─ "PodioBuildVersion"
   └─ "EDMDefinitions"

julia> LazyBranch(f, "podio_metadata/events___idTable/m_names")
1-element LazyBranch{SubArray{String, 1, Vector{String}, Tuple{UnitRange{Int64}}, true}, UnROOT.Offsetjagg, ArraysOfArrays.VectorOfVectors{String, Vector{String}, Vector{Int32}, Vector{Tuple{}}}}: 
 ["AllCaloHitContributionsCombined", "EventHeader", "BeamCalClusters", "BeamCalClusters_particleIDs", "BeamCalCollection", "BeamCalRecoParticles", "BeamCalRecoParticles_particleIDs", "BeamCal_Hits", "BuildUpVertices", "BuildUpVertices_RP"  …  "TightSelectedPandoraPFOs", "InnerTrackerBarrelHitsRelations", "InnerTrackerEndcapHitsRelations", "OuterTrackerBarrelHitsRelations", "OuterTrackerEndcapHitsRelations", "RefinedVertexJets_rel", "RelationCaloHit", "RelationMuonHit", "VXDEndcapTrackerHitRelations", "VXDTrackerHitRelations"]

julia> LazyBranch(f, "podio_metadata/events___idTable/m_collectionIDs")
1-element LazyBranch{SubArray{UInt32, 1, Vector{UInt32}, Tuple{UnitRange{Int64}}, true}, UnROOT.Offsetjagg, ArraysOfArrays.VectorOfVectors{UInt32, Vector{UInt32}, Vector{Int32}, Vector{Tuple{}}}}: 
 UInt32[0x3a25675d, 0xd793ab91, 0xf0d073dd, 0x1d19206c, 0xc298a348, 0xc29370d2, 0x3954b563, 0xd2b19e7b, 0xfd03f5d0, 0x310a0f04  …  0x5fa7cf93, 0x029be193, 0x743732ae, 0xc42bbbee, 0xd1211017, 0x8dac6bb6, 0x603a5016, 0xdf24625a, 0xbb4cff22, 0x178c9330]

julia> LazyTree(f, "podio_metadata", [Regex("events___idTable/(.*)") => s"\1"])
 Row │ m_names                                                    m_collectionIDs                                ⋯
     │ SubArray{String                                            SubArray{UInt32                                ⋯
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────
 1   │ ["AllCaloHitContributionsCombined", "EventHeader", "BeamC  [975529821, 3616779153, 4040192989, 488185964, ⋯
                                                                                                  1 column omitted

tamasgal commented 10 months ago

Fixed in v0.10.21.

@peremato let me know if it works for you.

Btw. just a little bit of clarification: the custom parsing always applies to a branch and not a tree (or set of branches). It's usually needed when the split-level is low (so that one needs to deserialise compound structures) or if the type for a specific branch is simply not supported.

Moelf commented 10 months ago

huh, I don't know why this doesn't error due to fID== -2, maybe because custom struct logic doesn't hit that?

tamasgal commented 10 months ago

How did you get the fID == -2 bubble up? Sorry for my ignorance, I have not looked closely enough 😆

tamasgal commented 10 months ago

Ah I see:

julia> UnROOT.LazyTree(f, "podio_metadata", ["events___idTable"])
fID = -2   # <- added a @show here...
ERROR: BoundsError: attempt to access 2-element Vector{Any} at index [-1]
Stacktrace:
  [1] getindex(A::Vector{Any}, i1::Int64)
    @ Base ./essentials.jl:13
  [2] streamerfor(f::ROOTFile, branch::UnROOT.TBranchElement_10)
    @ UnROOT ~/Dev/UnROOT.jl/src/root.jl:161

Yes, that negative fID is weird. I have some notes on it but I have no solution yet.

EDIT: and yes, if you go to the deepest split level and there is an interpretation (like the one for vector<unsigned int>) you will not hit the logic with the fID

tamasgal commented 10 months ago

In this case the UnROOT.streamerfor needs to figure out the parser logic from the actual streamer, which is there, but fails due to the lookup. The lookup in this case is not index based (on fID) but can be retrieved via the fName. (below I also printed the available streamers).

It all boils down to bringing the automatic parser generation to this level, so that it works without using the split branches.

julia> UnROOT.streamerfor(f, "podio::CollectionIDTable")
e.streamer.fName = "TObject"
e.streamer.fName = "TCollection"
e.streamer.fName = "podio::GenericParameters"
e.streamer.fName = "pair<string,vector<int> >"
e.streamer.fName = "pair<string,vector<float> >"
e.streamer.fName = "pair<string,vector<string> >"
e.streamer.fName = "pair<string,vector<double> >"
e.streamer.fName = "vector<int>"
e.streamer.fName = "vector<float>"
e.streamer.fName = "edm4hep::CaloHitContributionData"
e.streamer.fName = "edm4hep::Vector3f"
e.streamer.fName = "podio::ObjectID"
e.streamer.fName = "edm4hep::CalorimeterHitData"
e.streamer.fName = "edm4hep::ClusterData"
e.streamer.fName = "edm4hep::ParticleIDData"
e.streamer.fName = "edm4hep::SimCalorimeterHitData"
e.streamer.fName = "edm4hep::ReconstructedParticleData"
e.streamer.fName = "edm4hep::VertexData"
e.streamer.fName = "edm4hep::EventHeaderData"
e.streamer.fName = "edm4hep::SimTrackerHitData"
e.streamer.fName = "edm4hep::Vector3d"
e.streamer.fName = "edm4hep::MCRecoTrackerHitPlaneAssociationData"
e.streamer.fName = "edm4hep::TrackerHitPlaneData"
e.streamer.fName = "edm4hep::Vector2f"
e.streamer.fName = "edm4hep::ObjectID"
e.streamer.fName = "edm4hep::MCParticleData"
e.streamer.fName = "edm4hep::Vector2i"
e.streamer.fName = "edm4hep::RecoParticleVertexAssociationData"
e.streamer.fName = "edm4hep::MCRecoCaloAssociationData"
e.streamer.fName = "edm4hep::TrackData"
e.streamer.fName = "edm4hep::TrackState"
e.streamer.fName = "edm4hep::Quantity"
e.streamer.fName = "podio::CollectionIDTable"
UnROOT.StreamerInfo(UnROOT.TStreamerInfo{UnROOT.TObjArray}("podio::CollectionIDTable", "", 0xe9251d6f, 1, UnROOT.TObjArray("", 0, Any[UnROOT.TStreamerSTL
  version: UInt16 0x0004
  fOffset: Int64 0
  fName: String "m_collectionIDs"
  fTitle: String ""
  fType: Int32 500
  fSize: Int32 24
  fArrayLength: Int32 0
  fArrayDim: Int32 0
  fMaxIndex: Array{Int32}((5,)) Int32[0, 0, 0, 0, 0]
  fTypeName: String "vector<unsigned int>"
  fXmin: Float64 0.0
  fXmax: Float64 0.0
  fFactor: Float64 0.0
  fSTLtype: Int32 1
  fCtype: Int32 13
, UnROOT.TStreamerSTL
  version: UInt16 0x0004
  fOffset: Int64 0
  fName: String "m_names"
  fTitle: String ""
  fType: Int32 500
  fSize: Int32 24
  fArrayLength: Int32 0
  fArrayDim: Int32 0
  fMaxIndex: Array{Int32}((5,)) Int32[0, 0, 0, 0, 0]
  fTypeName: String "vector<string>"
  fXmin: Float64 0.0
  fXmax: Float64 0.0
  fFactor: Float64 0.0
  fSTLtype: Int32 1
  fCtype: Int32 61
])), Set{Any}())
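
The name-based lookup described above can be sketched in plain Julia. This is a simplified illustration, not UnROOT's code: `ElementSketch`, `StreamerSketch`, `streamer_by_name` and `resolve` are hypothetical names, and treating a negative fID as "return the whole streamer" is an assumption, since the thread leaves the meaning of fID == -2 open.

```julia
# Hypothetical, simplified stand-ins; these are not UnROOT's types.
struct ElementSketch
    fName::String
    fTypeName::String
end

struct StreamerSketch
    fName::String
    elements::Vector{ElementSketch}
end

# Find a streamer by class name instead of by position.
function streamer_by_name(streamers, classname)
    idx = findfirst(s -> s.fName == classname, streamers)
    idx === nothing && error("no streamer named $classname")
    return streamers[idx]
end

# Return one element when fID is a valid 0-based index, otherwise fall back
# to the whole streamer (assumption: a negative fID means "no single element").
function resolve(streamers, classname, fID)
    s = streamer_by_name(streamers, classname)
    return 0 <= fID < length(s.elements) ? s.elements[fID + 1] : s
end

table = StreamerSketch("podio::CollectionIDTable", [
    ElementSketch("m_collectionIDs", "vector<unsigned int>"),
    ElementSketch("m_names", "vector<string>"),
])
streamers = [StreamerSketch("TObject", ElementSketch[]), table]

whole = resolve(streamers, "podio::CollectionIDTable", -2)
first_element = resolve(streamers, "podio::CollectionIDTable", 0)
println(whole.fName)
println(first_element.fName)
```

With the guard in place, an fID of -2 no longer produces an out-of-bounds index; the caller gets the full streamer and can inspect both elements.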

I need to study what uproot is doing with the negative fID, since it's able to get this right:

>>> import uproot

>>> f = uproot.open("/Users/tamasgal/Downloads/Output_REC.root")

>>> f["podio_metadata/events___idTable"]
<TBranchElement 'events___idTable' (2 subbranches) at 0x00010b58eb20>

>>> f["podio_metadata/events___idTable"].array()
<Array [{m_collectionIDs: [...], ...}] type='1 * {m_collectionIDs: var * ui...'>

Moelf commented 10 months ago

yeah, from my very quick look, uproot does not do anything with fID explicitly

tamasgal commented 10 months ago

Yes... I mean, obviously the information is sitting right in front of us ;) So in that case UnROOT should create the corresponding struct and add a readtype or whatever dynamically. That's what's missing.

tamasgal commented 10 months ago

It's just a bit weird that this works fine in so many cases 😆 :

https://github.com/JuliaHEP/UnROOT.jl/blob/77b75d8f8a7d5a6a2b8c408efbcac2c00817e798/src/root.jl#L160

peremato commented 10 months ago

> Fixed in v0.10.21.
>
> @peremato let me know if it works for you.
>
> Btw. just a little bit of clarification: the custom parsing always applies to a branch and not a tree (or set of branches). It's usually needed when the split-level is low (so that one needs to deserialise compound structures) or if the type for a specific branch is simply not supported.

First, thanks very much @tamasgal. It works great once you know how to do it.

The way to select branches and leaves is still very confusing to me (perhaps due to a lack of proper documentation, or of prior knowledge of how ROOT files are organised). This works nicely:

julia> meta = UnROOT.LazyTree(tfile, "podio_metadata", [Regex("events___idTable/(.*)") => s"\1"])
 Row │ m_names                                                                                                  m_collectionIDs                                ⋯
     │ SubArray{String                                                                                          SubArray{UInt32                                ⋯
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 1   │ ["AllCaloHitContributionsCombined", "EventHeader", "BeamCalClusters", "BeamCalClusters_particleIDs", "B  [975529821, 3616779153, 4040192989, 488185964, ⋯
                                                                                                                                                1 column omitted

but what I would naively do does not:

julia> meta = UnROOT.LazyTree(tfile, "podio_metadata", ["m_names", "m_collectionIDs"])
ERROR: MethodError: no method matching LazyBranch(::ROOTFile, ::Missing)

Closest candidates are:
  LazyBranch(::ROOTFile, ::AbstractString)
   @ UnROOT ~/Development/UnROOT.jl/src/iteration.jl:134
  LazyBranch(::ROOTFile, ::Union{UnROOT.TBranch, UnROOT.TBranchElement})
   @ UnROOT ~/Development/UnROOT.jl/src/iteration.jl:116

Stacktrace:
 [1] LazyBranch(f::ROOTFile, s::String)
   @ UnROOT ~/Development/UnROOT.jl/src/iteration.jl:134
 [2] LazyTree(f::ROOTFile, tree::UnROOT.TTree, treepath::String, branches::Vector{String}; sink::Type{LazyTree})
   @ UnROOT ~/Development/UnROOT.jl/src/iteration.jl:450
 [3] LazyTree
   @ ~/Development/UnROOT.jl/src/iteration.jl:432 [inlined]
 [4] LazyTree(f::ROOTFile, s::String, branches::Vector{String}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ UnROOT ~/Development/UnROOT.jl/src/iteration.jl:393
 [5] LazyTree(f::ROOTFile, s::String, branches::Vector{String})
   @ UnROOT ~/Development/UnROOT.jl/src/iteration.jl:390
 [6] top-level scope
   @ REPL[6]:1

The following works, but the column names are wrong:

julia> meta = UnROOT.LazyTree(tfile, "podio_metadata", ["events___idTable/m_names", "events___idTable/m_collectionIDs"])
 Row │ events___idTabl                                                                                          events___idTabl                                ⋯
     │ SubArray{UInt32                                                                                          SubArray{String                                ⋯
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 1   │ [975529821, 3616779153, 4040192989, 488185964, 3264783176, 3264442578, 961852771, 3534855803, 424489518  ["AllCaloHitContributionsCombined", "EventHead ⋯
                                                                                                                                                1 column omitted

I also tried the naming convention used for the other tree, "events", i.e. <branch>_<leaf>, but that does not work either. I see that for LazyBranch the convention is <branch>/<leaf>. Overall, it is very confusing.
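
One way to bridge short leaf names and the full <branch>/<leaf> paths is a small lookup helper. The sketch below is plain Julia and does not call UnROOT: `fullpath` is a hypothetical helper, and the list of paths is copied by hand from the outputs above rather than read from the file.

```julia
# Full sub-branch paths as they appear in this thread.
paths = ["events___idTable/m_names", "events___idTable/m_collectionIDs"]

# Hypothetical helper: find the unique full path whose last component is `leaf`.
function fullpath(paths, leaf)
    hits = filter(p -> last(split(p, '/')) == leaf, paths)
    length(hits) == 1 || error("expected exactly one match for $leaf, got $(length(hits))")
    return only(hits)
end

# Resolve the short names that were tried naively above.
selection = [fullpath(paths, leaf) for leaf in ["m_names", "m_collectionIDs"]]
println(selection)
```

The resulting vector has the same form as the branch list in the call that works above; the helper fails loudly when a leaf name is missing or ambiguous.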

tamasgal commented 10 months ago

Yes, the problem is indeed that you need to know a little bit about the ROOT structure's subtleties. As you can see, uproot also requires you to point to events___idTable but then does the automatic RecArray creation from the sub-branches. This is of course something I'd like to have in UnROOT as well but it requires a lot of restructuring. As always, you learn ROOT iteratively and early design decisions need to be changed quite often (I had so many iterations in UnROOT already 😆 ).

I really hope that I will find a longer time slot (2-4 weeks) next year to spend a significant amount of time on refactoring UnROOT.

>>> import uproot

>>> f = uproot.open("/Users/tamasgal/Downloads/Output_REC.root")

>>> f["podio_metadata/events___idTable"]
<TBranchElement 'events___idTable' (2 subbranches) at 0x00010b58eb20>

>>> f["podio_metadata/events___idTable"].array()
<Array [{m_collectionIDs: [...], ...}] type='1 * {m_collectionIDs: var * ui...'>

tamasgal commented 10 months ago

Regarding the events tree, you do the same, but also here you need to provide the full path to the sub-branches:

julia> LazyTree(f, "events", [r"BeamCal_Hits/BeamCal_Hits.*\.(\w+)$" => s"\1"])
 Row │ time             x                energyError      energy           y   ⋯
     │ SubArray{Float3  SubArray{Float3  SubArray{Float3  SubArray{Float3  Sub ⋯
─────┼──────────────────────────────────────────────────────────────────────────
 1   │ []               []               []               []               []  ⋯
 2   │ []               []               []               []               []  ⋯
 3   │ []               []               []               []               []  ⋯
 4   │ []               []               []               []               []  ⋯
 5   │ []               []               []               []               []  ⋯
 6   │ []               []               []               []               []  ⋯
 7   │ []               []               []               []               []  ⋯
 8   │ []               []               []               []               []  ⋯
 9   │ []               []               []               []               []  ⋯
 10  │ []               []               []               []               []  ⋯
 11  │ []               []               []               []               []  ⋯
 12  │ []               []               []               []               []  ⋯
 13  │ [0.0, 0.0,       [-8.2, -8.       [0.0, 0.0,       [0.0267, 0       [63 ⋯
 14  │ []               []               []               []               []  ⋯
 15  │ []               []               []               []               []  ⋯
 16  │ []               []               []               []               []  ⋯
 17  │ []               []               []               []               []  ⋯
 18  │ []               []               []               []               []  ⋯
 19  │ [0.0, 0.0]       [3.17, 3.2       [0.0, 0.0]       [0.0305, 0       [-1 ⋯
 20  │ []               []               []               []               []  ⋯
 21  │ []               []               []               []               []  ⋯
 22  │ [0.0, 0.0]       [151.0, 15       [0.0, 0.0]       [0.0128, 0       [-8 ⋯
  ⋮  │        ⋮                ⋮                ⋮                ⋮             ⋱
                                                    4 columns and 3 rows omitted

peremato commented 10 months ago

I was not doing this. If I do

julia> events = LazyTree(f, "events", ["BeamCal_Hits"])
 Row │ BeamCal_Hits_en            BeamCal_Hits_ti            BeamCal_Hits_en            BeamCal_Hits_po            BeamCal_Hits_po            BeamCal_Hits_po  ⋯
     │ SubArray{Float3            SubArray{Float3            SubArray{Float3            SubArray{Float3            SubArray{Float3            SubArray{Float3  ⋯
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 1   │ []                         []                         []                         []                         []                         []               ⋯
 2   │ []                         []                         []                         []                         []                         []               ⋯
 3   │ []                         []                         []                         []                         []                         []               ⋯
 4   │ []                         []                         []                         []                         []                         []               ⋯
 5   │ []                         []                         []                         []                         []                         []               ⋯
 6   │ []                         []                         []                         []                         []                         []               ⋯
 7   │ []                         []                         []                         []                         []                         []               ⋯
 8   │ []                         []                         []                         []                         []                         []               ⋯
 9   │ []                         []                         []                         []                         []                         []               ⋯
 10  │ []                         []                         []                         []                         []                         []               ⋯
 11  │ []                         []                         []                         []                         []                         []               ⋯
 12  │ []                         []                         []                         []                         []                         []               ⋯
 13  │ [0.0, 0.0, 0.0, 0.0, 0.0,  [0.0, 0.0, 0.0, 0.0, 0.0,  [0.0267, 0.0214, 0.0853,   [3290.0, 3290.0, 3290.0,   [-8.2, -8.16, -1.92, 31.1  [63.1, 63.1, 66. ⋯
 14  │ []                         []                         []                         []                         []                         []               ⋯
 15  │ []                         []                         []                         []                         []                         []               ⋯
 16  │ []                         []                         []                         []                         []                         []               ⋯
 17  │ []                         []                         []                         []                         []                         []               ⋯
 18  │ []                         []                         []                         []                         []                         []               ⋯
 19  │ [0.0, 0.0]                 [0.0, 0.0]                 [0.0305, 0.0754]           [-3350.0, -3360.0]         [3.17, 3.21]               [-19.2, -19.2]   ⋯
 20  │ []                         []                         []                         []                         []                         []               ⋯
 21  │ []                         []                         []                         []                         []                         []               ⋯
 22  │ [0.0, 0.0]                 [0.0, 0.0]                 [0.0128, 0.00132]          [3360.0, 3380.0]           [151.0, 151.0]             [-86.8, -86.8]   ⋯
 23  │ [0.0]                      [0.0]                      [2.02f-6]                  [3390.0]                   [-62.9]                    [61.3]           ⋯

and the leaves get names of the form <branch>_<leaf>:

julia> names(events)
8-element Vector{String}:
 "BeamCal_Hits_energyError"
 "BeamCal_Hits_time"
 "BeamCal_Hits_energy"
 "BeamCal_Hits_position_z"
 "BeamCal_Hits_position_x"
 "BeamCal_Hits_position_y"
 "BeamCal_Hits_cellID"
 "BeamCal_Hits_type"

tamasgal commented 10 months ago

I mean, technically we can do this LazyTree creation on the fly automatically but I could not come up with a way which works reliably, especially with all those funny (read weird) namings and dot-madness. So eventually we need to ask the user to provide the regex to help UnROOT make reasonable fieldnames like x instead of BeamCal_Hits.position.x which would anyways not be valid due to the dots, so it needs to be translated to BeamCal_Hits_position_x or so, but notice here that BeamCal_Hits is redundant, since the branch is already called like that. ROOT however still stores that with that prefix. BUT not always and I still don't know why. We have some logic in UnROOT which works quite OK but it will still give you funny names in many cases. That's why I introduced that regex-thing, which I highly abuse 😉 see here:

https://github.com/KM3NeT/KM3io.jl/blob/65318a1265fd6bfa064b06a5c4721711160e50f1/src/root/offline.jl#L164-L193
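
The renaming that such a regex pair performs can be tried out with plain Julia string functions, without opening a file. In this sketch the two paths are assumptions inferred from the regex and the names mentioned above, and the "flatten everything" variant is only a naive illustration, not UnROOT's actual naming logic.

```julia
# Full sub-branch paths in the style seen above (dots separate nested members).
paths = ["BeamCal_Hits/BeamCal_Hits.energy",
         "BeamCal_Hits/BeamCal_Hits.position.x"]

# The same kind of Regex => SubstitutionString pair that is passed to LazyTree.
rule = r"BeamCal_Hits/BeamCal_Hits.*\.(\w+)$" => s"\1"

# `replace` applies the pattern and substitutes the captured group.
short = [replace(p, rule) for p in paths]
println(short)

# Naive alternative (an assumption, not UnROOT's logic):
# turn every '/' and '.' into '_' to obtain a valid identifier.
flat = [replace(p, r"[/.]" => "_") for p in paths]
println(flat)
```

The first variant keeps only the last member name, while the flattened variant shows the redundant branch prefix discussed above.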

Actually that is basically the place where we would need to incorporate the original streamer which tells you how to name them and how the hierarchy is structured, but it's quite complex and UnROOT then really would have to define those structs at runtime, which brings us to the...

...painful fact: if you let UnROOT define the structs, you will not be able to use those types in your own analysis code explicitly. Which means that of course Julia will happily pass you the instances, and your function will eat those types as well and everything is fine (and type-stable) but you will not be able to restrict or use those types to utilise multiple dispatch features since they are created on the fly and attached to the UnROOT namespace (that would technically be type piracy) and of course you will have to deal with dynamic dispatch all(?) the time.

That's why I kind of like that we simply use LazyTree, which is a highly parametric type, signalling that it's a universal thing (like a named tuple), but it allows you to hide your data in some container type and/or reinterpret it as your own types. So we force the use of a barrier in order to be able to make use of a solid type system. That's what I have shown in the presentation "KM3io.jl: Making UnROOT.jl comfortable for KM3NeT" (Tamas Gal).

On the other hand, you can of course provide your custom structs and make UnROOT utilise those, so you have full control and maximum efficiency. That's also shown in the presentation above, but of course requires more understanding of the underlying structures.

I use both techniques with great performance.
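
The barrier idea can be sketched without UnROOT. Here a NamedTuple of vectors stands in for one event of a lazily read table, which is an assumption made so that the example is self-contained; `Hit`, `tohits` and `total_energy` are hypothetical names, and the numbers are only illustrative.

```julia
# A user-defined type that analysis code can dispatch on.
struct Hit
    energy::Float32
    time::Float32
end

# Stand-in for one event of a lazily read table: a NamedTuple of columns.
# (Assumption: real rows would come from a LazyTree; here they are hard-coded.)
event = (energy = Float32[0.0305, 0.0754], time = Float32[0.0, 0.0])

# The barrier: convert the generic container into concrete, user-owned types.
tohits(ev) = [Hit(e, t) for (e, t) in zip(ev.energy, ev.time)]

# Downstream code is written against `Hit`, so multiple dispatch works as usual.
total_energy(hits::Vector{Hit}) = sum(h.energy for h in hits)

hits = tohits(event)
println(total_energy(hits))
```

Everything after `tohits` sees only `Vector{Hit}`, so the analysis code does not depend on how the reader names or wraps its columns.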

tamasgal commented 10 months ago

> I was not doing this. If I do

Yes that works too, if you are fine with the UnROOT naming ;)

peremato commented 10 months ago

Hi Tom. I agree we can do several things and hide the UnROOT level. If you want, have a look at what I have been doing with EDM4hep.jl. I am mapping a simple Julia type (isbits) to a set of columns in the LazyTree within a StructArray, in a recursive manner. This is very convenient and performs well for some use cases. There are some examples, like ttbar_digits.jl, that illustrate what you can do. I gave a presentation this week to the team developing this event model. It is very encouraging.
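
A minimal sketch of such a recursive mapping, written in plain Julia: it is neither the EDM4hep.jl implementation nor StructArrays, and the types `Vector3f` and `CaloHit`, the helper `build`, the <field>_<subfield> column naming and the values are all assumptions made for illustration.

```julia
struct Vector3f
    x::Float32
    y::Float32
    z::Float32
end

struct CaloHit
    energy::Float32
    position::Vector3f
end

# Flat columns named <field> or <field>_<subfield>, as a NamedTuple of vectors.
columns = (energy     = Float32[0.0128, 0.00132],
           position_x = Float32[151.0, 151.0],
           position_y = Float32[-86.8, -86.8],
           position_z = Float32[3360.0, 3380.0])

# Build element `i` of type `T` from the columns, recursing into nested structs.
function build(::Type{T}, cols, i, prefix = "") where {T}
    args = map(fieldnames(T), fieldtypes(T)) do name, FT
        key = Symbol(prefix, name)
        isprimitivetype(FT) ? getproperty(cols, key)[i] : build(FT, cols, i, string(key, "_"))
    end
    return T(args...)
end

hit = build(CaloHit, columns, 2)
println(hit.position.z)
```

Primitive fields are read straight from the column of the same name, while nested struct fields extend the prefix and recurse, so the element is assembled only when it is indexed.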