Mu2e / EventNtuple

Event-based analysis ntuple for the Mu2e Experiment
Apache License 2.0

Add interface for filtering on demtsh leaves to python utility #167

Open bonventre opened 4 months ago

bonventre commented 4 months ago

demtsh leaves no longer appear as keys in the uproot TTree object, so the whole branch must be converted to an awkward array (filter_name cannot be used to select a subset of leaves). Tested with uproot 5.3.8rc2.

AndrewEdmonds11 commented 4 months ago

Thanks, Richie. I think this is a general issue with vector< vector > branches. Here is the output of trkana.show(filter_name=['dem', 'dem.*', 'demfit', 'demtsh'], interpretation_width=100)

name                 | typename                 | interpretation
---------------------+--------------------------+-----------------------------------------------------------------------------------------------------
dem                  | vector<mu2e::TrkInfo>    | AsGroup(<TBranchElement 'dem' (29 subbranches) at 0x7f73ad37d400>, {'dem.status': AsJagged(AsDtyp...
dem/dem.status       | int32_t[]                | AsJagged(AsDtype('>i4'))
dem/dem.goodfit      | int32_t[]                | AsJagged(AsDtype('>i4'))
dem/dem.seedalg      | int32_t[]                | AsJagged(AsDtype('>i4'))
... snip ...
dem/dem.avgedep      | float[]                  | AsJagged(AsDtype('>f4'))
demfit               | std::vector<std::vect... | AsObjects(AsVector(True, AsVector(False, Model_mu2e_3a3a_TrkFitInfo)))
demtsh               | std::vector<std::vect... | AsObjects(AsVector(True, AsVector(False, Model_mu2e_3a3a_TrkStrawHitInfo)))

The dem branch can have its individual leaves accessed because it is just a vector and I guess ROOT has made subbranches for each member in the struct. The demtsh and demfit branches don't have the same interpretations.
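A minimal sketch of why the filter finds nothing for demtsh, assuming filter_name behaves like a glob match against the branch names listed by show(). The matching is approximated here with the standard-library fnmatch module rather than uproot itself, and the select helper is hypothetical:

```python
from fnmatch import fnmatchcase

# Branch names copied from the show() output above (subbranch list abbreviated).
branch_names = [
    "dem", "dem.status", "dem.goodfit", "dem.seedalg", "dem.avgedep",
    "demfit", "demtsh",
]

def select(names, patterns):
    """Return the names matching any glob pattern (stand-in for filter_name)."""
    return [n for n in names if any(fnmatchcase(n, p) for p in patterns)]

# dem is split into subbranches, so its leaves exist as names that can be matched.
dem_leaves = select(branch_names, ["dem.*"])

# demtsh is stored as a single unsplit branch, so there is no name to match.
demtsh_leaves = select(branch_names, ["demtsh.*"])

print(dem_leaves)
print(demtsh_leaves)
```

The second selection comes back empty because there is no per-leaf name to filter on, only the demtsh branch as a whole.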

We can see the same thing in ROOT with trkana->Print("dem*"): dem has subbranches but demfit and demtsh do not

******************************************************************************
*Br    0 :dem       : Int_t dem_                                             *
*Entries :       10 : Total  Size=      17888 bytes  File Size  =        126 *
*Baskets :        1 : Basket Size=      32000 bytes  Compression=   1.27     *
*............................................................................*
*Br    1 :dem.status : Int_t status[dem_]                                    *
*Entries :       10 : Total  Size=        744 bytes  File Size  =        129 *
*Baskets :        1 : Basket Size=      32000 bytes  Compression=   1.26     *
*............................................................................*
*Br    2 :dem.goodfit : Int_t goodfit[dem_]                                  *
*Entries :       10 : Total  Size=        749 bytes  File Size  =        130 *
*Baskets :        1 : Basket Size=      32000 bytes  Compression=   1.26     *
*............................................................................*
... snip ...
*............................................................................*
*Br   30 :demfit    : vector<vector<mu2e::TrkFitInfo> >                      *
*Entries :       10 : Total  Size=       3398 bytes  File Size  =       1239 *
*Baskets :        1 : Basket Size=      32000 bytes  Compression=   2.34     *
*............................................................................*
*............................................................................*
*Br   31 :demlh     : vector<vector<mu2e::LoopHelixInfo> >                   *
*Entries :       10 : Total  Size=       2421 bytes  File Size  =       1578 *
*Baskets :        1 : Basket Size=      32000 bytes  Compression=   1.22     *
*............................................................................*
... snip ...
*............................................................................*
*Br   57 :demtsh    : vector<vector<mu2e::TrkStrawHitInfo> >                 *
*Entries :       10 : Total  Size=     105745 bytes  File Size  =      68136 *
*Baskets :        5 : Basket Size=      32000 bytes  Compression=   1.54     *
*............................................................................*
*Br   58 :demtsm    : vector<vector<mu2e::TrkStrawMatInfo> >                 *
*Entries :       10 : Total  Size=      27154 bytes  File Size  =      12775 *
*Baskets :        1 : Basket Size=      32000 bytes  Compression=   2.09     *
*............................................................................*

I've had a quick play with the splitlevel of the branches and it doesn't seem to have helped...

I think we either live with this, or we flatten things down to one dimension and associate each hit/fit with a track via an id. See some discussion here about how you could use the id in uproot: https://github.com/scikit-hep/uproot5/discussions/229. This would be a significant change, though it may be worth it.
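To illustrate the flattened option, here is a rough sketch in plain Python for a single event. The names hit_time, hit_trkid and group_by_track are hypothetical and the values are made up; nothing here reflects an existing ntuple layout:

```python
from collections import defaultdict

# Hypothetical flat layout for one event: one entry per hit, plus the index
# of the track each hit belongs to. Values are invented for illustration.
hit_time = [708.0, 717.0, 727.0, 650.0, 661.0]
hit_trkid = [0, 0, 0, 1, 1]

def group_by_track(values, track_ids):
    """Rebuild the per-track grouping from a flat list and its track ids."""
    grouped = defaultdict(list)
    for value, tid in zip(values, track_ids):
        grouped[tid].append(value)
    return dict(grouped)

# Recover the nested view only when it is needed.
per_track = group_by_track(hit_time, hit_trkid)

# Select the hits of one track using the id column alone.
track1_times = [t for t, tid in zip(hit_time, hit_trkid) if tid == 1]

print(per_track)
print(track1_times)
```

With this layout each per-hit quantity would be a one-dimensional branch, so it could be read on its own, at the cost of carrying the extra id column and regrouping by hand.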

It also looks like having vector< vector > may be slower to read (https://github.com/scikit-hep/uproot5/discussions/327), although that may have since been solved with AwkwardForth (https://arxiv.org/pdf/2102.13516), and I haven't noticed things being particularly slow.

brownd1978 commented 4 months ago

A possible solution is to define the end times as a struct with named values instead of an array. This same problem presumably affects all the contents currently using std::array.

sam-grant commented 4 months ago

Hi. I'm working through issues, trying to establish if they're still a problem.

This issue is fundamental to the way that "local" track fit variables are stored in TrkAna. I don't think it's possible to access the individual leaves in these types of branches directly, you have to load the entire branch first. However, one option to optimise things a bit could be to load the branch inside a function which only returns the leaves you want. That way the entire branch won't hang around in memory.

Something like this?

import uproot
import awkward as ak

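def GetLeaves(fileName, branchName, leafNames):
    # Load one branch and return only the requested leaves, zipped into a single array

    leaves = {}

    with uproot.open(fileName + ":TrkAna/trkana") as tree:
        # The whole branch has to be read, since its leaves are not separate keys
        branch = tree.arrays([branchName])

        # Keep only the requested leaves
        for leafName in leafNames:
            leaves[leafName] = branch[branchName][leafName]

        # Combine the selected leaves into one array of records
        leaves = ak.zip(leaves)

    return leaves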

fileName = "trkana.root"
array = GetLeaves(fileName=fileName, branchName="demfit", leafNames=["time", "sid", "mom"])
print(array[0])

[[{time: 708, sid: 0, mom: {...}}, {...}, {time: 727, sid: 2, mom: {...}}]]

I tested this quickly and it returns the same result as loading the entire branch and then printing the leaves one-by-one, like this:

with uproot.open(fileName + ":TrkAna/trkana") as tree:     
    array = tree.arrays(["demfit"])
print(array["demfit"]["time"][0])
print(array["demfit"]["sid"][0])
print(array["demfit"]["mom"][0])

[[708, 717, 727]]
[[0, 1, 2]]
[[{fCoordinates: {fX: -76.9, fY: 36.1, fZ: 57.8}}, {...}, {...}]]

Let me know what you think.

bonventre commented 4 months ago

I didn't know about ak.zip, that's definitely convenient. I had tried something similar using the uproot batching feature

a = {field: [] for field in fields}
for batch in uproot.iterate(files, filter_name=["kltsh"]):
    for field in fields:
        a[field].append(ak.flatten(batch["kltsh"][field]).to_numpy())

for field in fields:
    a[field] = np.concatenate(a[field])

and was able to process a large trkana dataset with everything fitting in memory. It was using 100%, but I think that's probably just Python's garbage collector not being proactive. So I think this is OK for now.
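The batching pattern can be sketched without any ROOT file. In the example below, fake_batches is a hypothetical stand-in for uproot.iterate and flatten_all is a stand-in for fully flattening an awkward array; both are assumptions made only so the sketch runs on its own, and the numbers are invented:

```python
from itertools import chain

def fake_batches():
    """Stand-in for uproot.iterate: yields one dict of nested lists per batch.

    Nesting is event -> track -> hit, mimicking a vector<vector<...>> branch.
    """
    yield {"time": [[[708, 717], [727]]],
           "sid": [[[0, 1], [2]]],
           "edep": [[[0.1, 0.2], [0.3]]]}
    yield {"time": [[[650]], [[661, 670]]],
           "sid": [[[0]], [[1, 2]]],
           "edep": [[[0.4]], [[0.5, 0.6]]]}

def flatten_all(nested):
    """Flatten the event -> track -> hit nesting into one flat list."""
    return list(chain.from_iterable(chain.from_iterable(nested)))

# Only these fields are kept; everything else in a batch is dropped
# as soon as the next batch is read.
fields = ["time", "sid"]
collected = {field: [] for field in fields}

for batch in fake_batches():
    for field in fields:
        collected[field].extend(flatten_all(batch[field]))

print(collected)
```

Each batch still holds the full branch while it is being processed, but only the selected, flattened fields accumulate across batches, which is what keeps the total memory bounded.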

sophiemiddleton commented 4 months ago

I like Richie's suggestion. @sam-grant, some of your code might overlap with the new util/mu2epyutil. If something is missing from there, please add it, and we should add something like Richie's to that too.

AndrewEdmonds11 commented 4 months ago

Thanks, everyone. I agree with Sophie: if there are little tricks that make working in uproot/awkward array easier, then let's add them to the new utility class.

AndrewEdmonds11 commented 3 months ago

Hi everyone, let's keep this issue open until we have a working interface for it in the python utility. I will rename the issue with the new task. Anyone should feel free to assign themselves to this