Multi-tree support - Githubissues

kreczko commented 2 years ago

Following up on the discussion in the CMS Analysis Tools Task Force Monday Open Meeting

Is your feature request related to a problem? Please describe.
I am trying to process multiple TTree inputs at once using coffea.

Describe the solution you'd like
I've been tracing treeName through the coffea code, and there seems to be no simple way to allow for multiple trees. I would like:

- coffea.processor.run_uproot_job to accept a list of tree names
- the data parameter of coffea.processor.ProcessorABC.process to provide access to these data via df["treename"]["varname"] or df["treename.var"] (I'm not fussed about the actual delimiter)
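
To make the requested access pattern concrete, here is a minimal sketch of a container that supports both df["treename"]["varname"] and df["treename.var"]. The class name MultiTreeData is hypothetical and not part of coffea, the values are placeholders, and splitting on the first "." assumes tree names never contain a dot.

```python
from collections.abc import Mapping


class MultiTreeData(Mapping):
    """Hypothetical container (not a coffea class): tree name -> {variable: values}."""

    def __init__(self, trees):
        self._trees = {name: dict(columns) for name, columns in trees.items()}

    def __getitem__(self, key):
        # A bare tree name returns that tree's columns: df["tree"]["var"].
        if key in self._trees:
            return self._trees[key]
        # Otherwise split on the first "." to support df["tree.var"].
        tree, sep, var = key.partition(".")
        if sep and tree in self._trees:
            return self._trees[tree][var]
        raise KeyError(key)

    def __iter__(self):
        return iter(self._trees)

    def __len__(self):
        return len(self._trees)


# Placeholder values standing in for real branch contents.
df = MultiTreeData({
    "l1CaloTowerTree/L1CaloTowerTree": {"L1CaloTower.et": [1.0, 2.5]},
    "l1CaloTowerEmuTree/L1CaloTowerTree": {"L1CaloTower.et": [1.0, 3.0]},
})
calo_et = df["l1CaloTowerTree/L1CaloTowerTree"]["L1CaloTower.et"]
emu_et = df["l1CaloTowerEmuTree/L1CaloTowerTree.L1CaloTower.et"]
```

Either spelling reaches the same data, so the choice of delimiter could stay an implementation detail.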

Additional context
The CMS L1T analysis (for L1T development) needs to access multiple trees.

Example files (only small samples available publicly):

{
    "2k.root": "https://cernbox.cern.ch/index.php/s/bRvsnkhSl3eCFNl/download",
    "2kmc.root": "https://cernbox.cern.ch/index.php/s/eyAEZH6LQyiySQs/download",
    "2kmu.root": "https://cernbox.cern.ch/index.php/s/FcWgoWCOg6vXOAn/download"
}

Example treenames:

treenames:
      - "l1CaloTowerEmuTree/L1CaloTowerTree"
      - "l1CaloTowerTree/L1CaloTowerTree"

Example variables:

- "l1CaloTowerTree/L1CaloTowerTree.L1CaloTower.et"
- "l1CaloTowerEmuTree/L1CaloTowerTree.L1CaloTower.et"

Note: An uproot file handle can access these variables via <directory>/<tree name>/<object>/<attribute>:

import uproot

# Open the ROOT file and read the same branch from two different trees.
f = uproot.open("2kmu.root")
emu_calo_et = f["l1CaloTowerEmuTree/L1CaloTowerTree/L1CaloTower/et"].array()
calo_et = f["l1CaloTowerTree/L1CaloTowerTree/L1CaloTower/et"].array()
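
The same lookup can be generalised to a list of trees. The sketch below is only an illustration of the path pattern above: the helper names read_branch_from_trees and read_from_file are hypothetical, and the only uproot calls assumed are uproot.open and .array(), as in the snippet above.

```python
def read_branch_from_trees(file_handle, tree_paths, branch):
    """Return {tree path: array} for one branch read from several trees.

    file_handle is anything indexable by "<directory>/<tree name>/<object>/<attribute>"
    whose items have an .array() method, such as the object returned by uproot.open.
    """
    return {tree: file_handle[f"{tree}/{branch}"].array() for tree in tree_paths}


def read_from_file(path, tree_paths, branch):
    # Imported here so the helper above can be used without uproot installed.
    import uproot

    # Arrays are read eagerly inside the with block, before the file is closed.
    with uproot.open(path) as f:
        return read_branch_from_trees(f, tree_paths, branch)
```

With the example file, read_from_file("2kmu.root", ["l1CaloTowerEmuTree/L1CaloTowerTree", "l1CaloTowerTree/L1CaloTowerTree"], "L1CaloTower/et") would return one array per tree, keyed by tree path.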

alexander-held commented 2 years ago

There is also a fairly common workflow in ATLAS which involves dedicated trees for systematic variations. Multi-tree support would come in handy there too.

lgray commented 8 months ago

This is now possible to a limited extent with coffea 2023. We don't have full table joining but you can fill a histogram from two separate dask-awkward sources and/or specify files in the same dataset with different trees.

It is not incredibly efficient yet but it does work now.

lgray commented 8 months ago

Leaving this open for now since no one's properly tried this out at scale.

kreczko commented 8 months ago

> This is now possible to a limited extent with coffea 2023. We don't have full table joining but you can fill a histogram from two separate dask-awkward sources and/or specify files in the same dataset with different trees.
>
> It is not incredibly efficient yet but it does work now.

Is this documented somewhere, or do you have a code snippet?

lgray commented 8 months ago

I'll try to include a few schematic examples in the documentation. Systematic variations in separate trees are pretty straightforward: you now define them in the dataset definition.

fileset = {"dataset": {"files":{"/some/path/to/file.root": "nominal", "/some/path/to/file.root": "variation1", ...}}, ...}

This costs you some additional file opens and means you need more resources for the compute, but otherwise it works. Table joins are not there yet. If you need the CMS event tree plus the run tree, you can do it with uproot.dask + nanoevents.
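
One practical detail when writing such a fileset by hand: a Python dict literal that repeats the same key keeps only the last value, so listing one path twice with different trees collapses to a single entry. The sketch below works around this by assuming one dataset entry per tree. The helper build_fileset and the "<dataset>_<tree>" naming are hypothetical, and whether coffea offers another way to list the same file twice within one dataset is not stated in this thread.

```python
def build_fileset(dataset, paths, tree_names):
    """Return one fileset entry per tree, named "<dataset>_<tree>"."""
    return {
        f"{dataset}_{tree}": {"files": {path: tree for path in paths}}
        for tree in tree_names
    }


# Repeating a key in a dict literal silently keeps only the last value.
collapsed = {
    "/some/path/to/file.root": "nominal",
    "/some/path/to/file.root": "variation1",
}

# One entry per tree keeps both the nominal tree and the variation.
fileset = build_fileset(
    "dataset", ["/some/path/to/file.root"], ["nominal", "variation1"]
)
```

Each entry still follows the {"files": {path: treename}} layout shown above, so the nominal tree and each variation are opened as separate inputs.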

CoffeaTeam / coffea

Multi-tree support #659