diana-hep / oamap

Perform high-speed calculations on columnar data without creating intermediate objects.
BSD 3-Clause "New" or "Revised" License

HZZ example fails #3

Closed vkuznet closed 6 years ago

vkuznet commented 6 years ago

Jim, if I run the following code:

import numpy as np
import uproot
from oamap.schema import List, Record, Primitive

schema = List(
    counts = "nEvents",
    content = Record(
      name = "Event",
      fields = dict(
        met = Record(
          name = "MissingEnergy",
          fields = dict(
            x = Primitive("f4", data="MET_px"),
            y = Primitive("f4", data="MET_py"),
          )
        ),
        electrons = List(
          counts = "NElectron",
          content = Record(
            name = "Electron",
            fields = dict(
              px = Primitive("f4", data="Electron_Px"),
              py = Primitive("f4", data="Electron_Py"),
              pz = Primitive("f4", data="Electron_Pz"),
              energy = Primitive("f4", data="Electron_E"),
              charge = Primitive("i4", data="Electron_Charge"),
              iso = Primitive("f4", data="Electron_Iso")
            )
          )
        ),
        muons = List(
          counts = "NMuon",
          content = Record(
            name = "Muon",
            fields = dict(
              px = Primitive("f4", data="Muon_Px"),
              py = Primitive("f4", data="Muon_Py"),
              pz = Primitive("f4", data="Muon_Pz"),
              energy = Primitive("f4", data="Muon_E"),
              charge = Primitive("i4", data="Muon_Charge"),
              iso = Primitive("f4", data="Muon_Iso")
            )
          )
        )
      )
    )
  )

class DataSource:
    def __init__(self):
        self.ttree = uproot.open("HZZ.root")["events"]
    def __getitem__(self, name):
        if name == "nEvents":
            # ROOT TTrees don't have a number of entries branch; make it on the fly.
            return np.array([self.ttree.numentries])
        else:
            return self.ttree.array(name)

events = schema(DataSource())
for evt in events:
    print("muons", evt.muons)
    mu = evt.muons[0]
    print("muon", mu.charge, mu.px)

I get this error:

Traceback (most recent call last):
  File "test_hzz.py", line 64, in <module>
    print("muon", mu.charge, mu.px)
  File "/Users/vk/Work/Languages/Python/GIT/oamap/oamap/proxy.py", line 258, in __getattr__
    return generator._generate(self._arrays, self._index, self._cache)
  File "/Users/vk/Work/Languages/Python/GIT/oamap/oamap/generator.py", line 156, in _generate
    return self._getarray(arrays, self.data, cache, self.dataidx, self.dtype, self.dims)[index]
  File "/Users/vk/Work/Languages/Python/GIT/oamap/oamap/generator.py", line 70, in _getarray
    array = self._toarray(arrays[name], dtype)
  File "/Users/vk/Work/Languages/Python/GIT/oamap/oamap/generator.py", line 65, in _toarray
    return numpy.array(maybearray, dtype=dtype)
ValueError: setting an array element with a sequence

jpivarski commented 6 years ago

(I wish GitHub would make me watch my own repositories by default! I'm watching this one now.)

I'm going to have to rewrite some of that documentation. Whatever problem this was hasn't been fixed, but now we no longer even have the _getarray and _toarray functions that are in your stack-trace, and counts is no longer a valid List schema attribute. (It's been moved to a source-wrapper, rather than let a growing list of possible encodings complicate the schemas.)
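
For reference, the ValueError at the bottom of that stack trace can be reproduced with NumPy alone. The sketch below assumes the branch data reached numpy.array as one variable-length sequence per event; that is an assumption, since the thread never confirms the root cause, and the ragged list and the try_flat_array helper are hypothetical stand-ins rather than oamap code.

```python
import numpy as np

# Hypothetical stand-in for a jagged branch such as Muon_Px:
# one variable-length sequence of values per event.
ragged = [[1.0, 2.0], [3.0], []]

def try_flat_array(seq, dtype="f4"):
    """Attempt the numpy.array(..., dtype=...) conversion seen in the traceback.

    Returns the array on success, or the ValueError instance on failure.
    """
    try:
        return np.array(seq, dtype=dtype)
    except ValueError as err:
        return err

result = try_flat_array(ragged)
print(type(result).__name__, result)

# A rectangular input converts without complaint.
ok = try_flat_array([[1.0, 2.0], [3.0, 4.0]])
print(ok.shape)
```

A numeric dtype cannot hold rows of different lengths, which is why per-event lists need a separate counts or offsets array in a columnar layout.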

I just finished a Parquet source, which lets the user view Parquet files as objects without any manual setup. I'm going to do the same thing for ROOT files now, which would dramatically simplify this example, at the cost of hiding some of the details. Then I'll return to the documentation, factorizing it into more reasonable bites and fixing the examples to work with a particular release.

If you'd like to try this feature, you could go back to version 0.3.3, when the README was written. It would have worked then because I tested it as I wrote it.

vkuznet commented 6 years ago

Jim, please let me know when the new implementation is ready. I'm interested in exploring OAMap.

jpivarski commented 6 years ago

If you're willing to try something bleeding edge, I just finished the uproot wrapper for oamap.

pip install oamap --user --upgrade

or

git clone https://github.com/diana-hep/oamap.git
cd oamap
python setup.py install --user

and

>>> import uproot
>>> t = uproot.open("../uproot/tests/samples/HZZ.root")["events"]
>>> import oamap.source.root
>>> objects = t.oamap
>>> objects.dataset.show()   # for the ROOT branches interpreted as an object schema
>>> for event in objects:
...    if len(event.NMuon) == 2:
...       m1, m2 = event.NMuon
...       print m1.Muon_Px + m2.Muon_Px, m1.Muon_Py + m2.Muon_Py, m1.Muon_Pz + m2.Muon_Pz
...
-15.1616745, -10.961198, -19.468376
49.8154, 8.07737, 48.133476
98.78025, -99.79196, 738.9426
84.92228, 92.65245, -69.58068
...

One ugly thing about this is that the branch names were not chosen with the intention of rolling them up into objects. In this case, my branches-to-schema algorithm is identifying NMuon as a list of muon objects (not the number of them, the actual list) and muons have attributes Muon_Px, Muon_Py, Muon_Pz rather than just Px, Py, Pz.

This branches-to-schema algorithm is intended to reverse ROOT's splitting algorithm, but this TTree wasn't produced by splitting a C++ object. Hence the weird names.

NanoAOD is the same way: it was manually split using these "heppy" conventions. I need to provide the ability to transform names according to a user-defined function, and provide standard functions for standard sets of conventions, like heppy.
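
A user-defined renaming function of the kind described here might look like the sketch below. It is purely illustrative: heppy_style_names is a hypothetical helper, not part of oamap or uproot, and it assumes only the naming pattern visible in this thread (a counter branch like NMuon and attribute branches like Muon_Px).

```python
def heppy_style_names(counter, branches):
    """Derive a collection name and shortened attribute names.

    Hypothetical helper, not an oamap API. Assumes a counter branch named
    "N<Collection>" and attribute branches named "<Collection>_<attr>".
    """
    if not counter.startswith("N"):
        raise ValueError("expected a counter branch named like 'NMuon'")
    collection = counter[1:]          # "NMuon" -> "Muon"
    prefix = collection + "_"
    renamed = {}
    for branch in branches:
        # Only branches carrying this collection's prefix are renamed.
        if branch.startswith(prefix):
            renamed[branch] = branch[len(prefix):]   # "Muon_Px" -> "Px"
    return collection, renamed

collection, renamed = heppy_style_names(
    "NMuon", ["Muon_Px", "Muon_Py", "Muon_Pz", "Electron_Px"]
)
print(collection, renamed)
```

With a mapping like this, the loop above could read m1.Px + m2.Px on a list named Muon instead of m1.Muon_Px on a list named NMuon.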

But remember: this feature is brand new as of today. It's the last of a suite of adapters:

  • read from ROOT, interpreting branch structure as a schema
  • read from Parquet, translating Parquet's schema into OAMap's
  • read/write to HDF5, allowing the user to specify a schema (added earlier today; effectively adds ROOT-like columnar data to HDF5)
  • read/write to the UNIX db format (what shelve uses, so that OAMap has a zero-install analog of Python's built-in shelve persistence).

All of the above except ROOT have unit tests that you can use to see the API. I'm particularly excited about the HDF5 backend because HDF5 has a lot of good encodings.

vkuznet commented 6 years ago

Jim, I tried this example and it works. I'm quite excited about this approach. Let me explain. Using OAMap with uproot opens up additional possibilities for ML, since we can read ROOT files, create the necessary dataframe attributes using code like in this example, and then pass them to an ML algorithm. My understanding is that this will come at almost no cost. Of course, for serious work we need support for custom C++ classes in order to read the gen/sim/digi/raw data tiers. But it is a good start. Keep me informed about development. Thanks, Valentin.

jpivarski commented 6 years ago

Since nearly all ML algorithms require flat data and our data is not flat, I imagine one use case would be to do that flattening via ad hoc user functions. With Numba installed, these functions should be about as fast as C (or faster, because of the lack of intermediate objects) but more importantly, there would be little hassle in setting up the calculation.
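
As a rough illustration of such a flattening function, the sketch below pads a jagged column to a fixed width so that every event yields the same number of features. It uses plain NumPy on a counts array plus a flat content array; pad_leading, the sample values, and the choice of zero padding are all assumptions made for illustration, not oamap functionality. A loop of this shape is the kind of code Numba could compile, though that is not exercised here.

```python
import numpy as np

def pad_leading(counts, content, width=2, fill=0.0):
    """Build a fixed-width (n_events, width) matrix from jagged data.

    counts[i] is the number of items in event i; content holds all items
    back to back. Events with fewer than `width` items are padded with
    `fill`, and items beyond `width` are dropped.
    """
    out = np.full((len(counts), width), fill, dtype=np.float64)
    start = 0
    for i, n in enumerate(counts):
        k = min(n, width)                      # how many items to keep
        out[i, :k] = content[start:start + k]
        start += n                             # advance to the next event
    return out

# Hypothetical data: three events with 2, 0, and 3 muons.
counts = np.array([2, 0, 3])
muon_px = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

features = pad_leading(counts, muon_px)
print(features)
```

The resulting matrix has one row per event and can be passed to any library that expects flat, rectangular input.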

My main intention was for exploratory data analysis, but this could possibly take some of the pain out of feature engineering, too.