AIDASoft / podio

PODIO
GNU General Public License v3.0
24 stars 59 forks source link

Add pythonizations for collection subscript #570

Closed m-fila closed 4 months ago

m-fila commented 5 months ago

BEGINRELEASENOTES

ENDRELEASENOTES

458 and key4hep/EDM4hep#288 show there is some interest in tuning the cppyy generated bindings. The cppyy has a feature for this called pythonization -> a callback can be registered for all the classes that fulfill some predicate (e.g. derived from podio::CollectionBase).

This PR contains:

The mechanism of loading the pythonizations is inspired by ROOT pythonizations with a change that de order of loading of the pythonizations is defined.

To load the pythonizations in a downstream projects, for instance for project utilizing a "edm4hep" namespace

from podio.pythonizations import load_pythonizations

load_pythonizations("edm4hep")

alternatively a selection of pythonizations can be loaded by importing and applying them one by one

The alternative mechanism I tough about:

tmadlener commented 5 months ago

If I understand correctly this would effectively give an "opt-in" solution for generated EDMs, as the EDM has to explicitly load these pythonizations. For EDM4hep this would not really be a problem, as we have some python glue code already in any case.

For me this looks like a good solution, but we should add a bit of documentation on how to use this in generated EDMs.

m-fila commented 5 months ago

If I understand correctly this would effectively give an "opt-in" solution for generated EDMs, as the EDM has to explicitly load these pythonizations. For EDM4hep this would not really be a problem, as we have some python glue code already in any case.

yes, it's opt-in. load_pythonizations could be added to EDM4hep's __init__.py to load on import

For me this looks like a good solution, but we should add a bit of documentation on how to use this in generated EDMs.

Thanks, I will add it

jmcarcell commented 5 months ago

Note that this will make indexing significantly slower: With this PR and the one in EDM4hep:

In [2]: import edm4hep

In [3]: coll = edm4hep.MCParticleCollection()

In [4]: coll.create()
Out[4]: <cppyy.gbl.edm4hep.MutableMCParticle object at 0x55f984f08120>

In [5]: %timeit coll[0]
3.45 µs ± 72.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

and without this PR:

2.46 µs ± 26.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

It should be said somewhere that the current behaviour is you get an object when indexing out of bounds and then it crashes when you try to do something with that object. But I think the case of indexing out of bounds should be rare since typically you run either a for loop without indexes or if it has indexes then it's bounded by len(coll).

m-fila commented 5 months ago

Note that this will make indexing significantly slower

Thanks for pointing this out, I'll check if this can be improved by mapping __getitem__ to at and raising IndexError

m-fila commented 5 months ago

I tried benchmarking on following function:

def bench():
    try:
        coll[0]
    except:
        pass

coll = MCParticleCollection()
bench() # give cppyy time to load lazily
%timeit bench()

coll.create()
bench() # give cppyy time to load lazily
%timeit bench()
Bound check __getitem__ implementation Average time, no-exception branch Average time, exception branch
1. Default cppyy mapping to operator[] 683 ns -
2. ✔️ Originally proposed in this PR, bound-check in python 1.2 µs 779 ns
3. ✔️ Mapped to at 730 ns 30.8 µs
4. ✔️ at wrapped to raise IndexError 826 ns 32.1 µs

The implementations using at seems to be more performant in the usual case. Would it be acceptable to go with option 3. or 4.?

A C++ callback added to the Collection template could potentially achieve the same as at (option 3.) on the no-exception branch and slightly better than python side bound-check (option 2.) on the exception branch. But then extending with other c++ callbacks would be very ugly

The performance of at based implementations is poor if there is an exception, because by design cppyy tries to evaluate all eligible overloads if there was a throw (in this caseat(size_t) and at(size_t) const ). Since cppyy-1.7.0 it's possible to explicitly select between const and non-const overload, but it's newer than what we have

tmadlener commented 5 months ago

I see there has been quite some discussion (and progress?) here. A few comments / questions from my side in no particular order:

m-fila commented 5 months ago

Did you by any chance check how the times compare to pure c++? Mainly for comparison, I don't think we learn too much from them other than a 30 % increase on the python side is pretty negligible compared to pure c++.

Pure C++ google benchmark between a single element access with operator[] vs at:

----------------------------------------------------------------
Benchmark                      Time             CPU   Iterations
----------------------------------------------------------------
squareBracketsNoError       4.59 ns         4.59 ns    152241304
atNoError                   5.26 ns         5.25 ns    131108856
atError                     1552 ns         1550 ns       452227

where

static void squareBracketsNoError(benchmark::State& state) {
  edm4hep::MCParticleCollection coll;
  coll.create();
  for (auto _ : state) {
    try {
      benchmark::DoNotOptimize(coll[0]);
      benchmark::ClobberMemory();
    } catch (const std::out_of_range&) {
    }
  }
}

static void atNoError(benchmark::State& state) {
  edm4hep::MCParticleCollection coll;
  coll.create();
  for (auto _ : state) {
    try {
      benchmark::DoNotOptimize(coll.at(0));
      benchmark::ClobberMemory();
    } catch (const std::out_of_range&) {
    }
  }
}

static void atError(benchmark::State& state) {
  edm4hep::MCParticleCollection coll;
  coll.create();
  for (auto _ : state) {
    try {
      benchmark::DoNotOptimize(coll.at(1));
      benchmark::ClobberMemory();
    } catch (const std::out_of_range&) {
    }
  }
}

both this and python benchmark are isolated single access, looping over the collection will probably give different results

IIUC all of the presented methods are still possible with python callbacks, right?

Yes, all the alternatives from that table were python callbacks

I think the exceptional case should not be the (main) driver for performance considerations. After all you are dealing with an error at that point and not with something "usual".

In the last commit I changed the implementation proposed in PR to option 4 from the table. It seems to me as a reasonable compromise

tmadlener commented 5 months ago

Can you rebase this onto the latest master to pick up the RNTuple API fixes? That should also make CI pass again.

m-fila commented 4 months ago

Sorry for the delay. Rebased and added documentation. I also split the callback into two functions:

I think previously it was not clear that callback should start with some conditional statement to select what should be modified

tmadlener commented 4 months ago

Looks like all the necessary fixes have landed in upstream root (master and v6.32). IIRC this can be merged now, right?

m-fila commented 4 months ago

Looks like all the necessary fixes have landed in upstream root (master and v6.32). IIRC this can be merged now, right?

Awesome 😎 Yes, it's ready to go

hegner commented 4 months ago

and off we go...