C++ <-> Python conversion performance

benbovy commented 1 year ago

I did a quick benchmark to evaluate the overhead of C++ -> Python conversion:

py::array_t<PyObjectGeography> create(py::array_t<double> xs, py::array_t<double> ys) {
    py::buffer_info xbuf = xs.request(), ybuf = ys.request();

    auto result = py::array_t<PyObjectGeography>(xbuf.size);
    py::buffer_info rbuf = result.request();

    double *xptr = static_cast<double *>(xbuf.ptr);
    double *yptr = static_cast<double *>(ybuf.ptr);
    py::object *rptr = static_cast<py::object *>(rbuf.ptr);

    py::gil_scoped_release();
    for (size_t i = 0; i < xbuf.shape[0]; i++) {
        auto point_ptr = PointFactory::FromLatLonDegrees(xptr[i], yptr[i]);

        // either one or the other of the two code lines below are commented out.
        // "cast" benchmark (calls py::cast)
        rptr[i] = py::cast(std::move(point_ptr));
        // or
        // "no_cast" benchmark (just create an empty python object)
        rptr[i] = py::object()
    }

    return result;

...

m.def("create", &create);

Here are the results:

x = np.random.rand(10_000)
y = np.random.rand(10_000)

# "cast" benchmark
%timeit s2shapely.create(x, y)
# 12.6 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# "no cast" benchmark
%timeit s2shapely.create(x, y)   
# 4.46 ms ± 156 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The cast version is almost 3x slower than the no cast version. That's a lot for "just" wrapping the C++ geography in a Python object. Not sure what's happening (the Geography (sub)classes only implement move semantics, so there shouldn't be any data copy at this level), but there must be something wrong.

benbovy commented 1 year ago

Tracked this down in pybind11's type_caster_generic::cast.

The bottleneck seems that pybind11 keeps a map of all registered instances and first tries to find and return the corresponding instance from the map, creating a new one if not found.

In the example above (more generally in geography creation functions), we know that no instance already exists, we could skip this step.

EDIT:

Other potential optimizations:

In certain geography creation functions, we may know exactly the type of the (derived) geography object to create in Python, so we might bypass pybind11's automatic downcasting (see https://pybind11.readthedocs.io/en/stable/advanced/classes.html#custom-automatic-downcasters)
type_caster_generic::cast handles multiple inheritance via values_and_holders. This is unlikely necessary for Geography classes and we may want to skip creating 1-sized std::vector objects (+ executing other logic) each time a Python geography object is created.
Repeated look-up for Python types in pybind11::detail::get_type_info(), which is not necessary here.

Note also: the benchmark above may be biased as pybind11 probably optimizes the empty py::object() case.

benbovy commented 1 year ago

Now a quick benchmark to evaluate the overhead of Python -> C++ conversion:

int dummy(PyObjectGeography obj) {
    // equivalent to `obj.cast<Geography*>()` where obj is a `py::object`
    obj.as_geog_ptr();
    return 0;
}

int dummy_no_cast(PyObjectGeography obj) {
    return 0;
}

int get_dimensions(PyObjectGeography obj) {
    return obj.as_geog_ptr()->dimension();
}

...

m.def("dummy", py::vectorize(&dummy));
m.def("dummy_no_cast", py::vectorize(&dummy_no_cast));
m.def("get_dimensions", py::vectorize(&get_dimensions));

Results:

x = np.random.rand(100_000)
y = np.random.rand(100_000)

points = s2shapely.create(x, y)

%timeit s2shapely.dummy_no_cast(points)
# 542 µs ± 11.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%timeit s2shapely.dummy(points)
# 8.39 ms ± 29.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit s2shapely.get_dimensions(points)
# 8.93 ms ± 36.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Significant overhead here too.

benbovy commented 1 year ago

After more investigation, it looks like most of the conversion overhead is coming from calls to pybind11::detail::get_type_info() to get the C++ type from a Python type or vice-versa. This helper function looks into pybind11 registered mappings of C++ / Python types and is called in various places like py::isinstance<T>(handle), py::object::cast(), py::cast(), etc.

In our case (at least for the Python -> C++ conversion within a vectorized function operating on geographies), we don't need to perform this lookup for each array element. It is enough (and almost free!) to just extract the pointer to the C++ object from the Python object and cast it to the Geography base class:

auto inst = reinterpret_cast<py::detail::instance *>(py_obj.ptr());
return reinterpret_cast<Geography*>(inst->simple_value_holder[0]);

"Just extract" is not really right, we also want to check first if we can safely cast the Python object to a Geography object (or raise), and check that in an efficient way. Unfortunately, this doesn't seem to be supported in pybind11. Maybe we could wrap pybind11::detail::get_type_info() and cache the results so the lookup is done only once.

This still uses pybind11's internals, though (https://github.com/pybind/pybind11/issues/1572).

benbovy commented 1 year ago

xref

https://github.com/pybind/pybind11/discussions/4375 https://github.com/pybind/pybind11/issues/376

benbovy commented 1 year ago

Nanobind has a low-level instance interface that could make things much easier: https://github.com/wjakob/nanobind/blob/master/docs/lowlevel.md. However, it doesn't support automatic vectorization (yet?).

benbovy / spherely

C++ <-> Python conversion performance #3