Open benbovy opened 1 year ago
Tracked this down in pybind11's type_caster_generic::cast
.
The bottleneck seems that pybind11 keeps a map of all registered instances and first tries to find and return the corresponding instance from the map, creating a new one if not found.
In the example above (more generally in geography creation functions), we know that no instance already exists, we could skip this step.
EDIT:
Other potential optimizations:
type_caster_generic::cast
handles multiple inheritance via values_and_holders
. This is unlikely necessary for Geography classes and we may want to skip creating 1-sized std::vector
objects (+ executing other logic) each time a Python geography object is created.pybind11::detail::get_type_info()
, which is not necessary here.Note also: the benchmark above may be biased as pybind11 probably optimizes the empty py::object()
case.
Now a quick benchmark to evaluate the overhead of Python -> C++ conversion:
int dummy(PyObjectGeography obj) {
// equivalent to `obj.cast<Geography*>()` where obj is a `py::object`
obj.as_geog_ptr();
return 0;
}
int dummy_no_cast(PyObjectGeography obj) {
return 0;
}
int get_dimensions(PyObjectGeography obj) {
return obj.as_geog_ptr()->dimension();
}
...
m.def("dummy", py::vectorize(&dummy));
m.def("dummy_no_cast", py::vectorize(&dummy_no_cast));
m.def("get_dimensions", py::vectorize(&get_dimensions));
Results:
x = np.random.rand(100_000)
y = np.random.rand(100_000)
points = s2shapely.create(x, y)
%timeit s2shapely.dummy_no_cast(points)
# 542 µs ± 11.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit s2shapely.dummy(points)
# 8.39 ms ± 29.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit s2shapely.get_dimensions(points)
# 8.93 ms ± 36.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Significant overhead here too.
After more investigation, it looks like most of the conversion overhead is coming from calls to pybind11::detail::get_type_info()
to get the C++ type from a Python type or vice-versa. This helper function looks into pybind11 registered mappings of C++ / Python types and is called in various places like py::isinstance<T>(handle)
, py::object::cast()
, py::cast()
, etc.
In our case (at least for the Python -> C++ conversion within a vectorized function operating on geographies), we don't need to perform this lookup for each array element. It is enough (and almost free!) to just extract the pointer to the C++ object from the Python object and cast it to the Geography base class:
auto inst = reinterpret_cast<py::detail::instance *>(py_obj.ptr());
return reinterpret_cast<Geography*>(inst->simple_value_holder[0]);
"Just extract" is not really right, we also want to check first if we can safely cast the Python object to a Geography object (or raise), and check that in an efficient way. Unfortunately, this doesn't seem to be supported in pybind11. Maybe we could wrap pybind11::detail::get_type_info()
and cache the results so the lookup is done only once.
This still uses pybind11's internals, though (https://github.com/pybind/pybind11/issues/1572).
Nanobind has a low-level instance interface that could make things much easier: https://github.com/wjakob/nanobind/blob/master/docs/lowlevel.md. However, it doesn't support automatic vectorization (yet?).
I did a quick benchmark to evaluate the overhead of C++ -> Python conversion:
Here are the results:
The cast version is almost 3x slower than the no cast version. That's a lot for "just" wrapping the C++ geography in a Python object. Not sure what's happening (the Geography (sub)classes only implement move semantics, so there shouldn't be any data copy at this level), but there must be something wrong.