Closed: jawsc-alan closed this issue 1 month ago
This is connected to changes in numpy type casting implemented in these two numpy PRs: https://github.com/numpy/numpy/pull/17401 and https://github.com/numpy/numpy/pull/18676.
For hacking around with the Numpy sources the following Dockerfile might be convenient:
where `minipy.cpp` is as given by @jawsc-alan and `repro.py` is either of the Python snippets above. Then, in an interactive container, numpy can be rebuilt with:
$ source /mc3/etc/profile.d/conda.sh
$ cd /numpy/numpy
$ conda activate build2
$ NPY_USE_NEW_CASTINGIMPL=1 python setup.py build_ext -i
After numpy commit ba77419b10, this procedure gives the segfault described here; after e108c76 the `NPY_USE_NEW_CASTINGIMPL` variable is no longer needed to reproduce it, because the new casting became the only option.
Not sure where to take it from here, but I hope this is helpful...
Made some progress on this. First things first, this Dockerfile will provide debug symbols for numpy, boost_python, python, etc:
Here is the segfault:
[root@701ff080ecdc lib]# gdb --args python repro.py
(gdb) run
Starting program: /usr/bin/python repro.py
Program received signal SIGSEGV, Segmentation fault.
PyArray_GetCastingImpl (from=0x7f4cdff00d20 <boost::python::class_metatype_object>, to=to@entry=0x559ccd752120) at numpy/core/src/multiarray/convert_datatype.c:77
77 res = PyDict_GetItemWithError(NPY_DT_SLOTS(from)->castingimpls, (PyObject *)to);
This is an excerpt from the function body:
/**
 * Fetch the casting implementation from one DType to another.
 *
 * @params from
 * @params to
 *
 * @returns A castingimpl (PyArrayDTypeMethod *), None or NULL with an
 *          error set.
 */
NPY_NO_EXPORT PyObject *
PyArray_GetCastingImpl(PyArray_DTypeMeta *from, PyArray_DTypeMeta *to)
{
    PyObject *res;
    if (from == to) {
        res = NPY_DT_SLOTS(from)->within_dtype_castingimpl;
    }
    else {
        res = PyDict_GetItemWithError(NPY_DT_SLOTS(from)->castingimpls, (PyObject *)to);
    }
    if (res != NULL || PyErr_Occurred()) {
        Py_XINCREF(res);
        return res;
    }
In my understanding, `from` and `to` are data types that have been implicitly cast to `PyArray_DTypeMeta` here. The macro `NPY_DT_SLOTS` expands to `#define NPY_DT_SLOTS(dtype) ((NPY_DType_Slots *)(dtype)->dt_slots)`. It looks like the new-style numpy dtypes all have that data member `dt_slots`, but boost_python types don't. So this is the expected behavior of `from` and `to`:
(gdb) p *((NPY_DType_Slots *)(to)->dt_slots)
$16 = {
discover_descr_from_pyobject = 0x7faf0f412c80 <nonparametric_discover_descr_from_pyobject>,
is_known_scalar_type = 0x7faf0f412f60 <python_builtins_are_known_scalar_types>,
default_descr = 0x7faf0f412c90 <nonparametric_default_descr>,
common_dtype = 0x7faf0f4132a0 <default_builtin_common_dtype>,
common_instance = 0x0, setitem = 0x0, getitem = 0x0,
within_dtype_castingimpl = <numpy._ArrayMethod at remote 0x7faf0f1ff030>,
castingimpls = {<numpy._DTypeMeta at remote 0x56020c63ab00>: <numpy._ArrayMethod at etc etc etc
and this is what we get instead when `from` is a boost class:
(gdb) p *((NPY_DType_Slots *)(from)->dt_slots)
Cannot access memory at address 0x0
So it really looks like a numpy bug in the new type casting implementation. I'm not sure exactly what the correct behavior should be; it appears a legacy system has been preserved, so maybe that is still usable for types that lack the `dt_slots` data member. Will reflect on this more, but maybe opening a numpy issue is the right path forward.
This fixes it! Will open a numpy PR.
diff --git a/numpy/core/src/multiarray/convert_datatype.c b/numpy/core/src/multiarray/convert_datatype.c
index 3135d6989..8c8e2c2aa 100644
--- a/numpy/core/src/multiarray/convert_datatype.c
+++ b/numpy/core/src/multiarray/convert_datatype.c
@@ -73,6 +73,9 @@ PyArray_GetCastingImpl(PyArray_DTypeMeta *from, PyArray_DTypeMeta *to)
if (from == to) {
res = NPY_DT_SLOTS(from)->within_dtype_castingimpl;
}
+ else if (NPY_DT_SLOTS(from) == NULL) {
+ res = NULL;
+ }
else {
res = PyDict_GetItemWithError(NPY_DT_SLOTS(from)->castingimpls, (PyObject *)to);
}
If an urgent fix is needed, it's possible to clone the numpy sources, apply this diff, rebuild numpy, and symlink the result inside e.g. `conda_base/lib/python3.8/site-packages/` or equivalent.
cc @rgommers (for awareness)
Maybe floating it here is more helpful: the fix above seems like a bad idea to me; it is a small fix that covers up some of the symptoms of a much larger problem.
I am still struggling to put together a good local reproducer, but the problem comes from here: https://github.com/boostorg/python/blob/f5d14ef15e98838b3cf18692d46c481287ed50d0/src/numpy/dtype.cpp#L161-L162
The second line seems OK (if shady). The first line gets some `PyArray_Descr *` that is "faulty" and probably created by `boost::python`? To understand this more deeply I really need to understand what exactly that object is and how/where it is created... And right now I am not even sure where to look. (Sorry, my C++ is not at a high enough level to understand boost-level code well.)
NB: The `equivalent` check seems a bit shady because `EquivTypes` does not really do what it is being used for here. In practice that is probably not a problem; unless you are interacting with weird user-defined dtypes that do shady things, there should be better ways to do this.
To be fair, it seems likely that whatever that `dt` is there, the only way it is ever used is in `PyArray_EquivTypes`. So adding this type of hack into `PyArray_EquivTypes` (for a few releases) may well be the pragmatic thing to do.
But I would still like to understand what triggers this problem, and confirm that whatever triggers it means this is an invalid `dtype`.
Just to see if I can get this moving here:

- `equivalent` should be completely useless (even the windows-only path seems wrong). It seems to assume that the scalar type is the same as the dtype, which is wrong.
- Calling `PyArray_EquivTypes` with Python types is incorrect API and ABI use (it always returned undefined results, which will have always been False, though; since a type object is much larger than a NumPy dtype, it would not have crashed).
- The fix is to just delete the `equivalent` path (after double-checking that it does nothing on windows as well, maybe). If you want to fix it instead, the windows path does `sizeof(int) == sizeof(long)`. Add the same for `sizeof(long) == sizeof(long long)` and fix the path to check `get_dtype()->typeobj`.
I may add a compat hack to NumPy, but even so, it cannot stick around forever, and it is clearly a `boost::python` bug.
I am proposing a "fix" in NumPy in https://github.com/numpy/numpy/pull/20616. If merged, it should prevent the crash but behave identically to just deleting the whole "`equivalent`" code path. ~~Also~~ But expect that fix to vanish at some point...
Quite simply: this must be fixed in `boost::python`, since `boost::python` is using the API incorrectly (the compiler should warn about it, but I suppose there is a faulty `reinterpret_cast` involved).
@stefanseefeld do you have thoughts here? 🙂
I'm looking into it. Meanwhile, @TallJimbo , do you remember the origin (and rationale) behind this code ?
Thanks. The code makes some sense: NumPy scalars can wrap e.g. C `long` or `long long`, and depending on the platform, some of those will always end up being identical. (Also double and long double, but frankly, whatever.)
Trying to use `EquivTypes` (on the descriptors rather than the types) might work but is probably just a pain to get truly right. My suggestion would be to fix that windows special case (using only the scalar types, no dtypes involved), add one for `long == long long`, and hope that is good enough.
There might be some stranger things that the current code allows (e.g. subclasses, or some probably non-existing user dtypes), but overall that seems like the safe bet (and probably the only safe bet; just because something is equivalent on the dtype level doesn't mean you can read its C data like you currently do).
@seberg nailed the original motivation. It is entirely possible that my use of `EquivTypes` was always faulty, and I thought I was using it correctly just because it made various things work when `sizeof(int) == sizeof(long)` or `sizeof(long) == sizeof(long long)`. Note that if 32-bit architectures are still supported here, you probably do want to special-case the former as well.
Sorry to bring this up again, but this error is appearing again in the numpy 2.0 release candidates, because the fix was removed in this commit.
What can be a way forward? Thanks!
@stefanseefeld @TallJimbo @seberg @rgommers
Would it be super hard to just fix it in boost::python? I don't really like the idea of dragging around a hack in NumPy for another 3 years and 2.0 needs everyone to bump boost versions anyway?!
EDIT: To be clear: this should be easy to fix so the main question may be how hard it is to make a bug-fix release that includes it soon enough to not block projects on.
I cleared some items off my plate so I can try to solve this over the next few weeks. On conda-forge, it should be pretty easy to patch the currently pinned version of the Boost conda package since compatibility with Numpy 2.0 was previously added (https://github.com/conda-forge/boost-feedstock/pull/198). @jakirkham would know more.
I'm guessing the approach would be to change the `equivalent` function to check for non-Numpy types and return False immediately, before calling `PyArray_EquivTypes`?
@bkpoon, the no-op change is to just use `==` instead of the function. That might not be precisely what you want, but it is exactly what you had anyway, I think.
Hi all,
Sorry for the delay, I was traveling for a conference.
I made a pull request to reduce the check to directly comparing the pointers (thanks @seberg). It does seem to work for me. I built a patched version of the Boost conda packages and tested it on my use cases (we were getting segmentation faults for C++ arrays we were exposing to Python via Boost). It does work for us.
Prior to the fix, our Numpy 2 migration on conda-forge fails. https://github.com/conda-forge/cctbx-base-feedstock/pull/93
I adjusted that PR to pull my patched Boost libraries from my channel and the segmentation faults go away. https://github.com/conda-forge/cctbx-base-feedstock/pull/94 https://github.com/bkpoon/boost-feedstock/tree/np2_fix
The patched Boost packages are available here. https://anaconda.org/cctbx-dev
@stefanseefeld can you review the PR? I completely removed the old code path, but we could use the `NPY_ABI_VERSION` check to keep the old code path for Numpy versions less than 2. Just let me know.
Thanks!
Thanks for all the work! (And despite the actual patch looking quite simple, I have been removed from low-level Python & Numpy code for long enough in recent years that I find it increasingly difficult to review such changes.)
Let's see when the next window of opportunity is to merge the `develop` branch into `master`...
I'll prepare the pull requests for `boost-feedstock` so that the changes can be in both the pinned and latest Boost packages. @jakirkham I should have them ready some time today or tomorrow.
@stefanseefeld Feel free to ping me if there are issues. Thanks!
Summary
The following anomaly applies to boost::python operating with NumPy version >= 1.21.0; prior to this version the degradation doesn't occur:
Symptom
If a `BOOST_PYTHON_MODULE` calls `boost::python::numpy::initialize()`, boost::python's parameter type detection feature is partially defeated. This affects any method or function type checking and is particularly noticeable in overloads. If an incorrect parameter type (such as an object) is passed to a method/function that expects an intrinsic int, float, or string, a segfault occurs instead of the helpful boost error throw. In the event of an overload, the overload dispatcher may send a legitimate parameter to the incorrect overload instance. These cases are exercised in Python example 1 and Python example 2 below, respectively.
Relevant Software Versions Information:
For Failure:
(several were tried, this is representative)
For Success (expected behavior):
just change numpy:
C++ sample:
minipy.cpp
Build with standard boost libraries/paths. In this case, for me, conda venv = compile2:
Observations:
Python example 1:
(This is fast to reproduce, so a command line python may be simpler)
Expected output:
Incorrect output:
Segmentation fault: 11
Python example 2 -- boost overloading failure:
Expected output:
Incorrect output:
Observation: It appears from this example that the numpy registration in numpy >= 1.21 defeats boost::python's type matching for overloads. If the overloads are declared in the opposite order, the code works regardless of the numpy version. Given the last-in-first-checked order of the boost::python dispatcher, add(double) is checked first in the code example above. Normally this is fine, but if numpy >= 1.21 registration has occurred, the m.A object is incorrectly considered convertible to double, leading to the segmentation fault.