BlueBrain / libsonata

A Python and C++ interface to the SONATA format
https://libsonata.readthedocs.io/en/stable/
GNU Lesser General Public License v3.0

libsonata wheels do not mix well with non-wheel NEURON #273

Closed olupton closed this 11 months ago

olupton commented 1 year ago

It seems that mixing libsonata from pip install libsonata with a neuron that was built with a standard (i.e. not-manylinux) toolchain leads to problems. See, for example, on BB5, where module load neuron is a Spack-built non-wheel NEURON installation:

set -e
module purge
module load unstable python
if [ ! -d venv ]; then
  python -m venv venv
fi
. ./venv/bin/activate
pip install -U pip
pip install libsonata
python_code="import libsonata; libsonata.SimulationConfig.from_file('simulation_config.json')"
echo "Testing with standalone Python"
python -c "${python_code}"
echo "Testing via NEURON"
module load neuron
# This should segfault
! nrniv -pyexe ${PWD}/venv/bin/python -python -c "${python_code}"
module load gdb
gdb --eval-command=run --eval-command=bt --args nrniv -pyexe ${PWD}/venv/bin/python -python -c "${python_code}"

crashes with a backtrace

#0  0x000000000064eeb8 in ?? ()
#1  0x00007fffe2fd3b8e in std::__detail::_Scanner<char>::_Scanner(char const*, char const*, std::regex_constants::syntax_option_type, std::locale) ()
   from /gpfs/bbp.cscs.ch/home/olupton/libsonata-issue/venv/lib/python3.10/site-packages/libsonata/_libsonata.cpython-310-x86_64-linux-gnu.so
#2  0x00007fffe2fe6330 in std::__detail::_Compiler<std::regex_traits<char> >::_Compiler(char const*, char const*, std::locale const&, std::regex_constants::syntax_option_type) ()
   from /gpfs/bbp.cscs.ch/home/olupton/libsonata-issue/venv/lib/python3.10/site-packages/libsonata/_libsonata.cpython-310-x86_64-linux-gnu.so
#3  0x00007fffe2fe70a6 in ?? () from /gpfs/bbp.cscs.ch/home/olupton/libsonata-issue/venv/lib/python3.10/site-packages/libsonata/_libsonata.cpython-310-x86_64-linux-gnu.so
#4  0x00007fffe2fac0f6 in ?? () from /gpfs/bbp.cscs.ch/home/olupton/libsonata-issue/venv/lib/python3.10/site-packages/libsonata/_libsonata.cpython-310-x86_64-linux-gnu.so
#5  0x00007fffe2fad6cd in ?? () from /gpfs/bbp.cscs.ch/home/olupton/libsonata-issue/venv/lib/python3.10/site-packages/libsonata/_libsonata.cpython-310-x86_64-linux-gnu.so
#6  0x00007fffe2f927fe in ?? () from /gpfs/bbp.cscs.ch/home/olupton/libsonata-issue/venv/lib/python3.10/site-packages/libsonata/_libsonata.cpython-310-x86_64-linux-gnu.so
#7  0x00007fffe2f30b25 in ?? () from /gpfs/bbp.cscs.ch/home/olupton/libsonata-issue/venv/lib/python3.10/site-packages/libsonata/_libsonata.cpython-310-x86_64-linux-gnu.so
#8  0x00007fffed0dd523 in cfunction_call (func=0x7fffe31a6de0, args=<optimized out>, kwargs=<optimized out>) at Objects/methodobject.c:543
#9  0x00007fffed0952e0 in _PyObject_MakeTpCall (tstate=tstate@entry=0x56d060, callable=callable@entry=0x7fffe31a6de0, args=args@entry=0x7fffe320e190, nargs=1, keywords=keywords@entry=0x0)
    at Objects/call.c:215
#10 0x00007fffed03ee84 in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=9223372036854775809, args=0x7fffe320e190, callable=0x7fffe31a6de0, tstate=0x56d060) at ./Include/cpython/abstract.h:112
#11 _PyObject_VectorcallTstate (kwnames=0x0, nargsf=9223372036854775809, args=0x7fffe320e190, callable=0x7fffe31a6de0, tstate=0x56d060) at ./Include/cpython/abstract.h:99
#12 PyObject_Vectorcall (kwnames=0x0, nargsf=9223372036854775809, args=<optimized out>, callable=0x7fffe31a6de0) at ./Include/cpython/abstract.h:123
#13 call_function (kwnames=0x0, oparg=1, pp_stack=<synthetic pointer>, trace_info=0x7fffffffbac0, tstate=0x56d060) at Python/ceval.c:5891
#14 _PyEval_EvalFrameDefault (tstate=<optimized out>, f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:4181
#15 0x00007fffed18a9bd in _PyEval_EvalFrame (throwflag=0, f=0x7fffe320e020, tstate=0x56d060) at ./Include/internal/pycore_ceval.h:46
#16 _PyEval_Vector (kwnames=0x0, argcount=0, args=0x0, locals=0x56d060, con=0x7fffffffbb60, tstate=0x56d060) at Python/ceval.c:5065
#17 PyEval_EvalCode (co=co@entry=0x7fffe315b680, globals=globals@entry=0x7fffe31225c0, locals=locals@entry=0x7fffe31225c0) at Python/ceval.c:1134
#18 0x00007fffed1ced19 in run_eval_code_obj (locals=0x7fffe31225c0, globals=0x7fffe31225c0, co=0x7fffe315b680, tstate=0x56d060) at Python/pythonrun.c:1291
#19 run_mod (mod=mod@entry=0x5ff318, filename=filename@entry=0x7fffe3126030, globals=globals@entry=0x7fffe31225c0, locals=locals@entry=0x7fffe31225c0, flags=flags@entry=0x0,
    arena=arena@entry=0x7fffe30e1b30) at Python/pythonrun.c:1312
#20 0x00007fffed1d0625 in PyRun_StringFlags (str=str@entry=0x7fffffffc6de "import libsonata; libsonata.SimulationConfig.from_file('simulation_config.json')", start=start@entry=257,
    globals=0x7fffe31225c0, locals=0x7fffe31225c0, flags=flags@entry=0x0) at Python/pythonrun.c:1183
#21 0x00007fffed1d068b in PyRun_SimpleStringFlags (command=0x7fffffffc6de "import libsonata; libsonata.SimulationConfig.from_file('simulation_config.json')", flags=0x0) at Python/pythonrun.c:503
#22 0x00007fffed7eef57 in nrnpython_start (b=<optimized out>)

If we pip install neuron or pip install neuron-nightly instead, it doesn't crash. @WeinaJi can provide more details on the original/full example, where the symptom was a std::bad_alloc, but it appears that the underlying issue is the same as in the above example.

We suspect this is the same issue as https://stackoverflow.com/questions/51382355/stdregex-and-dual-abi#comment89745113_51385122; both libsonata and NEURON use std::regex, and in the problematic configuration libsonata is built in a manylinux image with the old pre-C++11 std::string ABI, while NEURON is built with a standard toolchain using the C++11 std::string ABI. We found that building NEURON with the old ABI (-D_GLIBCXX_USE_CXX11_ABI=0) made the issue go away and, as noted above, the issue is also not seen when using a NEURON wheel (which also uses the pre-C++11 ABI thanks to manylinux).

I also note that, taking the reproducer from the stackoverflow link, whether I get a segfault or a std::bad_alloc depends on which way around I link new.o and old.o.

Secondly, if we try to mitigate the above by using both libsonata and neuron from manylinux wheels (or by forcing the old ABI when building NEURON), there is a problem with the HDF5 version. libsonata bundles a version inside its wheel, but this appears to just be a shared library rather than a full installation that one can build against. This means that if one wants to build a .mod file that needs to be linked against HDF5 (e.g. via libsonatareport) then it is necessary to build/link against some other HDF5, but this will then fail at runtime if the "other" HDF5 does not match the bundled version in libsonata.

Needless to say, trying to have two different copies of a library like libhdf5.so in flight at the same time in the same executable does not seem like a good idea. Given that the wheels mangle library names, it seems like it's probably fragile even if the version numbers of hdf5 are the same in both cases. Tangentially related: https://github.com/BlueBrain/libsonata/issues/115.

[I experimented a bit with https://gcc.gnu.org/wiki/Visibility to see whether those features were sufficient to insulate libsonata and NEURON from each other w.r.t. the above issues, but I didn't immediately succeed, and that might well be a misguided idea.]

Clearly the "good" answer is always "use a real package manager", so that there is only one version of each library in the dependency graph, and a single toolchain/ABI is used, but perhaps there is a worthwhile avenue to explore that would patch up the pip world...

mgeplf commented 1 year ago

We have an internal ticket with something similar (NSETM-2167). My diagnosis, at the time, was conflicting c++ implementations of std::regex internals, which I think is what you're saying @olupton. I'm not sure it's a std::string issue, though, as that was (IIRC/IIUC) the flag day change from pre-c++11 to post c++11. I'm seeing std::__cxx11::basic_string in the wheel we're building (from https://files.pythonhosted.org/packages/3f/eb/cd94edf85a47f5bb5ceb89d37a9330b8dd0c80c1dbe891fc6e35cd588d71/libsonata-0.1.20-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl); which is also a manylinux2014 one.

I also note that, taking the reproducer from the stackoverflow link, whether I get a segfault or a std::bad_alloc depends on which way around I link new.o and old.o.

This also comes up on the order neuron is loaded vs libsonata; see NSETM-2167 above.

We found that building NEURON with the old ABI (-D_GLIBCXX_USE_CXX11_ABI=0) made the issue go away and, as noted above, the issue is also not seen when using a NEURON wheel (which also uses the pre-C++11 ABI thanks to manylinux).

Could you describe this more? Perhaps setting _GLIBCXX_USE_CXX11_ABI changes how libstdc++'s std::regex is implemented? Also, are you saying that manylinux (vs manylinux2014?) is doing something different?

I'd really like to have this work more cleanly; however, I'm not sure what a good course of action is: I'd prefer to stick with std::regex rather than a 3rd party library, even though that would sidestep the issue. What are your plans for the NEURON wheel and std::regex?

As for hdf5, I think we compiled it statically at some point, with a newer version, but that would cause problems if someone was loading h5py if my memory serves - however, I could be wrong.

olupton commented 1 year ago

I don't immediately know why we see symbols for the C++11 ABI std::string inside the wheels, but we do also see symbols from the old ABI. I suppose that the _GLIBCXX_USE_CXX11_ABI flag doesn't mandate whether new/old strings are used internally(?), just what is exposed to the outside world, so maybe modern GCC + _GLIBCXX_USE_CXX11_ABI=0 still uses modern strings privately?

I was being a bit sloppy with manylinux vs. manylinux2014 and meant manylinux2014 throughout; older manylinux versions are obsolete and I believe the newer version, manylinux_2_28, (that we do not use) sidesteps this issue by using the newer ABI, at the cost of reduced compatibility. I believe the underlying issue is that manylinux2014 is based on CentOS7, and CentOS7/RHEL7 use the old ABI (https://developers.redhat.com/blog/2020/10/08/migrating-c-and-c-applications-from-red-hat-enterprise-linux-version-7-to-version-8#). It is apparently not possible to switch to the new ABI in CentOS7 (https://bugzilla.redhat.com/show_bug.cgi?id=1546704). To some approximation, I believe that building manually using a modern GCC + -D_GLIBCXX_USE_CXX11_ABI=0 on BB5 is equivalent to what happens inside a modern manylinux2014 build. See https://gcc.gnu.org/onlinedocs/gcc-12.3.0/libstdc++/manual/manual/using_dual_abi.html for background on the dual ABI in libstdc++.

I believe the manylinux2014 builds rely on Red Hat's developer toolset to get modern compilers on a prehistoric base image; the latest manylinux2014 images use GCC10 (https://github.com/pypa/manylinux/blame/afc7a6d48bb3c583029d8ba044b5b7d7edb10c0d/build.sh#L37) and it looks like libsonata uses these -- but this is a moving target within manylinux2014, and the current NEURON wheels are built with GCC9 (NEURON wheels are built from a manually-built and only occasionally updated manylinux2014-derived image; https://github.com/neuronsimulator/nrn/pull/2140 is trying to update to GCC10). (As of version 9, Red Hat were still saying that C++17 is only supported if the same GCC/devtoolset version was used throughout: https://access.redhat.com/documentation/en-us/red_hat_developer_toolset/9/html/user_guide/chap-gcc#sect-GCC-CPP-Compatibility-ABI-CPP, which was surely not respected by all manylinux2014-ish wheels on PyPI from C++17 projects, given that the manylinux2014 image changed major version behind the scenes).

See also: https://github.com/neuronsimulator/nrn/issues/1963 for more discussion of related issues causing problems in NEURON.

My interpretation of this is that the implementation of std::regex in libstdc++ may not be sufficiently pedantic about using __cxx11 and friends that linking code that uses std::regex + _GLIBCXX_USE_CXX11_ABI=0 with code that uses std::regex + _GLIBCXX_USE_CXX11_ABI=1 can work, but I have not carefully checked this hypothesis. There seem to be a few other issues floating around apart from the StackOverflow link from before, e.g. https://github.com/pytorch/pytorch/issues/50779.

re: future plans:

Regarding hdf5, I don't know of a good solution within the pip/manylinux model. If there are multiple pieces of code that use HDF5 that need to be loaded into the same process at the same time (h5py, libsonata, .mod files inside NEURON) then I believe those really need to be using the same version/copy/instance of the hdf5 library to be robust, and that this does not fit with auditwheel's magic. In other words, the h5py and libsonata wheels are not manylinux2014-compliant because they ship external libraries.

mgeplf commented 1 year ago

After some helpful extra explanations from @olupton; I have finally seen the light.

1) manylinux2014 uses an older toolchain, which seems to retain remnants of the old copy-on-write std::string (not that _GLIBCXX_USE_CXX11_ABI is being set explicitly, but this seems to be the implicit behavior). The fact that std::__cxx11::basic_string appears is a red herring.
2) Internally, std::regex seems to depend on this old std::string.

Probably the best we can do is augment the GitHub Actions to also build wheels with manylinux_2_28, which will have only the newer std::string and thus be compatible with newer toolchains. In theory, these wheels will automatically be picked up by pip and installed when the matching C++ support libraries are available.

The BB5 case is special; it only has 2_17, but people use newer toolchains; this is what happened in NSETM-2167, and in @WeinaJi's case. There isn't much we can do about that: pip won't see these newer libraries and will fall back to the /lib64/ ones, and thus won't install the manylinux_2_28 wheels even if they are built. In that case, there is a chance one could install the newer libraries, but then one has to make sure that the newer C++ support libraries from the toolchain are loaded first; this is left as an exercise for the reader.

mgeplf commented 1 year ago

Not sure there's anything else we can do but produce manylinux_2_28 wheels, which we are now doing with the latest release: https://pypi.org/project/libsonata/#files

I'm going to close this now, @olupton if you have any other ideas, let me know and we can reopen. Thanks.

anilbey commented 1 year ago

Hello, in case it is useful: the NEURON wheel also triggers the issue.

I am on an Ubuntu 22.04 with pip installed NEURON and libsonata.

❯ which nrnivmodl
/home/tuncel/.virtualenvs/py311-bluecellulab/bin/nrnivmodl
❯ pip list | grep NEU
NEURON                        8.2.2
❯ pip list | grep libson
libsonata                     0.1.22

Output:

...
> /home/tuncel/.virtualenvs/py311-bluecellulab/lib/python3.11/site-packages/bluepysnap/simulation.py(61)__init__()
-> self._simulation_config_path = Path(config).absolute()
(Pdb) n
> /home/tuncel/.virtualenvs/py311-bluecellulab/lib/python3.11/site-packages/bluepysnap/simulation.py(62)__init__()
-> self._config = SimulationConfig.from_config(config)
(Pdb) n
free(): invalid pointer
Fatal Python error: Aborted

mgeplf commented 1 year ago

Can you check whether pip is resolving to: libsonata-0.1.22-cp311-cp311-manylinux_2_28_x86_64.whl or libsonata-0.1.22-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

The latter will be needed if you're using NEURON 8.2.2, based on: https://pypi.org/project/NEURON/#files

anilbey commented 1 year ago

Ah, ok. I have the former.

PATH_TO_VENV/lib/python3.11/site-packages/libsonata-0.1.22.dist-info ❯ cat WHEEL
Wheel-Version: 1.0
Generator: bdist_wheel (0.40.0)
Root-Is-Purelib: false
Tag: cp311-cp311-manylinux_2_28_x86_64

mgeplf commented 1 year ago

Yeah, the manylinux_2_28 and manylinux_2_17 wheels are built with incompatible C++ ABIs; since you're on a new Ubuntu, pip will try and install the 2_28 wheels, but since none exist for NEURON, you're getting a mismatch.

It's hacky, but it should work to explicitly install the libsonata manylinux_2_17 wheel instead.

mgeplf commented 11 months ago

Not sure we'll come to a better conclusion on this, so I'll close it.