Open anjos opened 5 years ago
I'm suspecting the following patch at the llvm-suite parent package: 0001-If-libc-abi-library-is-given-use-it-to-reexport.patch
This patch may affect cmake-based builds and is outdated following a discussion here: https://reviews.llvm.org/D53797
In particular, the following quote before dismissing this patch is worrying: I don't think we ever want to re-export the current system's libc++abi -- we should always use an explicit list of exported symbols.
Could you either update the patch or instruct me how to rebuild this package to test it?
@mingwandroid Could you please give us some pointers here? As far as I understand, two packages that differ only in build number should be API/ABI compatible, especially for something like libc++abi.
No, that is only true for packages that implement semver, and even then there are times when it's necessary to break the ABI between patch releases (changing some build option can cause this).
I'm too busy to look into this until later, but in this case that patch number does represent an ABI break due to a fix we needed for exceptions. It's not clear what's going on here, but I suspect it's to do with mixing the system libc++ with ours and passing objects between them, which is not supported.
Here is the outcome of a few experiments I conducted:
So, the problem really seems related to the runtime of version 4.0.1-1, since once we deploy version 4.0.1-0, the problems go away, even if our binary is compiled against 4.0.1-1.
Now, I rebuilt the "llvm-suite" from scratch (it only took 7 hours on my laptop...) to remove this patch. I'll call this "version 4.0.1-2". After installing version 4.0.1-2, the problems go away again and everything works as expected. So, the patch in question is really at the center of the issue!
Thanks @anjos,
The patch is redundant relative to the latest llvm/clang master, but we're building llvm/clang 4.0.1 here where it is not redundant. This patch is essential for C++ exceptions to work correctly. Pinging @isuruf.
The most sensible way forward that I can see is for us to update our macOS compilers to a very recent one and add some tests to the compiler packages for both C++ exceptions and this issue.
Can someone make the smallest possible reproducer for it though? That would be super useful.
I'll try to prioritize it as soon as we get such a reproducer. It's still not 100% clear to me, despite the evidence presented, that this isn't a bug in the code in question (though I admit that is less likely).
The libc++ team make guarantees about ABI compatibility that they appear not to be reaching :-(
Yeah, this looks like a problem of mixing libc++ libraries. Can you do export DYLD_PRINT_LIBRARIES="1" and rerun your script to figure out which libc++abi.dylib and libc++.dylib are loaded?
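As a rough sketch of how that output can be narrowed down, the helper below reads a dyld log on stdin and prints each distinct libc++ or libc++abi image it mentions. The function name libcxx_images is made up for illustration, and the path pattern is an assumption based on "dyld: loaded: <path>" lines; other macOS releases may format the log differently.

```shell
# Hypothetical helper (not part of any tool): list the distinct libc++ /
# libc++abi images named in a DYLD_PRINT_LIBRARIES log read from stdin.
# Assumes each log line contains the absolute path of the loaded image.
libcxx_images() {
  grep -oE '/[^ ]*libc\+\+[^ /]*\.dylib' | sort -u
}

# Example use on macOS (commented out so the block has no side effects):
#   DYLD_PRINT_LIBRARIES=1 python test.py 2>&1 | libcxx_images
```

More than one directory in the result means that two copies of the C++ runtime are present in the same process, which is the situation being asked about here.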
Here is output of the program when that variable is set.
A few notes:
Is this possibly an error in our own build instructions? Not sure how to make setuptools link against the conda version of libc++ explicitly.
@anjos, try doing
export LDFLAGS="-L<path_to_conda_env>/lib -Wl,-rpath,<path_to_conda_env>/lib"
so that setuptools links against libc++ inside the conda env.
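A minimal sketch of that suggestion, assuming the environment is activated so that conda has set CONDA_PREFIX. The helper name conda_ldflags is hypothetical; it only assembles the two flags quoted above for a given prefix.

```shell
# Hypothetical helper: build the linker flags suggested above for a given
# environment prefix (search path plus a matching rpath entry).
conda_ldflags() {
  printf '%s' "-L$1/lib -Wl,-rpath,$1/lib"
}

# Only export when an environment is active; CONDA_PREFIX is set by
# "conda activate".
if [ -n "${CONDA_PREFIX:-}" ]; then
  LDFLAGS="$(conda_ldflags "$CONDA_PREFIX")"
  export LDFLAGS
fi
```

Note that this overwrites any LDFLAGS already set by compiler activation scripts, so it is best treated as a diagnostic step rather than a permanent setting.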
@isuruf Isn't this flag automatically exported when you activate the compilers?
In general, linking to /usr/lib/libc++.dylib means that the system compilers got used instead of ours. Typically that happens when you neglect to pass --host=${HOST} to configure. (HOST is set by the compiler activation scripts. I wish I'd picked a less common name though, CONDA_HOST for example, so if we need to change this at some point I apologise in advance.)
As far as I can understand from the discussions in this issue, it looks like there are two problems:
Am I right?
The libc++ 4.0.1-1 package is not ABI compatible with the system one. It's probably not ABI compatible with the older conda package either. Maybe we can remove it from the channel index until we have more information?
It's the system compilers you need to stop using here! We have no evidence to suggest our packages are not abi compatible between build numbers, but that's irrelevant, since the deps are exactly the same so the solver will always pick the newer build number.
edit: ignore this comment!
You need to run otool -l on all the packages involved here and find those that link to /usr/lib/libc++.dylib and fix that.
edit: .. and this one.
Looking at the txt file, although it is loading two sets of libc++ dylibs, I'm not sure that's the issue here.
The system one gets loaded from Python through the fact that some of Apple's system libs are written in C++ (in fact the actual dynamic loader is written in C++, but it must be statically linked), so I'm thinking conflicting libc++'s is a red herring now.
So my latest crazy theory is that the libc++ ABI 'leaks' into the headers.
Does anyone have a stacktrace for the segfault? Can you try to recompile the package in which the crashing function lives and also the package of the caller of that function?
I scanned the whole setup.
I found only a single dylib linking to /usr/lib/libc++.1.dylib:
/Users/andre/conda/envs/bug/lib/libiomp5_db.dylib
/usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 120.1.0)
This is also present in the load.txt file I submitted before, so somehow, our runtime is loading it.
I'll run more tests, but it may be related to this library.
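For reference, a scan of this kind could look roughly like the sketch below. Both function names are invented for illustration; links_system_libcxx only inspects text in the format printed by otool -L, and scan_prefix assumes macOS with otool on the PATH.

```shell
# Hypothetical helper: succeeds when "otool -L" output on stdin lists the
# system libc++ or libc++abi as a dependency.
links_system_libcxx() {
  grep -qE '^[[:space:]]*/usr/lib/libc\+\+'
}

# Hypothetical driver (macOS only): print every .dylib or .so under the
# given prefix that links against the system C++ runtime.
scan_prefix() {
  find "$1" -type f \( -name '*.dylib' -o -name '*.so' \) |
    while IFS= read -r lib; do
      if otool -L "$lib" 2>/dev/null | links_system_libcxx; then
        printf '%s\n' "$lib"
      fi
    done
}

# Example (commented out): scan_prefix /Users/andre/conda/envs/bug
```

The check looks only at direct dependencies, so a library can still pull in /usr/lib/libc++.1.dylib indirectly through something else it links to.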
Very interesting ...
Any idea what package that comes from? Can you run find /Users/andre/conda/pkgs -name libiomp5_db.dylib?
It's from MKL, can you try updating to 2019.1?
Ran the mkl=2019.1 update, here is the list of packages updated:
The following packages will be REMOVED:
bob-devel: 2018.12.11-py36_0 https://www.idiap.ch/software/bob/conda
The following packages will be UPDATED:
mkl: 2018.0.3-1 defaults --> 2019.1-144 defaults
mkl_fft: 1.0.6-py36hb8a8100_0 defaults --> 1.0.6-py36h27c97d8_0 defaults
mkl_random: 1.0.1-py36h5d10147_1 defaults --> 1.0.2-py36h27c97d8_0 defaults
numpy: 1.15.1-py36h6a91979_0 defaults --> 1.15.4-py36hacdab7b_0 defaults
numpy-base: 1.15.1-py36h8a80b8c_0 defaults --> 1.15.4-py36h6575580_0 defaults
scipy: 1.1.0-py36h28f7352_1 defaults --> 1.1.0-py36h1410ff5_2 defaults
The problem persists, and that library is still linked to /usr/lib/libc++.1.dylib. It is the only one on the whole stack. Looking closely, I realise the library does not come from the mkl package, but rather from intel-openmp-2019.1-144, which is installed anyway with both versions 2018 and 2019 of mkl.
I changed that dylib file manually for a quick test using the following command:
$ install_name_tool -change /usr/lib/libc++.1.dylib "@rpath/libc++.1.dylib" /Users/andre/conda/envs/bug/lib/libiomp5_db.dylib
otool -L on that library now shows me:
$ otool -L ~/conda/pkgs/intel-openmp-2019.1-144/lib/libiomp5_db.dylib
/Users/andre/conda/pkgs/intel-openmp-2019.1-144/lib/libiomp5_db.dylib:
@rpath/libiomp5_db.dylib (compatibility version 0.0.0, current version 0.0.0)
@rpath/libc++.1.dylib (compatibility version 1.0.0, current version 120.1.0)
/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1225.1.1)
Scanning all libraries, I no longer see anything linked specifically against /usr/lib/libc++.1.dylib. Re-running the example I have at hand still gives me the crash, though.
If I check my own libraries (the ones within the package itself), I see they don't link against the system's libc++.1.dylib:
$ otool -L bob/learn/boosting/*.so
bob/learn/boosting/_library.cpython-36m-darwin.so:
@rpath/libbob_learn_boosting.dylib (compatibility version 0.0.0, current version 0.0.0)
@rpath/libboost_system.dylib (compatibility version 0.0.0, current version 0.0.0)
@rpath/libc++.1.dylib (compatibility version 1.0.0, current version 1.0.0)
/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1197.1.1)
bob/learn/boosting/version.cpython-36m-darwin.so:
@rpath/libc++.1.dylib (compatibility version 1.0.0, current version 1.0.0)
/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1197.1.1)
$ otool -L bob/learn/boosting/*.dylib
bob/learn/boosting/libbob_learn_boosting.dylib:
@rpath/libbob_learn_boosting.dylib (compatibility version 0.0.0, current version 0.0.0)
@rpath/libbob_io_base.dylib (compatibility version 0.0.0, current version 0.0.0)
@rpath/libboost_system.dylib (compatibility version 0.0.0, current version 0.0.0)
@rpath/libc++.1.dylib (compatibility version 1.0.0, current version 1.0.0)
/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1252.50.4)
So, it looks sane to me w.r.t. linking. Nevertheless:
$ DYLD_PRINT_LIBRARIES="1" ./bin/python test.py 2>&1 | grep c++
dyld: loaded: /usr/lib/libc++abi.dylib
dyld: loaded: /usr/lib/libc++.1.dylib
dyld: loaded: /usr/lib/libc++abi.dylib
dyld: loaded: /usr/lib/libc++.1.dylib
# these come from our conda-build as you can see above
dyld: loaded: /Users/andre/conda/envs/bug/lib/libc++.1.dylib
dyld: loaded: /Users/andre/conda/envs/bug/lib/libc++abi.1.dylib
Now, just running python itself, from the environment:
$ DYLD_PRINT_LIBRARIES="1" python -c 'exit()' 2>&1 | grep c++
dyld: loaded: /usr/lib/libc++abi.dylib
dyld: loaded: /usr/lib/libc++.1.dylib
So, this is not related to our code at all, and Python from the defaults channel seems to be loading the C++ libraries from the system.
Here is a minimal example to test the system library loading from scratch:
$ conda create -n pytest python=3
$ conda activate pytest
(pytest) $ DYLD_PRINT_LIBRARIES="1" python -c 'exit()' 2>&1 | grep c++
dyld: loaded: /usr/lib/libc++abi.dylib
dyld: loaded: /usr/lib/libc++.1.dylib
Here is the reasoning why that happens, starting from /usr/lib/libSystem.B.dylib:
/usr/lib/libSystem.B.dylib links against /usr/lib/system/libxpc.dylib
/usr/lib/system/libxpc.dylib links against /usr/lib/libobjc.A.dylib
/usr/lib/libobjc.A.dylib links against /usr/lib/libc++abi.dylib
Not sure this is wrong per se. Comments welcome.
Edit: now re-reading the stack @mingwandroid has already commented on this, so please ignore the request for comments.
So, at least in macOS 10.13, my current understanding is that anything linking against /usr/lib/libSystem.B.dylib will end up with /usr/lib/libc++abi.dylib on their linkage list.
Edit: ignore this as well.
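The libSystem to libxpc to libobjc to libc++abi chain described above can be traced with a small recursive walk over otool -L output. This is only a sketch: the names deps_from_otool and walk_deps are invented here, the walk follows absolute paths only (it does not resolve @rpath entries), and it assumes the system libraries exist as files on disk, as they did on macOS 10.13. Newer macOS releases keep system libraries in the dyld shared cache, where otool may not find them.

```shell
# Hypothetical helper: extract dependency install names from "otool -L"
# output on stdin. The first line (file name followed by ':') carries no
# "(compatibility version ...)" suffix and is therefore skipped.
deps_from_otool() {
  sed -n 's/^[[:space:]]*\(.*\) (compatibility version .*$/\1/p'
}

# Hypothetical driver (macOS only): print "parent -> child" edges, following
# absolute paths only, down to a limited depth (default 3).
walk_deps() {
  lib=$1
  depth=${2:-3}
  [ "$depth" -gt 0 ] || return 0
  otool -L "$lib" 2>/dev/null | deps_from_otool | grep '^/' |
    while IFS= read -r dep; do
      # A dylib lists its own install name first; skip that self-reference.
      [ "$dep" = "$lib" ] && continue
      printf '%s -> %s\n' "$lib" "$dep"
      # Recurse in a subshell so lib/depth of this level are not clobbered.
      ( walk_deps "$dep" $((depth - 1)) )
    done
}

# Example (commented out): walk_deps /usr/lib/libSystem.B.dylib 3 | grep 'c++'
```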
More information: I created a macOS 10.9 machine and compiled my software there, from scratch. The problem persists, as well as all indicators as defined above. So, we can exclude cross-compilation issues.
Regarding the suggestion above to export LDFLAGS="-L<path_to_conda_env>/lib -Wl,-rpath,<path_to_conda_env>/lib" so that setuptools links against libc++ inside the conda env: I double-checked our setup and this is exactly what is executed. The compilation line for the setuptools-built bindings looks like this:
x86_64-apple-darwin13.4.0-clang++ -bundle -undefined dynamic_lookup -isysroot /opt/MacOSX10.9.sdk -Wl,-pie -Wl,-headerpad_max_install_names -Wl,-rpath,/Users/gitlab/conda/envs/bug/lib -L/Users/gitlab/conda/envs/bug/lib -isysroot /opt/MacOSX10.9.sdk -Wl,-pie -Wl,-headerpad_max_install_names -Wl,-rpath,/Users/gitlab/conda/envs/bug/lib -L/Users/gitlab/conda/envs/bug/lib -Wl,-export_dynamic -Wl,-pie -Wl,-headerpad_max_install_names -Wl,-dead_strip_dylibs -march=core2 -mtune=haswell -mssse3 -ftree-vectorize -fPIC -fPIE -fstack-protector-strong -O2 -pipe -O0 -g -DBOB_DEBUG -D_FORTIFY_SOURCE=2 -mmacosx-version-min=10.9 -arch x86_64 build/temp.macosx-10.9-x86_64-3.6/bob/learn/boosting/main.o build/temp.macosx-10.9-x86_64-3.6/bob/learn/boosting/loss_function.o build/temp.macosx-10.9-x86_64-3.6/bob/learn/boosting/jesorsky_loss.o build/temp.macosx-10.9-x86_64-3.6/bob/learn/boosting/weak_machine.o build/temp.macosx-10.9-x86_64-3.6/bob/learn/boosting/stump_machine.o build/temp.macosx-10.9-x86_64-3.6/bob/learn/boosting/lut_machine.o build/temp.macosx-10.9-x86_64-3.6/bob/learn/boosting/boosted_machine.o build/temp.macosx-10.9-x86_64-3.6/bob/learn/boosting/lut_trainer.o -L/Users/gitlab/bob.learn.boosting/build/lib.macosx-10.9-x86_64-3.6/bob/learn/boosting -L/Users/gitlab/conda/envs/bug/lib -L/Users/gitlab/bob.learn.boosting/src/bob.core/bob/core -L/Users/gitlab/bob.learn.boosting/src/bob.io.base/bob/io/base -lbob_learn_boosting -lbob_core -lbob_io_base -lboost_system -lblitz -o build/lib.macosx-10.9-x86_64-3.6/bob/learn/boosting/_library.cpython-36m-darwin.so
otool -L on it shows me everything looks great:
otool -L bob/learn/boosting/*.so
bob/learn/boosting/_library.cpython-36m-darwin.so:
@rpath/libbob_learn_boosting.dylib (compatibility version 0.0.0, current version 0.0.0)
@rpath/libboost_system.dylib (compatibility version 0.0.0, current version 0.0.0)
@rpath/libc++.1.dylib (compatibility version 1.0.0, current version 1.0.0)
/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1197.1.1)
bob/learn/boosting/version.cpython-36m-darwin.so:
@rpath/libc++.1.dylib (compatibility version 1.0.0, current version 1.0.0)
/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1197.1.1)
Check full (recent) log here: https://gitlab.idiap.ch/bob/bob.learn.boosting/-/jobs/152778
Instead, here is something that seems to "fix it":
$ install_name_tool -change @rpath/libc++.1.dylib /usr/lib/libc++.1.dylib bob/learn/boosting/_library.cpython-36m-darwin.so
$ ./bin/python test.py
1.0 #does not crash!
So, it is really an ABI incompatibility between my generated bindings and the linked libraries.
@mingwandroid: I'd be happy to test updated macOS compilers in my setup. Minimally, we'd only need a build that either excludes or updates the patch named in my initial report (0001-If-libc-abi-library-is-given-use-it-to-reexport.patch), as I suggested there.
I continued tests by tweaking compilation flags, but nothing seems to fix this. The more I look at it, the more it looks like a binary issue with the pointed out library.
@mingwandroid: I'm not sure how to provide you the smallest possible reproducer. I tried to explain how to reproduce the problem on the original report.
@isuruf, @mingwandroid: I'm afraid this is breaking our whole software stack and I'm out of ideas on where to look further. Could you please consider rebuilding libcxx/abi with the improved patch as suggested above?
@anjos, can you try libcxx-8.0.0 and libcxxabi-8.0.0 from conda-forge channel?
Using a new version of the ABI library implies uninstalling clang=4.0.1, which makes it hard to recompile the package, so my testing may be biased.
The following changes were applied to my software stack:
The following packages will be REMOVED:
cctools-895-1
clang-4.0.1-1
clang_osx-64-4.0.1-h1ce6c1d_11
clangxx-4.0.1-1
clangxx_osx-64-4.0.1-h22b1bf0_11
ld64-274.2-1
llvm-lto-tapi-4.0.1-1
The following packages will be UPDATED:
ca-certificates pkgs/main::ca-certificates-2019.1.23-0 --> conda-forge::ca-certificates-2019.3.9-hecc5488_0
libcxx pkgs/main::libcxx-4.0.1-hcfea43d_1 --> conda-forge::libcxx-8.0.0-2
libcxxabi pkgs/main::libcxxabi-4.0.1-hcfea43d_1 --> conda-forge::libcxxabi-8.0.0-2
openssl pkgs/main::openssl-1.1.1b-h1de35cc_1 --> conda-forge::openssl-1.1.1b-h01d97ff_2
The following packages will be SUPERSEDED by a higher-priority channel:
certifi pkgs/main --> conda-forge
llvm pkgs/main::llvm-4.0.1-1 --> pkgs/free::llvm-3.3-0
A simple run after the package is upgraded (but using the previously compiled code) still produces the crash. So I cannot vouch for the new library - but again, I could not recompile the code from scratch.
You need to recompile though. Can you create a new environment and force install the 2 packages without deps?
OK, using --no-deps gets me there:
## Package Plan ##
environment location: /Users/andre/conda/envs/learn-dev
added / updated specs:
- libcxx[version='>=8']
- libcxxabi[version='>=8']
The following packages will be UPDATED:
libcxx pkgs/main::libcxx-4.0.1-hcfea43d_1 --> conda-forge::libcxx-8.0.0-2
libcxxabi pkgs/main::libcxxabi-4.0.1-hcfea43d_1 --> conda-forge::libcxxabi-8.0.0-2
I recompiled the package with the stack above, but the segmentation fault still occurs.
Otool output is still the same as in https://github.com/ContinuumIO/anaconda-issues/issues/10423#issuecomment-448581732 ?
$ otool -L bob/learn/boosting/*.so
bob/learn/boosting/_library.cpython-36m-darwin.so:
@rpath/libbob_learn_boosting.dylib (compatibility version 0.0.0, current version 0.0.0)
@rpath/libboost_system.dylib (compatibility version 0.0.0, current version 0.0.0)
@rpath/libc++.1.dylib (compatibility version 1.0.0, current version 1.0.0)
/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1197.1.1)
bob/learn/boosting/version.cpython-36m-darwin.so:
@rpath/libc++.1.dylib (compatibility version 1.0.0, current version 1.0.0)
/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1197.1.1)
So - yes. Exactly the same.
On conda-forge, we decided not to ship libc++abi and only ship libc++, which should fix this issue.
We build packages against the defaults channel - not conda-forge's. Could you be more descriptive of the fix?
defaults::libcxx's libc++.dylib links with libc++abi.dylib from the defaults::libcxxabi package, but conda-forge::libcxx's libc++.dylib links with /usr/lib/libc++abi.dylib from macOS.
So, are you proposing we stick with conda-forge instead of defaults? Or suggesting that defaults will adopt conda-forge's linking strategy in a next release of libcxx?
Can you check that it works with conda-forge? defaults will probably adopt the same strategy, but I have no say in that.
@isuruf: I can confirm that using the following list of packages from conda-forge makes my environment work, even without a recompilation:
ca-certificates 2019.9.11 hecc5488_0 conda-forge
cctools 921 h5ba7a2e_4 conda-forge
certifi 2019.6.16 py36_1 conda-forge
clang 9.0.0 h28b9765_1 conda-forge
clang_osx-64 9.0.0 h22b1bf0_3 conda-forge
clangxx 9.0.0 1 conda-forge
clangxx_osx-64 9.0.0 h22b1bf0_3 conda-forge
compiler-rt 9.0.0 hce3ea14_0 conda-forge
ld64 409.12 h3c32e8a_4 conda-forge
libcxx 9.0.0 h89e68fa_1 conda-forge
libllvm9 9.0.0 h770b8ee_2 conda-forge
llvm 9.0.0 2 conda-forge
llvm-lto-tapi 4.0.1 1 conda-forge
openssl 1.1.1c h01d97ff_0 conda-forge
tapi 1000.10.8 h770b8ee_3 conda-forge
These packages were installed once I did conda install -c conda-forge libcxx=9. (Note: libcxxabi==4.0.1 from defaults was lingering after the install command above; I had to remove it manually.)
It would be nice to see (some of) these on defaults.
@mingwandroid: is there an ETA for an update of libcxx on defaults?
When it's ready, but I think in the meantime we could remove this lib and make a new release of libcxx. Pinging @msarahan and @jjhelmus.
Actual Behavior
One of our packages (https://gitlab.idiap.ch/bob/bob.learn.boosting) contains C++ code bound to Python via its own APIs - we don't use boost::python or anything like it. The C++ code is compiled using CMake, while the Python bindings are compiled using the normal setuptools/distutils framework. The builds are completely integrated within a call to setup.py install, which is called via conda-build. We do builds for Linux and MacOS routinely, but this problem only shows on MacOS.
Recently, we started to observe segmentation faults in this library, without any change in the code. After careful inspection, we realized that the problem was inside a std::vector<> that was created within the C++ code (compiled by CMake), and then manipulated via code compiled via setuptools. From experience, this type of problem occurs when the ABI is changed between libraries communicating complex objects (such as std::vector). We believe there is something strange going on with the latest version of libcxxabi and friends (build 1). After downgrading libcxxabi to build 0, the problem stops occurring.
Thread on our gitlab: https://gitlab.idiap.ch/bob/bob.learn.boosting/issues/2
Expected Behavior
Compiled code via setuptools or cmake should be ABI compatible and the exchange of C++ objects possible between such binaries.
Steps to Reproduce
Reproducing this problem requires you to experiment with both cmake and setuptools based compilations, which is not trivial, so it is difficult to provide a small, self-contained example. Here is how to compile and reproduce the problem with the original package that shows the issue (on a MacOS machine - ours is a 10.13 system, with a 10.9 SDK installed in /opt/MacOSX10.9.sdk):
Anaconda or Miniconda version: 4.5.11
Operating System: MacOS 10.13.6
conda info
conda list --show-channel-urls