conan-io / conan-center-index

Recipes for the ConanCenter repository
https://conan.io/center
MIT License

[question] Shared building issues when linking on Linux (due to runpath/LD_LIBRARY_PATH issue)? #23421

Open irieger opened 5 months ago

irieger commented 5 months ago

What is your question?

Hey,

Some time ago I ran into a problem building several shared libraries of larger projects that build more than one artifact. After hours of debugging, I think I have condensed it down to a core issue that I'm trying to understand.

So the question is basically: How do shared linking and LD_LIBRARY_PATH / CMAKE_LIBRARY_PATH work?

To describe the problem on a concrete example, I can reference this PR: https://github.com/conan-io/conan-center-index/pull/23112

Here is a brief example CMakeLists.txt excerpt that I think roughly describes the problem:

...
find_package(minizip-ng)
...

add_library(a ...)
target_link_libraries(a PRIVATE minizip-ng)

add_executable(exe ...)
target_link_libraries(exe PRIVATE a)

Running conan create ... with this CMake results in a problem when linking target exe. Here is the corresponding output:

/usr/bin/ld: /root/.conan2/p/b/miniz3c27d6cd524ad/p/lib/libminizip-ng.so.4: undefined reference to `libiconv_open'
/usr/bin/ld: /root/.conan2/p/b/miniz3c27d6cd524ad/p/lib/libminizip-ng.so.4: undefined reference to `libiconv_close'
/usr/bin/ld: /root/.conan2/p/b/miniz3c27d6cd524ad/p/lib/libminizip-ng.so.4: undefined reference to `libiconv'

Doing some analysis with ldd, readelf -d etc., I found that liba.so (in the real case libOpenColorIO.so.2.3.2) finds all its dependencies except libiconv.so.2. Output of ldd:

        linux-vdso.so.1 (0x00007fffb1bde000)
        libexpat.so.1 => /root/.conan2/p/b/expatb2443c2409e5b/p/lib/libexpat.so.1 (0x0000772c128db000)
        libImath-3_1.so.29 => /root/.conan2/p/b/imath93646a610b69b/p/lib/libImath-3_1.so.29 (0x0000772c12888000)
        libpystring.so => /root/.conan2/p/b/pystread4a2c558974/p/lib/libpystring.so (0x0000772c12873000)
        libyaml-cpp.so.0.8 => /root/.conan2/p/b/yaml-8766cd17dd26d/p/lib/libyaml-cpp.so.0.8 (0x0000772c127ee000)
        libminizip-ng.so.4 => /root/.conan2/p/b/miniz3c27d6cd524ad/p/lib/libminizip-ng.so.4 (0x0000772c127ca000)
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x0000772c1259b000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x0000772c124b4000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x0000772c12494000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x0000772c1226b000)
        liblzma.so.5 => /lib/x86_64-linux-gnu/liblzma.so.5 (0x0000772c1223e000)
        libzstd.so.1 => /lib/x86_64-linux-gnu/libzstd.so.1 (0x0000772c1216f000)
        libcrypto.so.3 => /lib/x86_64-linux-gnu/libcrypto.so.3 (0x0000772c11d2b000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x0000772c11d0f000)
        libbz2.so.1 => /lib/x86_64-linux-gnu/libbz2.so.1 (0x0000772c11cfc000)
        /lib64/ld-linux-x86-64.so.2 (0x0000772c12f03000)
        libiconv.so.2 => not found

# readelf -d src/OpenColorIO/libOpenColorIO.so

Dynamic section at offset 0x5e95e0 contains 34 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libexpat.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libImath-3_1.so.29]
 0x0000000000000001 (NEEDED)             Shared library: [libpystring.so]
 0x0000000000000001 (NEEDED)             Shared library: [libyaml-cpp.so.0.8]
 0x0000000000000001 (NEEDED)             Shared library: [libminizip-ng.so.4]
 0x0000000000000001 (NEEDED)             Shared library: [libstdc++.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libgcc_s.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x000000000000000e (SONAME)             Library soname: [libOpenColorIO.so.2.3]
 0x000000000000001d (RUNPATH)            Library runpath: [/root/.conan2/p/b/expatb2443c2409e5b/p/lib:/root/.conan2/p/b/imath93646a610b69b/p/lib:/root/.conan2/p/b/pystread4a2c558974/p/lib:/root/.conan2/p/b/yaml-8766cd17dd26d/p/lib:/root/.conan2/p/b/miniz3c27d6cd524ad/p/lib:/root/.conan2/p/b/zlibb9a035bf5348c/p/lib:/root/.conan2/p/b/bzip2d0969c6a4a573/p/lib:/root/.conan2/p/b/xz_utb63605fe5c051/p/lib:/root/.conan2/p/b/zstd4a9f6ad3dde0f/p/lib:/root/.conan2/p/b/opens939255f8546e8/p/lib:/root/.conan2/p/b/libic16d33c41345b9/p/lib:]

So the RUNPATH contains the directory for iconv (/root/.conan2/p/b/libic16d33c41345b9/p/lib). But as it isn't a direct dependency, that obviously doesn't help. And /root/.conan2/p/b/miniz3c27d6cd524ad/p/lib/libminizip-ng.so.4 - which is from a package folder - has the runpath stripped.

Now when we link exe against liba.so, liba.so can find libminizip-ng.so.4 thanks to the runpath, but libminizip-ng.so.4 can't resolve its references to libiconv.so.

How can this be solved properly? I can get it to build in a hacky way by making either one of the following two changes to the CMakeLists.txt:

  1. Change target_link_libraries(a PRIVATE minizip-ng) to target_link_libraries(a PUBLIC minizip-ng)
  2. Change target_link_libraries(exe PRIVATE a) to target_link_libraries(exe PRIVATE a minizip-ng)

Either change results in the package building correctly.

I made similar changes to OpenColorIO & OpenImageIO - both of which I consume - in a branch, and ran my local build with this branch instead of CCI master. With this change both projects build correctly, which they didn't before. But I think that is not the actual solution, is it?

The conan_toolchain.cmake (for conan create recipes/opencolorio/all --version..) would look like this, so I'd assume that the search path is handed to the linker, but it isn't:

...
list(PREPEND CMAKE_LIBRARY_PATH "/root/.conan2/p/b/expatb2443c2409e5b/p/lib" "/root/.conan2/p/b/opene56b3a1143d500/p/lib" "/root/.conan2/p/b/libde3e59198e1f030/p/lib" "/root/.conan2/p/b/imath93646a610b69b/p/lib" "/root/.conan2/p/b/pystread4a2c558974/p/lib" "/root/.conan2/p/b/yaml-8766cd17dd26d/p/lib" "/root/.conan2/p/b/miniz3c27d6cd524ad/p/lib" "/root/.conan2/p/b/bzip2d0969c6a4a573/p/lib" "/root/.conan2/p/b/xz_utb63605fe5c051/p/lib" "/root/.conan2/p/b/zstd4a9f6ad3dde0f/p/lib" "/root/.conan2/p/b/opens939255f8546e8/p/lib" "/root/.conan2/p/b/zlibb9a035bf5348c/p/lib" "/root/.conan2/p/b/libic16d33c41345b9/p/lib" "/root/.conan2/p/b/lcmsdd1c7c13c7b3a/p/lib")
...

P.S. I'm experimenting with this in a Docker container. Here is some sample code I use. I have since made a large set of local changes to debug, but the version in the repo should show the actual problem. It needs modifications to the opencolorio recipe that actually allow shared building: https://github.com/irieger/cpp-playground/tree/main/ocio-conan-dynamic-linking

jcar87 commented 4 months ago

Hi @irieger - thanks for reporting this issue.

There are two aspects of this - what happens at link time (where cmake passes the -rpath flag), and what happens at runtime.

I've been unable to reproduce the link-time issues:

...
find_package(minizip-ng)
...

add_library(a ...)
target_link_libraries(a PRIVATE minizip-ng)

add_executable(exe ...)
target_link_libraries(exe PRIVATE a)

I'm surprised by this, because the minizip-ng recipe is set to be used like:

find_package(minizip REQUIRED CONFIG)
target_link_libraries(my_library PRIVATE MINIZIP::minizip)

Even using it like this, I'm unable to reproduce the link time issue, that is:

/usr/bin/ld: /root/.conan2/p/b/miniz3c27d6cd524ad/p/lib/libminizip-ng.so.4: undefined reference to `libiconv_open'
/usr/bin/ld: /root/.conan2/p/b/miniz3c27d6cd524ad/p/lib/libminizip-ng.so.4: undefined reference to `libiconv_close'
/usr/bin/ld: /root/.conan2/p/b/miniz3c27d6cd524ad/p/lib/libminizip-ng.so.4: undefined reference to `libiconv'

With a minimal CMakeLists like:

cmake_minimum_required(VERSION 3.27)

project(MyProject LANGUAGES C CXX)

find_package(minizip REQUIRED CONFIG)

# Library that has a public function, where the implementation makes calls to minizip functionality
add_library(my_library my_lib.c)
target_link_libraries(my_library PRIVATE MINIZIP::minizip)

# Executable that calls my_library public function - minizip is not seen here as it is a private dependency
add_executable(my_executable example.c)
target_link_libraries(my_executable PRIVATE my_library)

CMake is able to correctly build and link this, even when minizip (and libiconv) are shared libraries - that should be the case regardless, as far as I can see. Running the executable poses some issues finding libiconv, but that's a different (and relatively easy to solve) story. Do you have a reproducible example of being unable to link when building the executable?

Doing some analysis with ldd, readelf -d etc., I found that liba.so (in the real case libOpenColorIO.so.2.3.2) finds all its dependencies except libiconv.so.2. Output of ldd:

Bear in mind that ldd may not produce correct results - in order for the runtime linker to find the right dependencies at runtime, you need to activate conanrun.sh, which will set LD_LIBRARY_PATH.

That's why you're seeing things like:

 libminizip-ng.so.4 => /root/.conan2/p/b/miniz3c27d6cd524ad/p/lib/libminizip-ng.so.4 (0x0000772c127ca000)

being resolved correctly (likely a direct dependency of your executable, where the library is found in one of the RUNPATH entries)

but you're also getting:

        liblzma.so.5 => /lib/x86_64-linux-gnu/liblzma.so.5 (0x0000772c1223e000)
        libzstd.so.1 => /lib/x86_64-linux-gnu/libzstd.so.1 (0x0000772c1216f000)
        libcrypto.so.3 => /lib/x86_64-linux-gnu/libcrypto.so.3 (0x0000772c11d2b000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x0000772c11d0f000)
        libbz2.so.1 => /lib/x86_64-linux-gnu/libbz2.so.1 (0x0000772c11cfc000)

These should come from Conan, and NOT from the system as in your case.

or this

libiconv.so.2 => not found

This is because the iconv symbols are part of glibc, so there is no equivalent system libiconv library to fall back on.

The reason this is happening is that the linker is embedding the rpath (as provided by CMake) in the DT_RUNPATH dynamic tag, which only works for direct dependencies.

I'd suggest trying to launch your executable again after activating conanrun.sh which sets LD_LIBRARY_PATH correctly.
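To compare ldd runs (for example before and after activating conanrun.sh), the output can be sorted into the three groups discussed above. This is only a minimal sketch: the function name classify_ldd and the "/.conan2/p/" marker are assumptions taken from the paths shown in this thread, not anything provided by Conan, and the sample lines are copied from the ldd output above.

```python
import re

# Matches lines such as "libz.so.1 => /lib/.../libz.so.1 (0x...)" or "libfoo.so => not found".
LDD_LINE = re.compile(r"^\s*(?P<name>\S+)\s+=>\s+(?P<target>not found|\S+)")


def classify_ldd(output, cache_marker="/.conan2/p/"):
    """Group ldd entries by where the runtime loader resolved them.

    cache_marker is an assumed substring identifying the Conan cache.
    """
    groups = {"conan": [], "system": [], "missing": []}
    for line in output.splitlines():
        match = LDD_LINE.match(line)
        if match is None:
            continue  # vdso and loader lines carry no "=>"
        name, target = match["name"], match["target"]
        if target == "not found":
            groups["missing"].append(name)
        elif cache_marker in target:
            groups["conan"].append(name)
        else:
            groups["system"].append(name)
    return groups


SAMPLE = """\
        linux-vdso.so.1 (0x00007fffb1bde000)
        libminizip-ng.so.4 => /root/.conan2/p/b/miniz3c27d6cd524ad/p/lib/libminizip-ng.so.4 (0x0000772c127ca000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x0000772c11d0f000)
        libiconv.so.2 => not found
"""

print(classify_ldd(SAMPLE))
```

Entries in the "system" group are not necessarily wrong (libc and libstdc++ normally live there); the ones to look at are libraries that Conan also provides, such as zlib in this case.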

Alternatively, you may try adding add_link_options("-Wl,--disable-new-dtags") before defining your targets in CMake - this will cause CMake to tell the linker to use DT_RPATH instead (legacy behaviour), which should work for resolving indirect dependencies as well. Bear in mind, though, that this is legacy behaviour and may be discouraged in security-sensitive environments - it would be fine during development, but if you package your application with its dependencies, it would be good to use DT_RUNPATH instead.
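The difference between DT_RUNPATH and DT_RPATH described above can be illustrated with a toy model of the runtime loader's search. It is a deliberately simplified sketch, not glibc's actual implementation: it ignores ld.so.cache, default system directories and $ORIGIN, and the Elf class, the /pkg/... directories and the example graph are all made up for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Elf:
    """Toy stand-in for the dynamic section of an ELF object."""
    needed: list = field(default_factory=list)   # DT_NEEDED sonames
    runpath: list = field(default_factory=list)  # DT_RUNPATH directories
    rpath: list = field(default_factory=list)    # DT_RPATH directories


def search_dirs(chain, ld_library_path):
    """Directories searched for the DT_NEEDED entries of chain[-1].

    chain is the load chain from the executable down to the requesting object.
    """
    requester = chain[-1]
    dirs = []
    if not requester.runpath:
        # DT_RPATH also applies to children: requester first, then its loaders.
        for obj in reversed(chain):
            dirs.extend(obj.rpath)
    dirs.extend(ld_library_path)
    # DT_RUNPATH only serves the requester's own direct dependencies.
    dirs.extend(requester.runpath)
    return dirs


def resolve(exe, objects, files, ld_library_path=()):
    """Map every reachable soname to a path, or None if it is not found.

    objects: soname -> Elf; files: directory -> set of sonames stored there.
    """
    resolved = {}
    queue = deque([[exe]])
    while queue:
        chain = queue.popleft()
        for soname in chain[-1].needed:
            if soname in resolved:
                continue
            hit = next((d for d in search_dirs(chain, ld_library_path)
                        if soname in files.get(d, ())), None)
            resolved[soname] = f"{hit}/{soname}" if hit else None
            if hit:
                queue.append(chain + [objects[soname]])
    return resolved


def example(tag, ld_library_path=()):
    """exe -> liba.so -> libminizip-ng.so.4 -> libiconv.so.2 (runpath stripped)."""
    dirs = ["/pkg/a", "/pkg/minizip", "/pkg/iconv"]
    files = {"/pkg/a": {"liba.so"},
             "/pkg/minizip": {"libminizip-ng.so.4"},
             "/pkg/iconv": {"libiconv.so.2"}}
    objects = {
        "liba.so": Elf(needed=["libminizip-ng.so.4"], **{tag: dirs}),
        "libminizip-ng.so.4": Elf(needed=["libiconv.so.2"]),
        "libiconv.so.2": Elf(),
    }
    exe = Elf(needed=["liba.so"], **{tag: dirs})
    return resolve(exe, objects, files, ld_library_path)


print(example("runpath")["libiconv.so.2"])
print(example("rpath")["libiconv.so.2"])
```

In this model, the "runpath" variant leaves libiconv.so.2 unresolved because libminizip-ng.so.4 has no search path of its own, while the "rpath" variant (what --disable-new-dtags produces) and the LD_LIBRARY_PATH variant both find it.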

irieger commented 4 months ago

An example is here: https://github.com/irieger/cpp-playground/tree/main/ocio-conan-dynamic-linking The given link shares a docker for having a clean build env & a modified recipe for OCIO that allows building it as a shared library with all its dependencies also shared.

irieger commented 4 months ago

In all these tests, I care about the build/link stage, not execution. For execution I either load the env or package everything and run patchelf to set the path accordingly. There everything works fine once I have hacked the build stage to work. (For one project I currently use the hack of activating the runenv for building, so that thanks to the LD_LIBRARY_PATH env var the linker doesn't fail on the cascaded dependencies.)

jcar87 commented 4 months ago

An example is here: https://github.com/irieger/cpp-playground/tree/main/ocio-conan-dynamic-linking The given link shares a docker for having a clean build env & a modified recipe for OCIO that allows building it as a shared library with all its dependencies also shared.

Thanks! Is it possible that this is missing the CMakeLists.txt and main.cpp files? From: https://github.com/irieger/cpp-playground/blob/main/ocio-conan-dynamic-linking/conanfile.py#L15

irieger commented 4 months ago

Yeah, it is somewhat hacked. As the sample script running the docker shows, I only called conan install on this to reproduce the failure state. That conanfile.py right now is just a glorified "command list" for conan install in a sense.

jcar87 commented 4 months ago

Thanks @irieger - I've been able to reproduce this with your repository.

I have a question: this only fails with the changes you have made to the opencolorio recipe - in particular, you have lifted the restriction that minizip-ng is a static dependency. Otherwise Conan would have alerted you of an options mismatch and required that minizip-ng was built statically.

Note that this restriction mirrors what is actually intended by the maintainers of OpenColorIO, that is, there seems to be an underlying assumption that minizip-ng is static, and the symbols are excluded: https://github.com/AcademySoftwareFoundation/OpenColorIO/blob/67a26e4a383d2125bae437432cdad535d55c751b/src/OpenColorIO/CMakeLists.txt#L383-L412

irieger commented 4 months ago

Oh, interesting point. I tried to find a hint as to why minizip needs to be static, as the recipe has no clear description of the reason, but I didn't see this part or make the connection. On my side it has compiled & linked so far, but given the way I currently use OpenImageIO, I don't think I hit the code paths that would use the library. (Compiled & linked by removing the static restriction and setting the linking to PUBLIC so that the library path is also given to the tools.)

Also on Linux (at least on Arch Linux) the official OpenColorIO links dynamically to libminizip-ng.so.1 => /usr/lib/libminizip-ng.so.1 (0x0000796f4ee0c000). (Which looks to be built without iconv, as iconv doesn't show up here.)

I'd like to understand whether it is really necessary to enforce static linking, and I will also confirm with the maintainer, who I'm chatting with in the OCIO Slack channel. One of the main problems is that this means no shared CI build for this package and especially for other packages depending on OCIO, for example OpenImageIO, which has the same linking issues. So CI will miss all linking problems in everything using this, as the shared builds are skipped.

With both OpenColorIO & OpenImageIO fixed to build fully shared, I have so far not run into any runtime issue in the parts I use.

irieger commented 4 months ago

So I got feedback that enforcing static isn't intended, and I was also told that brew normally links dynamically as well.

EstebanDugueperoux2 commented 4 months ago

Hi @irieger and @jcar87,

I can reproduce a similar issue (https://github.com/EstebanDugueperoux2/openimageio_shared_issue) and also got errors with your PR. See all logs in https://github.com/EstebanDugueperoux2/openimageio_shared_issue/blob/main/build.log

Regards.

irieger commented 4 months ago

Thanks for taking a look.

...also got errors with your PR.

You mean the PR mentioned in the opener (#23112)?

With that I got OpenColorIO building with everything shared. Not OpenImageIO. That needed a comparable change.

I have a branch where I basically added #23112 and also changed more or less all target_link_libraries calls for OpenImageIO to PUBLIC, and both build well with everything shared. I have been running this for my projects since around the time I opened this ticket, I think. It builds and runs on macOS and Linux (I'm not building or running my stuff on Windows, so I never tested there). No issues so far at least, but with the small subset of OpenImageIO functions I currently use, I'm not sure I touch any of the code paths that were affected by the linker issue. However, I think missing .so files normally cause a failure at startup already, so I assume it should work.

Any idea how we could properly improve that or what the underlying problem might be? Why does a linker even care what a .so that is being linked into my consumer consumes itself? I always assumed the linker just tries to resolve the external symbols of "my" code, so why does it need to resolve things that the library is already resolved against? I'd assume this work was already done. I would really like to learn a bit more, and obviously I'd also like to have it properly solved upstream, if it makes sense, to reduce my hacks.

jcar87 commented 4 months ago

Any idea how we could properly improve that or what the underlying problem might be? Why does a linker even care what a .so that is being linked into my consumer consumes itself? I always assumed the linker just tries to resolve the external symbols of "my" code, so why does it need to resolve things that the library is already resolved against? I'd assume this work was already done. I would really like to learn a bit more, and obviously I'd also like to have it properly solved upstream, if it makes sense, to reduce my hacks.

When linking an executable (and interestingly, this doesn't happen when creating shared libraries), the BFD linker (the default linker) will try to locate the dependencies of your dependencies, sort of replicating the behaviour of the runtime loader. This only happens with the BFD linker as far as I'm aware. On macOS and Windows, the behaviour is what you describe. On Linux, even other linkers don't perform this search, if I remember correctly. So chances are if you use the gold linker, lld, or mold, it may actually link.

According to the ld documentation, if it doesn't locate the dependencies of your dependencies, it issues a warning and continues. However, then you get those undefined reference errors because it's trying to solve the symbols for your dependencies, even the ones you don't directly use.

The relevant parts of the ld documentation are:

-rpath-link=dir
    When using ELF or SunOS, one shared library may require another. This happens when an "ld -shared" link includes a shared library as one of the input files. When the linker encounters such a dependency when doing a non-shared, non-relocatable link, it will automatically try to locate the required shared library and include it in the link, if it is not included explicitly.

--allow-shlib-undefined
--no-allow-shlib-undefined
    Allows or disallows undefined symbols in shared libraries. This switch is similar to --no-undefined except that it determines the behaviour when the undefined symbols are in a shared library rather than a regular object file. It does not affect how undefined symbols in regular object files are handled. The default behaviour is to report errors for any undefined symbols referenced in shared libraries if the linker is being used to create an executable, but to allow them if the linker is being used to create a shared library.

What usually happens in CMake projects is that if CMake has enough information about the transitive dependencies of your dependencies, it will pass -rpath-link or -rpath for the linker to perform this search successfully (unless you are explicitly linking your transitive dependencies, which is exactly what happens by changing the PRIVATE to PUBLIC in https://github.com/conan-io/conan-center-index/pull/23112/files, but this has other downsides). I suspect this mechanism is where the issue currently lies.

If you want to try and work around this:

note that in this scenario, if CMake doesn't embed the RPATH into the executables, then the only way to run these executables is by using the conanrun.sh environment that defines LD_LIBRARY_PATH

But you are absolutely right, we need to locate the source of the problem :) - which I suspect is that CMake does not have the right information at the right time to pass the correct -rpath/-rpath-link to the linker.
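The link-time behaviour described in this comment can be sketched as a toy model. It only mirrors the two documented rules quoted above (the search for dependencies of dependencies on a non-shared link, and the default handling of undefined symbols in shared libraries); it is not how ld is implemented. The SharedLib class, the link function and the symbol names are hypothetical, and the located argument stands in for whatever the linker is able to find through -rpath, -rpath-link or similar search paths.

```python
from dataclasses import dataclass, field


@dataclass
class SharedLib:
    soname: str
    defined: set = field(default_factory=set)    # symbols the library exports
    undefined: set = field(default_factory=set)  # symbols it imports
    needed: list = field(default_factory=list)   # DT_NEEDED sonames


def link(inputs, located, output_is_executable):
    """Return (warnings, errors) for a link with shared-library inputs.

    inputs: libraries named explicitly on the link line.
    located: soname -> SharedLib for libraries the linker can find by searching.
    """
    warnings, errors = [], []
    loaded = {lib.soname: lib for lib in inputs}
    if not output_is_executable:
        # Creating a shared library: no search, undefined symbols are allowed.
        return warnings, errors
    pending = list(inputs)
    while pending:
        lib = pending.pop()
        for soname in lib.needed:
            if soname in loaded:
                continue
            dep = located.get(soname)
            if dep is None:
                warnings.append(f"{soname} needed by {lib.soname} not found")
            else:
                loaded[soname] = dep
                pending.append(dep)
    defined = set().union(*(lib.defined for lib in loaded.values()))
    for lib in loaded.values():
        for symbol in sorted(lib.undefined - defined):
            errors.append(f"{lib.soname}: undefined reference to `{symbol}'")
    return warnings, errors


# Illustrative graph: liba.so -> libminizip-ng.so.4 -> libiconv.so.2
iconv = SharedLib("libiconv.so.2", defined={"libiconv_open"})
minizip = SharedLib("libminizip-ng.so.4", defined={"mz_open"},
                    undefined={"libiconv_open"}, needed=["libiconv.so.2"])
liba = SharedLib("liba.so", defined={"a_func"},
                 undefined={"mz_open"}, needed=["libminizip-ng.so.4"])

# Only the minizip directory is visible to the linker, as in the failing build.
print(link([liba], {minizip.soname: minizip}, output_is_executable=True))
```

With libiconv.so.2 added to located, or with output_is_executable=False, the same call reports no errors in this model, which matches the observation that the problem only appears when linking the executable.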

EstebanDugueperoux2 commented 4 months ago

Hi @jcar87,

thanks for your analysis and your proposed workaround. I have tested following config in my ~/.conan2/global.conf and it works fine now:

tools.build:exelinkflags = ["-fuse-ld=gold"]
tools.build:sharedlinkflags = ["-fuse-ld=gold"]

Regards.

jcar87 commented 4 months ago

Thanks @EstebanDugueperoux2 - glad to hear this works, as it confirms the issue. For what it's worth, I suspect the linker will still produce exactly the same files (assuming the gold linker is able to).

Will keep this issue open as this needs to be investigated properly on the Conan side.

irieger commented 2 months ago

So I thought I'd try to patch OpenColorIO with a new MR reduced to the minimum of needed changes, and left the old one as is. Funnily, everything links with Conan 1 but not with Conan 2. Maybe that helps. I hadn't looked so closely before but was just confused about getting a CI pass for Conan 1: https://github.com/conan-io/conan-center-index/pull/24395#issuecomment-2179388048

irieger commented 2 months ago

Just a general note: As I think I mentioned a while ago in Slack, I followed @EstebanDugueperoux2's advice and run with mold as the linker in my profile, and my stuff has worked perfectly well locally with my local build pipeline since sometime in May. So at least that solved it for me, although I prefer to run on upstream as much as possible rather than needing to rebase my local changes to recipes.