PerlFFI / FFI-Platypus

Write Perl bindings to non-Perl libraries with FFI. No XS required.

Using different versions of the same library simultaneously ? #393

Open Yaribz opened 1 year ago

Yaribz commented 1 year ago

Hello,

I'm trying to use FFI::Platypus to load two libraries, but I encounter weird problems... Both libraries work perfectly through FFI::Platypus as long as I don't load the other one. But as soon as I do, it either crashes or misbehaves, depending on which version I load first. I'm not sure whether it is important, but these two libraries have lots of functions with the same name (they are actually two versions of the same library, and have the same SONAME).

I managed to reduce the Perl code triggering the problem to just a few lines:

use FFI::Platypus;

# If the following lines are uncommented, libtest1 will be loaded and unloaded just after.
# But then, the Init call on libtest2 below will fail with error 'std::bad_cast' (it happens
# in the library itself: the exception is caught by the library and returned by the
# GetNextError call).
# If these lines remain commented out, libtest2 will work correctly.
#
#if(my $dlHdl=FFI::Platypus::DL::dlopen('./libtest1.so',FFI::Platypus::DL::RTLD_PLATYPUS_DEFAULT())) {
#  FFI::Platypus::DL::dlclose($dlHdl);
#}else{
#  die "Failed to load library 1\n";
#}

my $ffi=FFI::Platypus->new(lib => './libtest2.so');

my $r_funcInit=$ffi->function('Init',['bool','int'],'int');
my $r_funcGetNextError=$ffi->function('GetNextError',[],'string');

if(! $r_funcInit->(0,0)) {
  print "LIBRARY ERROR - ".$r_funcGetNextError->()."\n";
}

Here I'm not even trying to use both libraries simultaneously. I just load one, then unload it without using it, and then load and use the second one. When I try to actually use both libraries, it usually segfaults on exit.

For the record these libraries are used a lot, simultaneously, by other programs which don't encounter this problem. I even made a small C++ program myself which loads them manually with dlopen/dlsym, and it works perfectly. So in C++ I'm able to use both of them simultaneously but in Perl I can't.

I guess I'm doing something wrong in my Perl code but I can't find it, any help would be greatly appreciated !

P.S. the libraries are from the open source SpringRTS project, I can provide them if needed

plicease commented 1 year ago

I think some (as short as possible) instructions on how to reproduce this would help. The Perl code itself looks fine to me. The FFI::Platypus::DL interface is a pretty thin layer over the Unix functions (I'm assuming you aren't on Windows since your C++ program was also using dl*, but if you were, there is some potential for shenanigans because of the compatibility layer). So I don't expect there to be anything wrong there.

https://github.com/PerlFFI/FFI-Platypus/blob/main/xs/DL.xs

RTLD_PLATYPUS_DEFAULT is just an alias for RTLD_LAZY (except for on Windows) and we don't use RTLD_GLOBAL by default so I wouldn't expect the two libraries to be aware of each other.

The one nit I can think of is if libtest1 and libtest2 have a dependency on a third library that doesn't get unloaded because it gets linked to Perl or another extension somewhere.

Yaribz commented 1 year ago

Yes I'm on Linux. The libraries used in this simple test case are the versions 103.0 and 104.0 of the unitsync library from the SpringRTS project. I renamed them to simplify the test case as much as possible, but the problem is the same with their original name in distinct directories.

Steps to reproduce using my code verbatim:

  1. Extract the file libunitsync.so from the archive https://springrts.com/dl/buildbot/default/master/103.0/linux64/spring_103.0_minimal-portable-linux64-static.7z
  2. Rename libunitsync.so to libtest1.so
  3. Extract the file libunitsync.so from the archive https://springrts.com/dl/buildbot/default/master/104.0/linux64/spring_104.0_minimal-portable-linux64-static.7z
  4. Rename libunitsync.so to libtest2.so
  5. Launch my Perl code above from the same directory

When the lines are commented, the output is: LIBRARY ERROR - Init: Required base file 'base/springcontent.sdz' does not exist. (this is normal, the library is expecting to find some data files to parse during initialization)

When the lines are uncommented, the output should be the same, but it is actually: LIBRARY ERROR - Init: std::bad_cast

If needed, the source code of the unitsync library is available here: v103.0, v104.0.

I can also provide the C++ code which works correctly with the two libraries, if needed.

Thanks a lot for taking a look at this !

Yaribz commented 1 year ago

Here are the ldd results for the two libraries:

yaribz@test:~$ ldd libtest1.so
        linux-vdso.so.1 (0x00007fff52b45000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007effbecae000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007effbeca4000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007effbeb60000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007effbeb3e000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007effbe969000)
        /lib64/ld-linux-x86-64.so.2 (0x00007effbf169000)
yaribz@test:~$ ldd libtest2.so
        linux-vdso.so.1 (0x00007ffcf132f000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fbef0d8d000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fbef0c49000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fbef0c27000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fbef0a52000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fbef1231000)

Yaribz commented 1 year ago

As said on IRC, I compared the debug outputs of the linker when using the Perl script (which produces the error) and the C++ program (which works), and I noticed a difference regarding the bindings of this function:

Perl case:
   1783808: binding file ./libtest1.so [0] to ./libtest1.so [0]: normal symbol `_ZNSs4_Rep20_S_empty_rep_storageE'
   1783808: binding file ./libtest1.so [0] to ./libtest1.so [0]: normal symbol `_ZNSbIwSt11char_traitsIwESaIwEE4_Rep20_S_empty_rep_storageE'
   1783808: binding file ./libtest2.so [0] to ./libtest1.so [0]: normal symbol `_ZNSs4_Rep20_S_empty_rep_storageE'
   1783808: binding file ./libtest2.so [0] to ./libtest1.so [0]: normal symbol `_ZNSbIwSt11char_traitsIwESaIwEE4_Rep20_S_empty_rep_storageE'

C++ case:
   1784898: binding file /lib/x86_64-linux-gnu/libstdc++.so.6 [0] to /lib/x86_64-linux-gnu/libstdc++.so.6 [0]: normal symbol `_ZNSs4_Rep20_S_empty_rep_storageE' [GLIBCXX_3.4]
   1784898: binding file /lib/x86_64-linux-gnu/libstdc++.so.6 [0] to /lib/x86_64-linux-gnu/libstdc++.so.6 [0]: normal symbol `_ZNSbIwSt11char_traitsIwESaIwEE4_Rep20_S_empty_rep_storageE' [GLIBCXX_3.4]
   1784898: binding file ./libtest1.so [0] to /lib/x86_64-linux-gnu/libstdc++.so.6 [0]: normal symbol `_ZNSs4_Rep20_S_empty_rep_storageE'
   1784898: binding file ./libtest1.so [0] to /lib/x86_64-linux-gnu/libstdc++.so.6 [0]: normal symbol `_ZNSbIwSt11char_traitsIwESaIwEE4_Rep20_S_empty_rep_storageE'
   1784898: binding file ./libtest2.so [0] to /lib/x86_64-linux-gnu/libstdc++.so.6 [0]: normal symbol `_ZNSs4_Rep20_S_empty_rep_storageE'
   1784898: binding file ./libtest2.so [0] to /lib/x86_64-linux-gnu/libstdc++.so.6 [0]: normal symbol `_ZNSbIwSt11char_traitsIwESaIwEE4_Rep20_S_empty_rep_storageE'

In the Perl case libtest2 is bound to the symbol from libtest1, whereas in the C++ case both libs bind to libstdc++. So today I tried to run the Perl script with LD_PRELOAD=/lib/x86_64-linux-gnu/libstdc++.so.6, and indeed it solves the issue: the Perl script no longer fails...
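
To spot this kind of cross-library binding in a long LD_DEBUG trace, a small filter can help. The following is only a sketch: the `parse_binding` helper and the `/libtest/` name filter are hypothetical, and the sample lines are the four "Perl case" lines quoted above.

```perl
use strict;
use warnings;

# Hypothetical helper: extract (from, to, symbol) from one
# "binding file" line of LD_DEBUG=bindings output.
sub parse_binding {
  my ($line) = @_;
  return unless $line =~ /binding file (\S+) \[\d+\] to (\S+) \[\d+\]: normal symbol `([^']+)'/;
  return { from => $1, to => $2, symbol => $3 };
}

# Sample input: the "Perl case" lines from the trace above.
my @trace = split /\n/, <<'END';
   1783808: binding file ./libtest1.so [0] to ./libtest1.so [0]: normal symbol `_ZNSs4_Rep20_S_empty_rep_storageE'
   1783808: binding file ./libtest1.so [0] to ./libtest1.so [0]: normal symbol `_ZNSbIwSt11char_traitsIwESaIwEE4_Rep20_S_empty_rep_storageE'
   1783808: binding file ./libtest2.so [0] to ./libtest1.so [0]: normal symbol `_ZNSs4_Rep20_S_empty_rep_storageE'
   1783808: binding file ./libtest2.so [0] to ./libtest1.so [0]: normal symbol `_ZNSbIwSt11char_traitsIwESaIwEE4_Rep20_S_empty_rep_storageE'
END

# Keep only bindings where a library resolves a symbol in a *different*
# libtest library (the name filter is an assumption for this test case).
my @cross = grep { $_->{from} ne $_->{to} && $_->{to} =~ /libtest/ }
            map  { parse_binding($_) } @trace;

printf "%s -> %s: %s\n", @$_{qw(from to symbol)} for @cross;
```

For a real run, the trace could be read from a file passed to the dynamic linker via LD_DEBUG_OUTPUT instead of the inline sample.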

Yaribz commented 1 year ago

The problem seems to be related to the fact that both libtest1 and libtest2 are statically linked to libstdc++6, but the version of this library or the glibc version used aren't necessarily the same...

Here is what I think is happening in Perl case after analyzing LD debug outputs:

When I run the C++ program, the libstdc++6 library is already loaded before trying to load libtest1 and libtest2, so both libraries bind to the dynamic library instead of using their local symbols and there is no problem.

When I set LD_PRELOAD to /lib/x86_64-linux-gnu/libstdc++.so.6 before running the Perl script, it does the same: it forces libtest1 and libtest2 to bind to libstdc++6 instead of using their statically linked code, and it works like in the C++ case. But it's just a workaround, it should be possible to use the statically linked code of the libraries...

Out of curiosity I also tried using the RTLD_GLOBAL dlopen mode when opening libtest1, instead of the default RTLD_LOCAL mode. This allowed the Perl script to run a bit better but there were still errors later when trying to use the libraries, depending on the library load orders etc.

Dynamic linker debug traces analysis

For reference, here are the relevant parts of the very verbose LD debug outputs regarding two selected libstdc++6 symbols, which I will just call basic_stringbuf and empty_rep_storage for simplicity.

When the Perl script is executed as is, without any change (i.e. libtest1 is loaded in default mode RTLD_PLATYPUS_DEFAULT):

symbol=vtable for std::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >;
  lookup in file=perl [0]
  lookup in file=/lib/x86_64-linux-gnu/libdl.so.2 [0]
  lookup in file=/lib/x86_64-linux-gnu/libm.so.6 [0]
  lookup in file=/lib/x86_64-linux-gnu/libpthread.so.0 [0]
  lookup in file=/lib/x86_64-linux-gnu/libc.so.6 [0]
  lookup in file=/lib/x86_64-linux-gnu/libcrypt.so.1 [0]
  lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
  lookup in file=./libtest1.so [0]
  => binding file ./libtest1.so [0] to ./libtest1.so [0]: normal symbol

symbol=std::string::_Rep::_S_empty_rep_storage;
  lookup in file=perl [0]
  lookup in file=/lib/x86_64-linux-gnu/libdl.so.2 [0]
  lookup in file=/lib/x86_64-linux-gnu/libm.so.6 [0]
  lookup in file=/lib/x86_64-linux-gnu/libpthread.so.0 [0]
  lookup in file=/lib/x86_64-linux-gnu/libc.so.6 [0]
  lookup in file=/lib/x86_64-linux-gnu/libcrypt.so.1 [0]
  lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
  lookup in file=./libtest1.so [0]
  marking ./libtest1.so [0] as NODELETE due to unique symbol
  => binding file ./libtest1.so [0] to ./libtest1.so [0]: normal symbol

symbol=vtable for std::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >;
  lookup in file=perl [0]
  lookup in file=/lib/x86_64-linux-gnu/libdl.so.2 [0]
  lookup in file=/lib/x86_64-linux-gnu/libm.so.6 [0]
  lookup in file=/lib/x86_64-linux-gnu/libpthread.so.0 [0]
  lookup in file=/lib/x86_64-linux-gnu/libc.so.6 [0]
  lookup in file=/lib/x86_64-linux-gnu/libcrypt.so.1 [0]
  lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
  lookup in file=./libtest2.so [0]
  => binding file ./libtest2.so [0] to ./libtest2.so [0]: normal symbol

symbol=std::string::_Rep::_S_empty_rep_storage;
  lookup in file=perl [0]
  lookup in file=/lib/x86_64-linux-gnu/libdl.so.2 [0]
  lookup in file=/lib/x86_64-linux-gnu/libm.so.6 [0]
  lookup in file=/lib/x86_64-linux-gnu/libpthread.so.0 [0]
  lookup in file=/lib/x86_64-linux-gnu/libc.so.6 [0]
  lookup in file=/lib/x86_64-linux-gnu/libcrypt.so.1 [0]
  lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
  lookup in file=./libtest2.so [0]
  => binding file ./libtest2.so [0] to ./libtest1.so [0]: normal symbol

In the first two blocks we see that the libtest1 bindings use the statically linked code for both symbols, as expected. However, in the last two blocks we see that libtest2 only uses its own symbol for basic_stringbuf, not for empty_rep_storage. For empty_rep_storage it binds to libtest1, although the last lookup file was libtest2.so... Maybe this is related to the "NODELETE due to unique symbol" message above? (It is the only occurrence of this message in each debug trace.)

When the Perl script is executed with RTLD_GLOBAL (i.e. libtest1 is loaded with RTLD_PLATYPUS_DEFAULT | RTLD_GLOBAL):

symbol=vtable for std::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >;
  lookup in file=perl [0]
  lookup in file=/lib/x86_64-linux-gnu/libdl.so.2 [0]
  lookup in file=/lib/x86_64-linux-gnu/libm.so.6 [0]
  lookup in file=/lib/x86_64-linux-gnu/libpthread.so.0 [0]
  lookup in file=/lib/x86_64-linux-gnu/libc.so.6 [0]
  lookup in file=/lib/x86_64-linux-gnu/libcrypt.so.1 [0]
  lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
  lookup in file=./libtest1.so [0]
  => binding file ./libtest1.so [0] to ./libtest1.so [0]: normal symbol

symbol=std::string::_Rep::_S_empty_rep_storage;
  lookup in file=perl [0]
  lookup in file=/lib/x86_64-linux-gnu/libdl.so.2 [0]
  lookup in file=/lib/x86_64-linux-gnu/libm.so.6 [0]
  lookup in file=/lib/x86_64-linux-gnu/libpthread.so.0 [0]
  lookup in file=/lib/x86_64-linux-gnu/libc.so.6 [0]
  lookup in file=/lib/x86_64-linux-gnu/libcrypt.so.1 [0]
  lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
  lookup in file=./libtest1.so [0]
  marking ./libtest1.so [0] as NODELETE due to unique symbol
  => binding file ./libtest1.so [0] to ./libtest1.so [0]: normal symbol

symbol=vtable for std::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >;
  lookup in file=perl [0]
  lookup in file=/lib/x86_64-linux-gnu/libdl.so.2 [0]
  lookup in file=/lib/x86_64-linux-gnu/libm.so.6 [0]
  lookup in file=/lib/x86_64-linux-gnu/libpthread.so.0 [0]
  lookup in file=/lib/x86_64-linux-gnu/libc.so.6 [0]
  lookup in file=/lib/x86_64-linux-gnu/libcrypt.so.1 [0]
  lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
  lookup in file=./libtest1.so [0]
  => binding file ./libtest2.so [0] to ./libtest1.so [0]: normal symbol

symbol=std::string::_Rep::_S_empty_rep_storage;
  lookup in file=perl [0]
  lookup in file=/lib/x86_64-linux-gnu/libdl.so.2 [0]
  lookup in file=/lib/x86_64-linux-gnu/libm.so.6 [0]
  lookup in file=/lib/x86_64-linux-gnu/libpthread.so.0 [0]
  lookup in file=/lib/x86_64-linux-gnu/libc.so.6 [0]
  lookup in file=/lib/x86_64-linux-gnu/libcrypt.so.1 [0]
  lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
  lookup in file=./libtest1.so [0]
  => binding file ./libtest2.so [0] to ./libtest1.so [0]: normal symbol

The only difference here is that libtest2 is entirely relying on libtest1 for the libstdc++6 symbols, whereas it was mostly using its own statically linked code previously. The Perl script works a bit better but still crashes when trying to actually use the libraries.

When the Perl script is executed with LD_PRELOAD=/lib/x86_64-linux-gnu/libstdc++.so.6:

symbol=vtable for std::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >;
  lookup in file=perl [0]
  lookup in file=/lib/x86_64-linux-gnu/libstdc++.so.6 [0]
  => binding file ./libtest1.so [0] to /lib/x86_64-linux-gnu/libstdc++.so.6 [0]: normal symbol

symbol=std::string::_Rep::_S_empty_rep_storage;
  lookup in file=perl [0]
  lookup in file=/lib/x86_64-linux-gnu/libstdc++.so.6 [0]
  => binding file ./libtest1.so [0] to /lib/x86_64-linux-gnu/libstdc++.so.6 [0]: normal symbol

symbol=vtable for std::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >;
  lookup in file=perl [0]
  lookup in file=/lib/x86_64-linux-gnu/libstdc++.so.6 [0]
  => binding file ./libtest2.so [0] to /lib/x86_64-linux-gnu/libstdc++.so.6 [0]: normal symbol

symbol=std::string::_Rep::_S_empty_rep_storage;
  lookup in file=perl [0]
  lookup in file=/lib/x86_64-linux-gnu/libstdc++.so.6 [0]
  => binding file ./libtest2.so [0] to /lib/x86_64-linux-gnu/libstdc++.so.6 [0]: normal symbol

Both libtest1 and libtest2 use the symbols from libstdc++ instead of using their statically linked code. This is not ideal but at least it seems to work.

I guess I can add the following lines at the start of my Perl code to use this workaround without relying on LD_PRELOAD:

if(my $dlHdl=FFI::Platypus::DL::dlopen('libstdc++.so.6',FFI::Platypus::DL::RTLD_PLATYPUS_DEFAULT() | FFI::Platypus::DL::RTLD_GLOBAL())) {
  FFI::Platypus::DL::dlclose($dlHdl);
}

But it would be really better if there was a way to just actually use the static code from the libraries themselves without triggering some weird conflict, especially because some distributions such as Alpine don't provide libstdc++ by default...

Yaribz commented 1 year ago

Apparently it is the expected behavior to have only one std::string::_Rep::_S_empty_rep_storage symbol in memory, as it is used to represent the empty string and pointers to it must be detectable (see the explanations in this issue for details). So I'm not sure what exactly the problem is here. I checked both libtest1 and libtest2: they both have the u flag for the std::string::_Rep::_S_empty_rep_storage symbol as expected, and as shown in the debug traces in my previous message, the symbol is shared as expected...

plicease commented 1 year ago

I don't think this is a Platypus bug; unfortunately, it is related to the challenges of loading a C++ library from a C program. I suspect that if your C++ program were re-written as C and linked using the C linker that you'd get the same error. I think these are the options:

  1. Use LD_PRELOAD, this is obviously not good in almost any circumstances
  2. Build Perl with the C++ compiler making Perl a C++ application, this is possible with newer versions of Perl, but depending on how this is deployed and used you may not have control over how Perl is built.
  3. Build libtest1 and libtest2 to dynamically link against libstdc++.so instead of statically (which may not be easy or even possible, or again you may not have control over that step)
  4. Use the RTLD_GLOBAL trick before opening libtest1 and libtest2.
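
As a rough illustration of option 4, here is a minimal sketch, assuming a Linux system where FFI::Platypus::DL exports RTLD_GLOBAL. The `preload_global` helper is a hypothetical name, and the library paths and the Init/GetNextError signatures are taken from the test case earlier in this thread. Unlike the snippet posted above, it keeps the libstdc++ handle open instead of closing it right away.

```perl
use strict;
use warnings;
use FFI::Platypus 1.00;
use FFI::Platypus::DL;

# Hypothetical helper: load a library into the global symbol scope so that
# libraries opened later bind to it rather than to each other.
sub preload_global {
  my ($name) = @_;
  my $handle = dlopen($name, RTLD_PLATYPUS_DEFAULT | RTLD_GLOBAL);
  warn "could not preload $name: " . dlerror() . "\n" unless $handle;
  return $handle;
}

# Keep the handle for the lifetime of the process (no dlclose here).
my $stdcxx = preload_global('libstdc++.so.6');

# Only touch the test libraries if they are actually present.
for my $lib (grep { -e } './libtest1.so', './libtest2.so') {
  my $ffi        = FFI::Platypus->new( api => 1, lib => $lib );
  my $init       = $ffi->function( Init         => [ 'bool', 'int' ] => 'int' );
  my $next_error = $ffi->function( GetNextError => []                => 'string' );
  unless ( $init->(0, 0) ) {
    my $error = $next_error->();
    print "$lib: LIBRARY ERROR - ", ( defined $error ? $error : 'unknown' ), "\n";
  }
}
```

Whether this is acceptable depends on the target system: it still requires a shared libstdc++ to be installed, which is the limitation raised earlier for distributions such as Alpine.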

plicease commented 1 year ago

I'm going to tag this as "Documentation" and leave it open because I think a useful FAQ could probably be synthesized from this. Thanks @Yaribz for reporting.

Note: I think there is a more generic problem with multiple C++ libs, they don't have to be different versions of the same.

Yaribz commented 1 year ago

> I suspect that if your C++ program were re-written as C and linked using the C linker that you'd get the same error

That's exactly what I wanted to check before closing the issue :) And I just managed to reproduce the problem with pure C code, so I can confirm this is not related to FFI::Platypus at all.

> I'm going to tag this as "Documentation" and leave it open because I think a useful FAQ could probably be synthesized from this. Thanks @Yaribz for reporting.
>
> Note: I think there is a more generic problem with multiple C++ libs, they don't have to be different versions of the same.

Indeed, I chose the issue title when I thought the problem was related to conflicts involving symbols defined by the libraries themselves, not by their statically linked libstdc++ versions. I guess something like "Caveats regarding libraries statically linked to libstdc++" makes more sense if you want to put this in a FAQ, although it doesn't seem to be directly related to FFI::Platypus...

One question though: while investigating the problem I wanted to try the RTLD_DEEPBIND dlopen flag for libtest2, to make it prefer its own internal symbols. However, it seems the dlopen mode used by FFI::Platypus is hardcoded to RTLD_PLATYPUS_DEFAULT and not customizable by the user, or maybe I missed something? I'm not sure it would help in my case anyway, as there are a lot of symbols with the unique flag, which I guess should prevent RTLD_DEEPBIND from fully working, but maybe it could be useful in other cases?

edit: I just tested with RTLD_PLATYPUS_DEFAULT | RTLD_DEEPBIND and I confirm it doesn't solve my issue.

plicease commented 1 year ago

> One question though: while investigating the problem I wanted to try the RTLD_DEEPBIND dlopen flag for libtest2, to make it prefer its own internal symbols. However, it seems the dlopen mode used by FFI::Platypus is hardcoded to RTLD_PLATYPUS_DEFAULT and not customizable by the user, or maybe I missed something? I'm not sure it would help in my case anyway, as there are a lot of symbols with the unique flag, which I guess should prevent RTLD_DEEPBIND from fully working, but maybe it could be useful in other cases?

Yeah I chose to not include that functionality in the FFI::Platypus interface. You can still reach for it through the ::DL interface when you really need it, though it would admittedly be awkward to do your own symbol resolution. This keeps the higher level interface less complicated and more portable (aside from the phony RTLD_PLATYPUS_DEFAULT, all of these values are platform dependent). I'm open to revisiting this decision, but in practice I haven't had to reach for that knob myself yet.
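
For illustration, a minimal sketch of that manual route: open a handle with custom flags through FFI::Platypus::DL, resolve the symbol with dlsym, and pass the resulting address to `$ffi->function` in place of a name. To keep the sketch runnable without extra files, it opens the current process (`undef`) and resolves `strlen` from libc; both are stand-ins chosen here, where the case discussed in this issue would use './libtest2.so' and 'Init'. RTLD_DEEPBIND is a glibc extension, so it is only added when the constant exists.

```perl
use strict;
use warnings;
use FFI::Platypus 1.00;
use FFI::Platypus::DL;

# Start from the portable default and add RTLD_DEEPBIND only where the
# platform provides it.
my $flags = RTLD_PLATYPUS_DEFAULT;
if ( my $deepbind = FFI::Platypus::DL->can('RTLD_DEEPBIND') ) {
  $flags |= $deepbind->();
}

# Stand-in target: undef opens the current process. Replace with a library
# path such as './libtest2.so' for a real use case.
my $handle = dlopen( undef, $flags )
  or die "dlopen failed: " . dlerror() . "\n";

# Do the symbol resolution ourselves...
my $address = dlsym( $handle, 'strlen' )
  or die "dlsym failed: " . dlerror() . "\n";

# ...and attach a signature to the raw address instead of a symbol name.
my $ffi    = FFI::Platypus->new( api => 1 );
my $strlen = $ffi->function( $address => ['string'] => 'size_t' );

my $len = $strlen->call('platypus');
print "strlen returned $len\n";

dlclose($handle);
```

The cost of this approach is visible in the sketch: every function needs its own dlsym call and the handle's lifetime has to be managed by hand.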

plicease commented 1 year ago

> Indeed, I chose the issue title when I thought the problem was related to conflicts involving symbols defined by the libraries themselves, not by their statically linked libstdc++ versions. I guess something like "Caveats regarding libraries statically linked to libstdc++" makes more sense if you want to put this in a FAQ, although it doesn't seem to be directly related to FFI::Platypus...

Yeah true it isn't really Platypus itself, but it is a thing that you can run into when using Platypus which I try to document where possible.