conda-forge / ctng-compilers-feedstock

A conda-smithy repository for ctng-compilers.
BSD 3-Clause "New" or "Revised" License

dlopen of libgomp 13.1.0 and 13.2.0 with RTLD_DEEPBIND on Python fails with segmentation fault on Ubuntu 22.04 #114

Closed by traversaro 10 months ago

traversaro commented 10 months ago

Solution to issue cannot be found in the documentation.

Issue

If I dlopen libgomp 13.* with RTLD_DEEPBIND from a Python environment, I get a segfault. A simple reproducer is the command python -c "import ctypes; import os; ctypes._dlopen(os.environ['CONDA_PREFIX']+'/lib/libgomp.so.1', os.RTLD_DEEPBIND)" :

(testsegfault) traversaro@IITICUBLAP257:~$ python -c "import ctypes; import os; ctypes._dlopen(os.environ['CONDA_PREFIX']+'/lib/libgomp.so.1', os.RTLD_DEEPBIND)"
Segmentation fault
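
Run directly, the reproducer only makes the shell print "Segmentation fault". Below is a minimal sketch of a harness, assuming a POSIX system, that performs the same ctypes._dlopen call in a child interpreter and reports how the child ended. The helper names dlopen_in_child and describe are made up for illustration and are not part of any package.

```python
import signal
import subprocess
import sys


def dlopen_in_child(lib_path, flag_names=("RTLD_NOW",)):
    """Run ctypes._dlopen(lib_path, flags) in a fresh interpreter.

    Returns the child's return code; on POSIX a negative value -N means
    the child was killed by signal N.
    """
    # Build an expression such as "os.RTLD_NOW | os.RTLD_DEEPBIND".
    flags_expr = " | ".join(f"os.{name}" for name in flag_names)
    code = f"import ctypes, os; ctypes._dlopen({lib_path!r}, {flags_expr})"
    result = subprocess.run(
        [sys.executable, "-c", code],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


def describe(returncode):
    """Turn a subprocess return code into a short human-readable string."""
    if returncode < 0:
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"
```

Calling describe(dlopen_in_child(os.environ['CONDA_PREFIX'] + '/lib/libgomp.so.1', ('RTLD_NOW', 'RTLD_DEEPBIND'))) in the environment above should then report a SIGSEGV if the crash reproduces, and the same call without RTLD_DEEPBIND can be used for comparison.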

The issue does not appear if:

The backtrace is the following:

(gdb) bt
#0  initialize_env () at ../../../libgomp/env.c:2062
#1  0x00007ffff7fc947e in call_init (l=<optimized out>, argc=argc@entry=3, argv=argv@entry=0x7fffffffc1f8, env=env@entry=0x7fffffffc218)
    at ./elf/dl-init.c:70
#2  0x00007ffff7fc9568 in call_init (env=0x7fffffffc218, argv=0x7fffffffc1f8, argc=3, l=<optimized out>) at ./elf/dl-init.c:33
#3  _dl_init (main_map=0x555555b8e620, argc=3, argv=0x7fffffffc1f8, env=0x7fffffffc218) at ./elf/dl-init.c:117
#4  0x00007ffff7e09c85 in __GI__dl_catch_exception (exception=<optimized out>, operate=<optimized out>, args=<optimized out>)
    at ./elf/dl-error-skeleton.c:182
#5  0x00007ffff7fd0ff6 in dl_open_worker (a=0x7fffffffb910) at ./elf/dl-open.c:808

and seems to indicate that something is going wrong around https://github.com/gcc-mirror/gcc/blob/releases/gcc-13.2.0/libgomp/env.c#L2062 . I have a few ideas for investigating this further, such as inspecting the value of the environ global variable, but I am not sure when I will have time for this, so in the meantime I have opened this issue.

Downstream issue: https://github.com/conda-forge/casadi-feedstock/issues/91 .

Installed packages

(testsegfault) traversaro@IITICUBLAP257:~$ conda list
# packages in environment at /home/traversaro/miniforge3/envs/testsegfault:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
bzip2                     1.0.8                h7f98852_4    conda-forge
ca-certificates           2023.7.22            hbcca054_0    conda-forge
ld_impl_linux-64          2.40                 h41732ed_0    conda-forge
libexpat                  2.5.0                hcb278e6_1    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libgcc-ng                 13.2.0               h807b86a_0    conda-forge
libgomp                   13.2.0               h807b86a_0    conda-forge
libnsl                    2.0.0                h7f98852_0    conda-forge
libsqlite                 3.43.0               h2797004_0    conda-forge
libuuid                   2.38.1               h0b41bf4_0    conda-forge
libzlib                   1.2.13               hd590300_5    conda-forge
ncurses                   6.4                  hcb278e6_0    conda-forge
openssl                   3.1.2                hd590300_0    conda-forge
pip                       23.2.1             pyhd8ed1ab_0    conda-forge
python                    3.11.5          hab00c5b_0_cpython    conda-forge
readline                  8.2                  h8228510_1    conda-forge
setuptools                68.2.2             pyhd8ed1ab_0    conda-forge
tk                        8.6.12               h27826a3_0    conda-forge
tzdata                    2023c                h71feb2d_0    conda-forge
wheel                     0.41.2             pyhd8ed1ab_0    conda-forge
xz                        5.2.6                h166bdaf_0    conda-forge

Environment info

(testsegfault) traversaro@IITICUBLAP257:~$ conda info

     active environment : testsegfault
    active env location : /home/traversaro/miniforge3/envs/testsegfault
            shell level : 1
       user config file : /home/traversaro/.condarc
 populated config files : /home/traversaro/miniforge3/.condarc
                          /home/traversaro/.condarc
          conda version : 23.3.1
    conda-build version : not installed
         python version : 3.10.12.final.0
       virtual packages : __archspec=1=x86_64
                          __cuda=12.2=0
                          __glibc=2.35=0
                          __linux=5.15.90.1=0
                          __unix=0=0
       base environment : /home/traversaro/miniforge3  (writable)
      conda av data dir : /home/traversaro/miniforge3/etc/conda
  conda av metadata url : None
           channel URLs : https://conda.anaconda.org/conda-forge/linux-64
                          https://conda.anaconda.org/conda-forge/noarch
          package cache : /home/traversaro/miniforge3/pkgs
                          /home/traversaro/.conda/pkgs
       envs directories : /home/traversaro/miniforge3/envs
                          /home/traversaro/.conda/envs
               platform : linux-64
             user-agent : conda/23.3.1 requests/2.31.0 CPython/3.10.12 Linux/5.15.90.1-microsoft-standard-WSL2 ubuntu/22.04.2 glibc/2.35
                UID:GID : 1000:1000
             netrc file : None
           offline mode : False

traversaro commented 10 months ago

libgomp <= 12 is used

Indeed, it seems that the problematic piece of code was only introduced in libgomp 13 : https://github.com/gcc-mirror/gcc/commit/9f2fca56593a2b87026b399d26adcdca90705685 .

h-vetinari commented 10 months ago

Based on the patch you found, it seems to have something to do with parsing OMP_* environment variables?

I saw that the ipopt-feedstock sets

  # Environment variables needed by spral
  # See https://github.com/ralna/spral#usage-at-a-glance
  export OMP_CANCELLATION=TRUE
  export OMP_PROC_BIND=TRUE

In particular, from the commit you linked that introduced the new facility for host vs. device, it seems to me that:

traversaro commented 10 months ago

I am not sure this is related to ipopt/spral. The environment reported in https://github.com/conda-forge/ctng-compilers-feedstock/issues/114#issue-1893934215 is created with mamba create -n testsegfault libgomp python, and no OMP_* variables are defined in it.

While things clearly shouldn't break, the code does warn for invalid values, so trying to rebuild the affected stack against libgomp 13.x would probably be a good idea.

Just to understand, which stack? The problem occurs just by combining libgomp and python, and I do not think that python depends on libgomp.

h-vetinari commented 10 months ago

OK, sorry about that. I followed your "downstream issue" a bit, that's why I got to ipopt. If this happens purely with python+libgomp, then I'm more stumped (I thought it was something about setting/parsing the OMP_* options). I fail to imagine how the commit you referenced would touch the ABI, but perhaps that's the case. Might be interesting to rebuild python with gcc 13 to see if that changes anything?

traversaro commented 10 months ago

I found another issue that contains a segfault in libgomp's initialize_env(): https://github.com/weechat/weechat/issues/2009 . If I understood correctly, it again happens with libgomp 13.2.0, but on Fedora 39.

traversaro commented 10 months ago

I reproduced the issue on Debian and Ubuntu releases whose apt packages ship gomp 13, while earlier releases with gomp 12 all pass: https://github.com/traversaro/reproduce-python-gomp-deepbind-issue/actions/runs/6172933871 . On the other hand, Fedora 38 ships gomp 13.2.0 but does not reproduce the error, and the latest Arch does not reproduce it either.

S-Dafarra commented 10 months ago

I found another issue that contains a segfault in libgomp's initialize_env(): weechat/weechat#2009 . If I understood correctly, it again happens with libgomp 13.2.0, but on Fedora 39.

The issue here seems to happen even with PHP. I wonder if it happens in general when using dlopen

traversaro commented 10 months ago

I found another issue that contains a segfault in libgomp's initialize_env(): weechat/weechat#2009 . If I understood correctly, it again happens with libgomp 13.2.0, but on Fedora 39.

The issue here seems to happen even with PHP. I wonder if it happens in general when using dlopen

I tested with casadi, and the issue did not happen when using a simple C++ example (I tested https://github.com/casadi/casadi/blob/main/docs/examples/cplusplus/ipopt_nl.cpp).

traversaro commented 10 months ago

Just to be sure I created a minimal C-based test, and indeed the issue does not appear to happen with that, see https://github.com/traversaro/reproduce-python-gomp-deepbind-issue/actions/runs/6174144211 and https://github.com/traversaro/reproduce-python-gomp-deepbind-issue/blob/main/test.c .

traversaro commented 10 months ago

I was able to reproduce the problem without libgomp, just with a manually coded shared lib, i.e. testso.c :

#include <stdlib.h>
#include <stdio.h>
#include <unistd.h> // Declares environ only when _GNU_SOURCE is defined

extern char **environ; // So declare environ explicitly

// Runs when the shared library is loaded, mimicking libgomp's initialize_env().
static void __attribute__((constructor))
initialize_env (void)
{
    char **env;
    fprintf(stderr, "Print debug\n");
    env = environ;
    fprintf(stderr, "environ %p env %p\n", (void *) environ, (void *) env);
    // Deliberately unguarded: this dereferences environ and crashes if it is NULL.
    for (env = environ; *env != 0; env++)
    {
        fprintf(stderr, "%s\n", *env);
    }
}

gcc -shared -fPIC testso.c -o testso.so
(testsegfault) traversaro@IITICUBLAP257:~/test_ipopt_dir$ python -c "import ctypes; import os; ctypes._dlopen('./testso.so', os.RTLD_DEEPBIND)"
Print debug
environ (nil) env (nil)
Segmentation fault

While in normal use:

Trying to load with RTLD_LAZY|RTLD_DEEPBIND ./testso.so
Print debug
environ 0x7ffcf93cefc0 env 0x7ffcf93cefc0

For some reason the environ global variable is set to 0/NULL.
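
For comparison, the value of environ visible from the global scope of a running Python process can be read with ctypes. This is only a sketch, assuming Linux with glibc, where environ can be resolved through ctypes.CDLL(None); read_c_environ is a hypothetical helper name. It says nothing about what a constructor in a library loaded with RTLD_DEEPBIND sees, but it does show the NULL check that the loop in the constructor above lacks.

```python
import ctypes
import os


def read_c_environ():
    """Return the C-level environment as a list of bytes, or None if environ is NULL."""
    # Assumption: Linux/glibc, where CDLL(None) exposes the process's global symbols.
    process = ctypes.CDLL(None)
    environ = ctypes.POINTER(ctypes.c_char_p).in_dll(process, "environ")
    if not environ:  # NULL pointer: the case that crashes an unguarded loop
        return None
    entries = []
    index = 0
    while environ[index] is not None:  # the array ends with a NULL entry
        entries.append(environ[index])
        index += 1
    return entries
```

In an ordinary interpreter session this is expected to return a non-empty list of NAME=value byte strings.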

So perhaps we should move the issue to Python feedstock?

traversaro commented 10 months ago

OK, I think this is the combination of two different behaviours/problems:

P2 can be reproduced easily on libgomp >= 13 with this MWE:

#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

int main () {
    // Empty the environment; on glibc this leaves environ == NULL.
    clearenv();
    // On affected libgomp versions the process crashes inside dlopen,
    // while the library constructor runs, so neither message is printed.
    void * handle = dlopen("libgomp.so.1", RTLD_NOW);

    if (handle) {
        fprintf(stderr, "dlopen of libgomp.so.1 done correctly.\n");
        return EXIT_SUCCESS;
    } else {
        fprintf(stderr, "dlopen of libgomp.so.1 failed with error: %s.\n", dlerror());
        return EXIT_FAILURE;
    }
}

to run:

gcc test_gomp_segfault.c -o test_gomp_segfault -ldl
./test_gomp_segfault
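
The same condition can also be set up without a compiler. Below is a minimal sketch, assuming Linux with glibc, that calls clearenv() through ctypes in a child interpreter and then dlopens the library there, so that a crash shows up as a negative return code. The function name dlopen_after_clearenv is hypothetical.

```python
import subprocess
import sys

# Code executed in the child: clear the C-level environment, then dlopen argv[1].
CHILD_CODE = """
import ctypes, os, sys
ctypes.CDLL(None).clearenv()  # on glibc this leaves environ == NULL
ctypes._dlopen(sys.argv[1], os.RTLD_NOW)
"""


def dlopen_after_clearenv(lib_name="libgomp.so.1"):
    """Return the child's return code: 0 on success, 1 if dlopen raised
    OSError, and a negative value -N if the child was killed by signal N."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE, lib_name],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode
```

A return code of -11 (SIGSEGV) from dlopen_after_clearenv() would correspond to the crash described for P2, while 0 means the library loaded without problems.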

I will open a bug upstream in GCC for P2.

traversaro commented 10 months ago

I will open a bug upstream in GCC for P2.

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111413

traversaro commented 10 months ago

The issue was fixed upstream for GCC14, see:

The patch is huge, but ignoring indentation changes it boils down to a single-line change, which may be better suited for a backport because it reduces the risk of patch conflicts.

h-vetinari commented 10 months ago

Great job!

The patch is huge, but ignoring indentation changes it boils down to a single-line change

Proof of that statement, using GitHub's UI.

traversaro commented 9 months ago

P1: constructors of shared library opened by dlopen with RTLD_DEEPBIND on Python on conda-forge/Debian have environ==NULL

* I am not sure why this happens, and if it is expected behaviour or a bug

It turns out that this also worked fine with gomp <= 12 and does not work with gomp 13, so I opened an issue for that as well: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111556 . However, to be honest, I am not sure whether this is a problem in libgomp, in glibc, or simply a consequence of how ELF and the POSIX spec interact.