Closed traversaro closed 10 months ago
libgomp <= 12 is used
Indeed, it seems that the problematic piece of code was only introduced in libgomp 13 : https://github.com/gcc-mirror/gcc/commit/9f2fca56593a2b87026b399d26adcdca90705685 .
Based on the patch you found, it seems to have something to do with parsing OMP_* environment variables?
I saw that the ipopt-feedstock sets
# Environment variables needed by spral
# See https://github.com/ralna/spral#usage-at-a-glance
export OMP_CANCELLATION=TRUE
export OMP_PROC_BIND=TRUE
In particular, from the commit you linked that introduced the new facility for host vs. device, it seems to me that:
* OMP_PROC_BIND changed substantially (as opposed to OMP_CANCELLATION)
* OMP_PROC_BIND, but rather things like: "spread", "close", "spread,spread", "spread,close"
I am not sure this is related to ipopt/spral. The environment in which this happens, reported in https://github.com/conda-forge/ctng-compilers-feedstock/issues/114#issue-1893934215 , is created with mamba create -n testsegfault libgomp python, and in that environment no OMP_* variables are defined.
While things clearly shouldn't break, the code does warn for invalid values, so trying to rebuild the affected stack against libgomp 13.x would probably be a good idea.
Just to understand, which stack? The problem occurs just by combining libgomp and python, and I do not think that python depends on libgomp .
OK, sorry about that. I followed your "downstream issue" a bit, that's why I got to ipopt. If this happens purely with python+libgomp, then I'm more stumped (I thought it was something about setting/parsing the OMP_* options). I fail to imagine how the commit you referenced would touch the ABI, but perhaps that's the case. Might be interesting to rebuild python with gcc 13 to see if that changes anything?
I found another issue that contains a segfault in libgomp's initialize_env(): https://github.com/weechat/weechat/issues/2009 . If I understood correctly, it again happens with libgomp 13.2.0, but on Fedora 39.
I reproduced the issue on Debian and Ubuntu releases whose apt packages contain gomp 13, while earlier releases with gomp 12 all pass fine: https://github.com/traversaro/reproduce-python-gomp-deepbind-issue/actions/runs/6172933871. On the other hand, Fedora 38 has gomp 13.2.0 but does not reproduce the error, and similarly the latest Arch does not reproduce the problem.
The issue there seems to happen even with PHP. I wonder if it happens in general when using dlopen.
I tested with casadi, and the issue did not happen when using a simple C++ example (I tested https://github.com/casadi/casadi/blob/main/docs/examples/cplusplus/ipopt_nl.cpp).
Just to be sure I created a minimal C-based test, and indeed the issue does not appear to happen with that, see https://github.com/traversaro/reproduce-python-gomp-deepbind-issue/actions/runs/6174144211 and https://github.com/traversaro/reproduce-python-gomp-deepbind-issue/blob/main/test.c .
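For readers who do not want to open the linked repository, a plain C loader of this kind could look roughly like the sketch below. It is not a copy of the linked test.c: the helper name try_dlopen and the messages it prints are assumptions made for illustration. It only shows how RTLD_DEEPBIND can be passed to dlopen from C, outside of Python.

```c
#define _GNU_SOURCE /* RTLD_DEEPBIND is a glibc extension */
#include <assert.h>
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical helper (not taken from the thread): try to dlopen `path`
 * with the given flags. Returns 0 on success and 1 if dlopen fails. */
static int try_dlopen(const char *path, int flags)
{
    assert(path != NULL);
    fprintf(stderr, "Trying to load %s\n", path);

    void *handle = dlopen(path, flags);
    if (handle == NULL) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }

    /* The handle is intentionally left open: constructors have already run. */
    fprintf(stderr, "dlopen succeeded\n");
    return 0;
}
```

A call such as try_dlopen("libgomp.so.1", RTLD_LAZY | RTLD_DEEPBIND) from main would exercise the same load path as the Python one-liner. On older glibc versions the program may need to be linked with -ldl.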
I was able to reproduce the problem without libgomp, just with a manually coded shared lib, i.e. testso.c:
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>

extern char **environ; // Declared explicitly so that no feature-test macro is needed

// Constructor: runs as soon as the shared library is loaded
static void __attribute__((constructor))
initialize_env (void)
{
  char **env;
  fprintf(stderr, "Print debug\n");
  env = environ;
  fprintf(stderr, "environ %p env %p\n", (void *) env, (void *) environ);
  // No NULL check on environ: this loop segfaults when environ is NULL
  for (env = environ; *env != 0; env++)
    {
      fprintf(stderr, "%s\n", *env);
    }
  return;
}
gcc -shared -fPIC testso.c -o testso.so
(testsegfault) traversaro@IITICUBLAP257:~/test_ipopt_dir$ python -c "import ctypes; import os; ctypes._dlopen('./testso.so', os.RTLD_DEEPBIND)"
Print debug
environ (nil) env (nil)
Segmentation fault
While in normal use:
Trying to load with RTLD_LAZY|RTLD_DEEPBIND ./testso.so
Print debug
environ 0x7ffcf93cefc0 env 0x7ffcf93cefc0
For some reason the environ global variable is set to 0/NULL.
So perhaps we should move the issue to Python feedstock?
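To make the failure mode concrete, here is a minimal sketch of how a constructor like the one in testso.c could tolerate environ being NULL. The helper print_env and the constructor name initialize_env_guarded are hypothetical names chosen for this example; this is not the libgomp code nor the upstream fix, just an illustration of treating a NULL environ as an empty environment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

extern char **environ;

/* Hypothetical helper: print every entry of a NULL-terminated environment
 * array to `stream` and return the number of entries. A NULL array is
 * treated as an empty environment instead of being dereferenced. */
static size_t print_env(FILE *stream, char **envp)
{
    assert(stream != NULL);

    size_t count = 0;
    if (envp == NULL)
        return 0;

    for (char **e = envp; *e != NULL; e++) {
        fprintf(stream, "%s\n", *e);
        count++;
    }
    return count;
}

/* Guarded variant of the constructor from testso.c (name is an assumption). */
static void __attribute__((constructor))
initialize_env_guarded(void)
{
    size_t n = print_env(stderr, environ);
    fprintf(stderr, "environ %p, %zu entries\n", (void *)environ, n);
}
```

Built as a shared library in the same way as testso.c, this variant would be expected to report zero entries rather than crash when loaded in a process where environ is NULL.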
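Ok, I think this is the combination of two different behaviours/problems:
* P1: constructors of shared library opened by dlopen with RTLD_DEEPBIND on Python on conda-forge/Debian have environ==NULL
  * I am not sure why this happens, and if it is expected behaviour or a bug
* P2: libgomp >= 13 segfaults if environ==NULL
  * environ==NULL is a valid state, for example caused by calling clearenv() on Linux.
P2 can be reproduced easily on libgomp >= 13 with this MWE: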
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

int main () {
  // Leave environ NULL before libgomp's constructor runs
  clearenv();
  void * handle = dlopen("libgomp.so.1", RTLD_NOW);
  if (handle) {
    fprintf(stderr, "dlopen of libgomp.so.1 done correctly.\n");
    return EXIT_SUCCESS;
  } else {
    fprintf(stderr, "dlopen of libgomp.so.1 failed with error: %s.\n", dlerror());
    return EXIT_FAILURE;
  }
}
To run (with -ldl placed after the source file, so that linkers using --as-needed do not drop it):
gcc test_gomp_segfault.c -o test_gomp_segfault -ldl
./test_gomp_segfault
I will open a bug upstream in GCC for P2.
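Because the MWE above crashes the whole process on an affected libgomp, it can be convenient to run the dlopen in a child process and inspect how the child terminated. The following is only a sketch under that assumption: the helper dlopen_in_child and its return-code convention are invented for this example and do not come from the thread or from GCC.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <dlfcn.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: dlopen `lib` in a forked child, optionally after
 * clearenv(), so that a crash in a library constructor only kills the child.
 * Returns 0 if dlopen succeeded, 1 if it failed cleanly, -N if the child
 * was killed by signal N, and 2 on fork/waitpid errors. */
static int dlopen_in_child(const char *lib, int flags, int clear_env)
{
    assert(lib != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return 2;

    if (pid == 0) {
        /* Child: optionally empty the environment, then load the library. */
        if (clear_env)
            clearenv();
        void *handle = dlopen(lib, flags);
        _exit(handle != NULL ? 0 : 1);
    }

    /* Parent: wait for the child and translate its exit status. */
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return 2;
    if (WIFSIGNALED(status))
        return -WTERMSIG(status);
    return WIFEXITED(status) ? WEXITSTATUS(status) : 2;
}
```

Based on the behaviour described in this thread, dlopen_in_child("libgomp.so.1", RTLD_NOW, 1) would be expected to return a negative value (the child killed by SIGSEGV) with an affected libgomp 13, and 0 with libgomp <= 12 or a fixed version. On older glibc versions the program may need to be linked with -ldl.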
The issue was fixed upstream for GCC14, see:
The patch is huge, but ignoring indentation changes it can be summarized as a single-line change, which may be better suited for a backport since it reduces the risk of patch conflicts.
Great job!
The patch is huge, but ignoring indentation changes it can be summarized as a single-line change
Proof of that statement, using GitHub's UI.
P1: constructors of shared library opened by dlopen with RTLD_DEEPBIND on Python on conda-forge/Debian have environ==NULL
* I am not sure why this happens, and if it is expected behaviour or a bug
It turns out that this was also working fine in gomp <= 12 and does not work in gomp 13, so I opened an issue for that as well: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111556 . However, to be honest, I am not sure whether this is a problem in libgomp, in glibc, or simply a consequence of how ELF and the POSIX spec interact.
Solution to issue cannot be found in the documentation.
Issue
If I try to dlopen libgomp 13.* with RTLD_DEEPBIND from a Python environment, I obtain a segfault. A simple reproducer is just the command:
python -c "import ctypes; import os; ctypes._dlopen(os.environ['CONDA_PREFIX']+'/lib/libgomp.so.1', os.RTLD_DEEPBIND)"
The issue does not appear if:
The backtrace is the following:
and seems to indicate that something is going wrong around https://github.com/gcc-mirror/gcc/blob/releases/gcc-13.2.0/libgomp/env.c#L2062 . I have a few ideas to investigate this further, like debugging the value of the environ global variable, but I am not sure when I will have time for this, so in the meantime I opened this issue.
Downstream issue: https://github.com/conda-forge/casadi-feedstock/issues/91 .
Installed packages
Environment info