Closed: elstoc closed this issue 1 week ago.
I think it might be related to color equalizer. I retried with a clean config and unedited images; as soon as I enable color equalizer on an image I get the segfault.
Ping @jenshannoschwalm?
Bisect suggests 5aa29dc13fa1484856453c9e2b45faab87698067 is the first bad commit
No idea so far. Using CE daily without issues; also the code has not changed for quite a while except the new OMP macro usage. Maybe a log can give a clue: -d pipe for a first try. Just recompiled everything and have no issues ...
I guess we have to go the hard way; there seem to be some uses of DT_OMP_FOR_SIMD that look wrong. Could you try with DT_OMP_FOR in interpolate_bilinear, found in fast_guided_filter.h? That one simply looks wrong to me.
Likewise, I'm using CE often.
Could you try with DT_OMP_FOR in interpolate_bilinear, found in fast_guided_filter.h? That one simply looks wrong to me.
Not entirely sure what you mean. If you mean to replace DT_OMP_FOR_SIMD with DT_OMP_FOR on line 103, that didn't seem to work. Edit: looking at your latest PR, it seems that was what you meant. Building latest master after pulling that PR didn't change anything.
Also seeing something similar I guess:
Thread 289 "worker 5" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffdf78006c0 (LWP 28537)]
0x00007fffd810c8ce in _prefilter_chromaticity._omp_fn.3 ()
at /home/panda/Downloads/darktable/src/iop/colorequal.c:544
544 const float cv[2] = { a_full[4 * k + 0] * uv[0] + a_full[4 * k + 1] * uv[1] + b_full[2 * k + 0],
(gdb) bt full
#0 0x00007fffd810c8ce in _prefilter_chromaticity._omp_fn.3 ()
at /home/panda/Downloads/darktable/src/iop/colorequal.c:544
uv = {<optimized out>, <optimized out>}
cv = {<optimized out>, <optimized out>}
k = <optimized out>
k = <optimized out>
UV = Python Exception <class 'gdb.MemoryError'>: Cannot access memory at address 0x7ffdf77f0e08
I tried removing DT_OMP_FOR(collapse(2)) entirely from the interpolate_bilinear function in fast_guided_filter.h, and that did prevent the issue. So we're definitely looking in the right place.
Similarly if I replace it with the code that was removed in 5aa29dc, it works
You mean if in interpolate_bilinear you replace DT_OMP_FOR(collapse(2)) by

```c
#ifdef _OPENMP
#pragma omp parallel for collapse(2) default(none) \
  dt_omp_firstprivate(in, out, width_out, height_out, width_in, height_in, ch) \
  schedule(simd:static)
#endif
```

that "works"?
Pinging @ralfbrown and @dterrahe here, I don't yet fully understand that new OMP macro stuff; what could be wrong here? Might __DT_CLONE_TARGETS__ be a problem? Could it be we have some parameters wrongfully used as shared?
Clone targets shouldn't be an issue, the directive basically tells the compiler to compile the function multiple times, with a different target architecture each time. But I won't categorically rule it out.
Since all of the variables named in the original OpenMP directive are const, it wouldn't matter whether they are declared shared or private, since there are no modifications to be propagated to other threads.
DT_OMP_FOR(collapse(2)) expands to #pragma omp parallel for default(firstprivate) schedule(static) collapse(2)
DT_OMP_FOR_SIMD(collapse(2)) expands to #pragma omp parallel for simd default(firstprivate) schedule(simd:static) collapse(2)
@elstoc Does your copy still work correctly if you use the original directive but remove the "simd:" from it? That would give the equivalent of DT_OMP_FOR. Does it work correctly if you replace the DT_OMP_FOR with just DT_OMP_FOR()? If so, the problem is related to merging the nested loops into a mega-loop, but parallelizing just the outer loop gives nearly all of the speedup we'd get with the collapse(2).
You mean if in interpolate_bilinear you replace DT_OMP_FOR(collapse(2)) by...
Yes. Works with this replacement
Does your copy still work correctly if you use the original directive but remove the "simd:" from it?
Yes. Also works.
Does it work correctly if you replace the DT_OMP_FOR with just DT_OMP_FOR()?
No. Crashes.
Hmm. Does it work if you keep the DT_OMP_FOR but remove the __DT_CLONE_TARGETS__? Beginning to suspect a compiler bug....
Does it work if you keep the DT_OMP_FOR but remove the __DT_CLONE_TARGETS__?
No. BTW, I also tried downgrading gcc (from 14.1.1 to 13.2.1) to no effect.
Beginning to suspect a compiler bug....
Doesn't this look more like a macro problem? Are you sure the #pragma omp parallel for default(firstprivate) is fully correct?
Also - I am very surprised that the bug is so obvious for Chris while I didn't have a single crash for a long time, also using master.
gcc (GCC) 13.2.1 20240316 (Red Hat 13.2.1-7) here.
That was going to be my next try: replace the DT_OMP_FOR with

```c
#ifdef _OPENMP
#pragma omp parallel for default(firstprivate) schedule(static) collapse(2)
#endif
```

which should be exactly the pragma that DT_OMP_FOR(collapse(2)) expands into.
default(none) forces manual naming of the sharing status of all variables used by the loop, so if it compiled without errors, all of the variables have been named, and they are all firstprivate, which is what default(firstprivate) gives you without having to explicitly name the variables in the pragma.
@ralfbrown that #ifdef replacement also caused failure.
Clutching at straws here but worth mentioning that I only build using the build.sh script, and clear both the build and install directories beforehand (just in case the people who have it working use some other mechanism).
My build script, but I guess nothing special:
```sh
#!/bin/sh
DTDIR=/home/hanno/sources/darktable
ANSWER=""
if [ ! -d "$DTDIR" ]; then
  echo "no darktable directory found"
  read -t 10 ANSWER
  exit
fi
cd $DTDIR
if [ -d "build" ]; then
  echo "uninstall old builds <sudo>"
  cd build
  sudo make uninstall
  make clean
  cd ..
  sudo rm -fr /home/hanno/sources/darktable/build
  sudo rm -fr /home/hanno/.cache/ccache
fi
git checkout master
git pull upstream master
git fetch upstream
git submodule update
rm -fr /home/hanno/.cache/darktable/cached_v*
./build.sh --prefix /usr/local --disable-game --disable-kwallet --disable-unity --enable-use_libraw --build-type Release --install --sudo
```
Have you tried building with clang, possibly ruling out a compiler problem? Last line here:

```sh
export CC=/usr/bin/clang
export CXX=/usr/bin/clang++
./build.sh --prefix /usr/local --disable-game --disable-kwallet --disable-unity --enable-use_libraw --build-type Release --install --sudo
```
BTW - not mentioning ArchL... here but it wouldn't be the first time :-)
Have you tried building with clang
I get a bunch of CMake errors when trying to build with clang which I don't know how to resolve, so I'll stick with gcc.
Have you tried building with clang, possibly ruling out a compiler problem? Last line here
does not change anything unfortunately. Still crashing in colorequal.c:544
A summary as I understand it for now. There could be: 1) a bug in the way we use the for default(firstprivate) pragma - @elstoc's tests seem to indicate that; 2) the pragma could be correct but there is a compiler/OMP issue depending on build or system; 3) a bug in CE code that triggers the issue on some systems - @piratenpanda's report might point to that; or 4) something else we don't know yet.
@piratenpanda it might be good to report your system here :-) (distribution, gcc version, libgomp?)
There is a hypothesis I would like you both to test: could you compile with this colorequal.c replacing what we have? colorequal.zip
Seems to work for me with the provided colorequal.zip
also running arch here, gcc 14.1.1
For me colorequal.zip does not work, but the bt full output is now longer:
#0 0x00007fffd81af8ac in _prefilter_chromaticity._omp_fn.3 ()
at /home/panda/Downloads/darktable/src/iop/colorequal.c:544
uv = {<optimized out>, <optimized out>}
cv = {<optimized out>, <optimized out>}
k = <optimized out>
k = <optimized out>
UV = 0x7ffe45800040
saturation = 0x7ffe11c00040
sat_shift = <optimized out>
pixels = <optimized out>
a_full = 0x7ffe0c200040
b_full = 0x7ffe0bc00040
#1 0x00007ffff6fce997 in gomp_thread_start (xdata=<optimized out>)
at /usr/src/debug/gcc/gcc/libgomp/team.c:129
team = 0x7fffa4026480
task = 0x7fffa40284e0
data = <optimized out>
thr = <optimized out>
pool = <optimized out>
local_fn = 0x7fffd81af740 <_prefilter_chromaticity._omp_fn.3>
local_data = 0x7fffd33e11e0
#2 0x00007ffff1dbfded in start_thread (arg=<optimized out>)
at pthread_create.c:447
ret = <optimized out>
pd = <optimized out>
out = <optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140729162663616, -2270786405953605323, 140729162663616, -57928, 110, 140736737418544, -2270786405907467979, -2269964834939244235}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
not_first_call = <optimized out>
#3 0x00007ffff1e430dc in clone3 ()
at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
So both of you are on arch ...
@piratenpanda could you try to use DT_OMP_FOR() at that specific loop at line 544? That loop is so silly-easy, how could there be something wrong?
Any chance to check on a debug version? Or maybe gdb?
@ralfbrown any fresh idea?
Clutching at straws too.... I might get a fresh idea looking at the generated code. After compiling, run

```sh
objdump -S build/lib64/darktable/plugins/libcolorequal.so > colorequal.s
```

and upload colorequal.s. (You can compress it, as it will be fairly large.)
In the meantime, I'll suggest @elstoc try shared instead of firstprivate, i.e.

```c
#ifdef _OPENMP
#pragma omp parallel for default(shared) schedule(static) collapse(2)
#endif
```
I'll suggest @elstoc trying shared instead of firstprivate
crash
upload colorequal.s
colorequal.s.txt (renamed to .txt so github allows it)
I think I found a bug in the CE code:
```c
for(size_t k = 0; k < pixels; k++)
{
  // For each correction factor, we re-express it as a[0] * U + a[1] * V + b
  const float uv[2] = { UV[2 * k + 0], UV[2 * k + 1] };
  const float cv[2] = { a_full[4 * k + 0] * uv[0] + a_full[4 * k + 1] * uv[1] + b_full[2 * k + 0],
                        a_full[4 * k + 2] * uv[0] + a_full[4 * k + 3] * uv[1] + b_full[2 * k + 1] };
```
This seems to be bad; we should align the float arrays or use two const floats each. Will do the PR tonight.
Interestingly, both of my machines do not crash. Anything I can do, let me know.
Looking through the objdump output, the one substantive difference I see between Chris's compilation and mine is that mine does (float)size_t_var using the vcvtsi2ss instruction while Chris's uses vcvtusi2ss. The latter is "more correct", but a quick Google indicates that that instruction was only added with AVX-512, which my machine doesn't have. Both of our compilations vectorize the interpolate_bilinear loop using 256-bit vector registers; it may take additional compiler flags to get 512-bit vectorization (using zmmN registers instead of ymmN).
@elstoc as a test, try changing all of the size_t's in interpolate_bilinear to ssize_t (two s's, the signed version). If that eliminates the crash, then the vcvtusi2ss instruction is the culprit and we need to figure out how to keep GCC from using it on your machine.
that instruction was only added with AVX-512, which my machine doesn't have
Ooh have I got bleeding edge hardware as well as a bleeding edge distro? Arch is the canary in the coal mine
try changing all of the size_t's in interpolate_bilinear to ssize_t
crash
Well, I am now officially out of ideas other than reverting that instance of DT_OMP_FOR back to the original written-out pragma, with a comment to the effect that the written-out version is needed to avoid crashes.
Just a reminder: @elstoc's machine didn't crash when the buffers were enlarged. @elstoc @piratenpanda would you try colorequal.zip?
@ralfbrown there are two ideas: 1) make sure each buffer is at least one cache line larger than required per "plane"; 2) don't use non-aligned 2-float arrays on the stack. Let's see ...
reverting that instance of DT_OMP_FOR
Yeah, but think how many of these there were in that PR. There could be a bunch of other crashes just waiting to happen.
reverting that instance of DT_OMP_FOR
Yeah, but think how many of these there were in that PR. There could be a bunch of other crashes just waiting to happen.
I agree with @elstoc here. Trying to find the OMP specifications for simd default(firstprivate) ...
try colorequal.zip
crash
same for me
@elstoc is your current crash in _prefilter_chromaticity at line 544(-ish) or in interpolate_bilinear? I just took a close look at your original backtrace, and its crash was at 544, same as piratenpanda reported. Messing with the DT_OMP_FOR in interpolate_bilinear may merely be causing downstream effects (due to inlining?) which determine whether there's a crash or not....
@jenshannoschwalm Even though the uv and cv arrays aren't aligned, the compiler is vectorizing and optimizing them away into 256-bit vector registers (thus running four loop iterations at once, plus cleanup of the leftover few). Chris's version is using a bunch of AVX-512 instructions.
Latest backtrace (from the colorequal.c provided by @jenshannoschwalm ) is darktable_bt_O3ZQN2.txt
Here's mine darktable_bt_DER5N2.txt
Could you check what happens with modified loop parameters? Start with 8 for example and stop at k < pixels - 8? What happens when doing that loop with DT_OMP_FOR()?
Both backtraces show the crash in the loop in _prefilter_chromaticity, on the line setting the (first) value of cv*. So it appears either a_full or b_full is being overrun. But the buffer size has been increased by at least 64 over the actual number of pixels....
In looking over the code for the loop, I noticed that _get_satweight is actually clamping the saturation more tightly than necessary: [-0.5, +0.5], while the interpolation table actually supports [-1.0, 1.0). But that wouldn't be causing a crash.
In looking over the code for the loop, I noticed that _get_satweight is actually clamping the saturation more tightly than necessary: [-0.5, +0.5], while the interpolation table actually supports [-1.0, 1.0). But that wouldn't be causing a crash.
See const double val = 0.5 / (double)SATSIZE * (double)i; in the initialization of the table.
I did see that, but if you look more closely, isat in _get_satweight is limited to the range SATSIZE/2 to 3*SATSIZE/2 because 1+CLAMP() can only take values 0.5 to 1.5, and the array index used is (int)floor(isat). But the array itself runs from 0 to 2*SATSIZE, and all of those entries have been filled by _init_satweights.
Right, now I see that, and it is indeed restricting the effect. Probably I simply overlooked that while introducing the contrast thing.
Both backtraces show the crash in the loop in _prefilter_chromaticity, on the line setting the (first) value of cv*. So it appears either a_full or b_full is being overrun.
Due to the optimizing it could also be a UV overrun.
One other idea: could this all be due to the #include "common/extra_optimizations.h"?
One other idea: could this all be due to the #include "common/extra_optimizations.h"?
I removed it and it didn't change anything unfortunately
Yeah, but think how many of these there were in that PR. There could be a bunch of other crashes just waiting to happen.
Maybe, but we had no crash reports before, so it may be a quite restricted issue in this part of the code, as the change fixes the bug for you and @piratenpanda.
Sure, but you'd had no report about a crash until I reported it either. And I haven't really tested master much since that PR went in - I literally built it and it crashed because of an existing edit. Also, that PR is only a couple of weeks old and not everyone updates that frequently.
I literally built it and it crashed because of an existing edit
would you share raw & xmp?
For me it happens as soon as colorequalizer is enabled. Doesn't matter if presets or just manual changes
Built the latest master (9920ae7c1) and get a segfault shortly afterwards. The lighttable view shows for a few seconds and then it crashes. Running on ArchLinux with nVidia, but I get the same thing even when running with --disable-opencl. Backtrace: darktable_bt_87TZN2.txt