Closed: elstoc closed this issue 1 week ago.
I think it might be related to color equalizer. I retried with a clean config and unedited images; as soon as I enable color equalizer on an image I get the segfault.
Ping @jenshannoschwalm?
Bisect suggests 5aa29dc13fa1484856453c9e2b45faab87698067 is the first bad commit
No idea so far. Using CE daily without issues; also the code has not changed for quite a while except the new OMP macro usage. Maybe a log can give a clue: -d pipe for a first try. Just recompiled everything and have no issues ...
I guess we have to go the hard way; there seem to be some uses of DT_OMP_FOR_SIMD that look wrong. Could you try with DT_OMP_FOR in interpolate_bilinear, found in fast_guided_filter.h? That one simply looks wrong to me.
Likewise, I'm using CE often.
Could you try with DT_OMP_FOR in interpolate_bilinear, found in fast_guided_filter.h? That one simply looks wrong to me.
Not entirely sure what you mean. If you mean to replace DT_OMP_FOR_SIMD with DT_OMP_FOR on line 103, that didn't seem to work. Edit: looking at your latest PR, it seems that was what you meant. Building latest master after pulling that PR didn't change anything.
Also seeing something similar I guess:
Thread 289 "worker 5" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffdf78006c0 (LWP 28537)]
0x00007fffd810c8ce in _prefilter_chromaticity._omp_fn.3 ()
at /home/panda/Downloads/darktable/src/iop/colorequal.c:544
544 const float cv[2] = { a_full[4 * k + 0] * uv[0] + a_full[4 * k + 1] * uv[1] + b_full[2 * k + 0],
(gdb) bt full
#0 0x00007fffd810c8ce in _prefilter_chromaticity._omp_fn.3 ()
at /home/panda/Downloads/darktable/src/iop/colorequal.c:544
uv = {<optimized out>, <optimized out>}
cv = {<optimized out>, <optimized out>}
k = <optimized out>
k = <optimized out>
UV = Python Exception <class 'gdb.MemoryError'>: Cannot access memory at address 0x7ffdf77f0e08
I tried removing DT_OMP_FOR(collapse(2)) entirely from the interpolate_bilinear function in fast_guided_filter.h, and that did prevent the issue. So we're definitely looking in the right place.
Similarly if I replace it with the code that was removed in 5aa29dc, it works
You mean if in interpolate_bilinear you replace DT_OMP_FOR(collapse(2)) by

```c
#ifdef _OPENMP
#pragma omp parallel for collapse(2) default(none) \
  dt_omp_firstprivate(in, out, width_out, height_out, width_in, height_in, ch) \
  schedule(simd:static)
#endif
```

that "works"?
Pinging @ralfbrown and @dterrahe here, I don't yet fully understand that new OMP macro stuff; what could be wrong here? Might __DT_CLONE_TARGETS__ be a problem? Could it be we have some parameters wrongfully used as shared?
Clone targets shouldn't be an issue, the directive basically tells the compiler to compile the function multiple times, with a different target architecture each time. But I won't categorically rule it out.
Since all of the variables named in the original OpenMP directive are const, it wouldn't matter whether they are declared shared or private, since there are no modifications to be propagated to other threads.
DT_OMP_FOR(collapse(2)) expands to #pragma omp parallel for default(firstprivate) schedule(static) collapse(2)
DT_OMP_FOR_SIMD(collapse(2)) expands to #pragma omp parallel for simd default(firstprivate) schedule(simd:static) collapse(2)
@elstoc Does your copy still work correctly if you use the original directive but remove the "simd:" from it? That would give the equivalent of DT_OMP_FOR. Does it work correctly if you replace the DT_OMP_FOR with just DT_OMP_FOR()? If so, the problem is related to merging the nested loops into a mega-loop, but parallelizing just the outer loop gives nearly all of the speedup we'd get with the collapse(2).
You mean if in interpolate_bilinear you replace DT_OMP_FOR(collapse(2)) by...
Yes. Works with this replacement
Does your copy still work correctly if you use the original directive but remove the "simd:" from it?
Yes. Also works.
Does it work correctly if you replace the DT_OMP_FOR with just DT_OMP_FOR()?
No. Crashes.
Hmm. Does it work if you keep the DT_OMP_FOR but remove the __DT_CLONE_TARGETS__? Beginning to suspect a compiler bug....
Does it work if you keep the DT_OMP_FOR but remove the __DT_CLONE_TARGETS__?
No. BTW, I also tried downgrading gcc (from 14.1.1 to 13.2.1) to no effect.
Beginning to suspect a compiler bug....
Doesn't this look more like a macro problem? Are you sure the #pragma omp parallel for default(firstprivate) is fully correct?
Also - I am very surprised that the bug is so obvious for Chris while I didn't have a single crash for a long time, also using master.
gcc (GCC) 13.2.1 20240316 (Red Hat 13.2.1-7) here.
That was going to be my next try: replace the DT_OMP_FOR with

```c
#ifdef _OPENMP
#pragma omp parallel for default(firstprivate) schedule(static) collapse(2)
#endif
```

which should be exactly the pragma that DT_OMP_FOR(collapse(2)) expands into.
default(none) forces manual naming of the sharing status of all variables used by the loop, so if it compiled without errors, all of the variables have been named, and they are all firstprivate, which is what default(firstprivate) gives you without having to explicitly name the variables in the pragma.
@ralfbrown that #ifdef replacement also caused failure.
Clutching at straws here but worth mentioning that I only build using the build.sh script, and clear both the build and install directories beforehand (just in case the people who have it working use some other mechanism).
My build script, but I guess nothing special:
```sh
#!/bin/sh
DTDIR=/home/hanno/sources/darktable
ANSWER=""
if [ ! -d "$DTDIR" ]; then
  echo "no darktable directory found"
  read -t 10 ANSWER
  exit
fi
cd $DTDIR
if [ -d "build" ]; then
  echo "uninstall old builds <sudo>"
  cd build
  sudo make uninstall
  make clean
  cd ..
  sudo rm -fr /home/hanno/sources/darktable/build
  sudo rm -fr /home/hanno/.cache/ccache
fi
git checkout master
git pull upstream master
git fetch upstream
git submodule update
rm -fr /home/hanno/.cache/darktable/cached_v*
./build.sh --prefix /usr/local --disable-game --disable-kwallet --disable-unity --enable-use_libraw --build-type Release --install --sudo
```
Have you tried building with clang, possibly ruling out a compiler problem? Last line here:

```sh
export CC=/usr/bin/clang
export CXX=/usr/bin/clang++
./build.sh --prefix /usr/local --disable-game --disable-kwallet --disable-unity --enable-use_libraw --build-type Release --install --sudo
```
BTW - not mentioning ArchL... here but it wouldn't be the first time :-)
Have you tried building with clang
I get a bunch of CMake errors when trying to build with clang which I don't know how to resolve, so I'll stick with gcc.
Have you tried building with clang, possibly ruling out a compiler problem? Last line here
does not change anything unfortunately. Still crashing in colorequal.c:544
A summary as I understand it for now. There could be: 1) a bug in the way we use the for default(firstprivate) pragma - @elstoc's tests seem to indicate that; 2) the pragma could be correct but there is a compiler/OMP issue depending on build or system; 3) a bug in CE code that triggers the issue on some systems - @piratenpanda's report might point to that; or 4) something else we don't know yet.
@piratenpanda it might be good to report your system here :-) (distribution, gcc version, libgomp?)
There is a hypothesis I would like you both to test: could you compile with this colorequal.c replacing what we have? colorequal.zip
Seems to work for me with the provided colorequal.zip
also running arch here, gcc 14.1.1
For me colorequal.zip does not work, but the bt full output is now longer:
#0 0x00007fffd81af8ac in _prefilter_chromaticity._omp_fn.3 ()
at /home/panda/Downloads/darktable/src/iop/colorequal.c:544
uv = {<optimized out>, <optimized out>}
cv = {<optimized out>, <optimized out>}
k = <optimized out>
k = <optimized out>
UV = 0x7ffe45800040
saturation = 0x7ffe11c00040
sat_shift = <optimized out>
pixels = <optimized out>
a_full = 0x7ffe0c200040
b_full = 0x7ffe0bc00040
#1 0x00007ffff6fce997 in gomp_thread_start (xdata=<optimized out>)
at /usr/src/debug/gcc/gcc/libgomp/team.c:129
team = 0x7fffa4026480
task = 0x7fffa40284e0
data = <optimized out>
thr = <optimized out>
pool = <optimized out>
local_fn = 0x7fffd81af740 <_prefilter_chromaticity._omp_fn.3>
local_data = 0x7fffd33e11e0
#2 0x00007ffff1dbfded in start_thread (arg=<optimized out>)
at pthread_create.c:447
ret = <optimized out>
pd = <optimized out>
out = <optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140729162663616, -2270786405953605323, 140729162663616, -57928, 110, 140736737418544, -2270786405907467979, -2269964834939244235}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
not_first_call = <optimized out>
#3 0x00007ffff1e430dc in clone3 ()
at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
So both of you are on arch ...
@piratenpanda could you try to use DT_OMP_FOR() at that specific loop at line 544? That loop is so silly-easy, how could there be something wrong?
Any chance to check on a debug version? Or maybe gdb?
@ralfbrown any fresh idea?
Clutching at straws too.... I might get a fresh idea looking at the generated code. After compiling, run

```sh
objdump -S build/lib64/darktable/plugins/libcolorequal.so > colorequal.s
```

and upload colorequal.s. (You can compress it, as it will be fairly large.)
In the meantime, I'll suggest @elstoc try shared instead of firstprivate, i.e.

```c
#ifdef _OPENMP
#pragma omp parallel for default(shared) schedule(static) collapse(2)
#endif
```
I'll suggest @elstoc trying shared instead of firstprivate
crash
upload colorequal.s
colorequal.s.txt (renamed to .txt so github allows it)
I think I found a bug in the CE code:
```c
for(size_t k = 0; k < pixels; k++)
{
  // For each correction factor, we re-express it as a[0] * U + a[1] * V + b
  const float uv[2] = { UV[2 * k + 0], UV[2 * k + 1] };
  const float cv[2] = { a_full[4 * k + 0] * uv[0] + a_full[4 * k + 1] * uv[1] + b_full[2 * k + 0],
                        a_full[4 * k + 2] * uv[0] + a_full[4 * k + 3] * uv[1] + b_full[2 * k + 1] };
```
This seems to be bad; we should align the float arrays or use two const floats each. Will do the PR tonight.
Interestingly, both of my machines do not crash. Anything I can do, let me know.
Looking through the objdump output, the one substantive difference I see between Chris's compilation and mine is that mine does (float)size_t_var using the vcvtsi2ss instruction while Chris's uses vcvtusi2ss. The latter is "more correct", but a quick Google indicates that that instruction was only added with AVX-512, which my machine doesn't have. Both of our compilations vectorize the interpolate_bilinear loop using 256-bit vector registers; it may take additional compiler flags to get 512-bit vectorization (using zmmN registers instead of ymmN).
@elstoc as a test, try changing all of the size_t's in interpolate_bilinear to ssize_t (two s's, the signed version). If that eliminates the crash, then the vcvtusi2ss instruction is the culprit and we need to figure out how to keep GCC from using it on your machine.
that instruction was only added with AVX-512, which my machine doesn't have
Ooh have I got bleeding edge hardware as well as a bleeding edge distro? Arch is the canary in the coal mine
try changing all of the size_t's in interpolate_bilinear to ssize_t
crash
Well, I am now officially out of ideas other than reverting that instance of DT_OMP_FOR back to the original written-out pragma, with a comment to the effect that the written-out version is needed to avoid crashes.
Just a reminder: @elstoc's machine didn't crash when the buffers were enlarged. @elstoc @piratenpanda would you try colorequal.zip?
@ralfbrown there are two ideas: 1) make sure each buffer is at least one cache line larger than required per "plane"; 2) don't use non-aligned 2-float arrays on the stack. Let's see ...
reverting that instance of DT_OMP_FOR
Yeah, but think how many of these there were in that PR. There could be a bunch of other crashes just waiting to happen.
reverting that instance of DT_OMP_FOR
Yeah, but think how many of these there were in that PR. There could be a bunch of other crashes just waiting to happen.
I agree with @elstoc here. Trying to find the OMP specifications for simd default(firstprivate) ...
try colorequal.zip
crash
same for me
@elstoc is your current crash in _prefilter_chromaticity at line 544(-ish) or in interpolate_bilinear? I just took a close look at your original backtrace, and its crash was at 544, same as piratenpanda reported. Messing with the DT_OMP_FOR in interpolate_bilinear may merely be causing downstream effects (due to inlining?) which determine whether there's a crash or not....
@jenshannoschwalm Even though the uv and cv arrays aren't aligned, the compiler is vectorizing and optimizing them away into 256-bit vector registers (thus running four loop iterations at once, plus cleanup of the leftover few). Chris's version is using a bunch of AVX-512 instructions.
Latest backtrace (from the colorequal.c provided by @jenshannoschwalm ) is darktable_bt_O3ZQN2.txt
Here's mine darktable_bt_DER5N2.txt
Could you check what happens with modified loop parameters? Start with 8 for example and stop at k < pixels - 8? What happens when doing that loop with DT_OMP_FOR()?
Both backtraces show the crash in the loop in _prefilter_chromaticity, on the line setting the (first) value of cv*. So it appears either a_full or b_full is being overrun. But the buffer size has been increased by at least 64 over the actual number of pixels....
In looking over the code for the loop, I noticed that _get_satweight is actually clamping the saturation more tightly than necessary: [-0.5, +0.5], while the interpolation table actually supports [-1.0, 1.0). But that wouldn't be causing a crash.
In looking over the code for the loop, I noticed that _get_satweight is actually clamping the saturation more tightly than necessary: [-0.5, +0.5], while the interpolation table actually supports [-1.0, 1.0). But that wouldn't be causing a crash.
See const double val = 0.5 / (double)SATSIZE * (double)i; in the initialization of the table.
I did see that, but if you look more closely, isat in _get_satweight is limited to the range SATSIZE/2 to 3*SATSIZE/2 because 1+CLAMP() can only take values 0.5 to 1.5, and the array index used is (int)floor(isat). But the array itself runs from 0 to 2*SATSIZE, and all of those entries have been filled by _init_satweights.
Right, now I see that, and it is indeed restricting the effect. Probably I simply overlooked that while introducing the contrast thing.
Both backtraces show the crash in the loop in _prefilter_chromaticity, on the line setting the (first) value of cv*. So it appears either a_full or b_full is being overrun.
Due to the optimizing it could also be a UV overrun.
One other idea: could this all be due to the #include "common/extra_optimizations.h"?
One other idea: could this all be due to the #include "common/extra_optimizations.h"?
I removed it and it didn't change anything unfortunately
Yeah, but think how many of these there were in that PR. There could be a bunch of other crashes just waiting to happen.
Maybe, but we had no crash reports before, so it may be a quite restricted issue in this part of the code, as the change fixes the bug for you and @piratenpanda.
Sure, but you'd had no report about a crash until I reported it either. And I haven't really tested master much since that PR went in - I literally built it and it crashed because of an existing edit. Also, that PR is only a couple of weeks old and not everyone updates that frequently.
I literally built it and it crashed because of an existing edit
would you share raw & xmp?
For me it happens as soon as colorequalizer is enabled. Doesn't matter if presets or just manual changes
Built the latest master (9920ae7c1) and get a segfault shortly afterwards. The lighttable view shows for a few seconds and then it crashes. Running on ArchLinux with nVidia, but I get the same thing even when running with --disable-opencl. Backtrace: darktable_bt_87TZN2.txt