
Darktable git crashes in Permutohedral.h:485 #10082

Closed: piratenpanda closed this issue 2 years ago

piratenpanda commented 3 years ago

Describe the bug/issue
While working on pictures, switching to another picture, or just opening the first image, darktable has been crashing for a while now with the same error in Permutohedral.h:485 (see attached stack trace).

To Reproduce
I really don't know how to reproduce it; as far as I can observe, it happens randomly. Restarting darktable and running the same steps that led to the crash does not crash darktable again.

Stack trace

Thread 99 "worker res 1" received signal SIGBUS, Bus error.
[Switching to Thread 0x7fff1ffff640 (LWP 23858)]
0x00007fff70c12471 in PermutohedralLattice<5, 4>::splat (this=0x7fffc5fd72b0, position=position@entry=0x7fff1ffee440, value=value@entry=0x7fff1ffee430, replay_index=replay_index@entry=1117698, thread_index=thread_index@entry=3) at /home/panda/Downloads/dtcompile/darktable/src/iop/Permutohedral.h:485
485       barycentric[D - rank[i]] += (elevated[i] - greedy[i]) * scale;
(gdb) bt full
#0  0x00007fff70c12471 in PermutohedralLattice<5, 4>::splat(float*, float*, unsigned long, int) const
    (this=0x7fffc5fd72b0, position=position@entry=0x7fff1ffee440, value=value@entry=0x7fff1ffee430, replay_index=replay_index@entry=1117698, thread_index=thread_index@entry=3)
    at /home/panda/Downloads/dtcompile/darktable/src/iop/Permutohedral.h:485
        i = 4
        elevated = 
          {446365920, 446365856, -892731712, 4.28966618, 4.3655386, -26.3971424}
        greedy = {446365946, 446365914, -892731782, 0, 0, -30}
        rank = 
          {-1062557013, 1084926635, 0, 1060570776, -1086912872, 1093315243}
        barycentric = {0, 0, 0, 0, 0, 0, 0}
        key = 
            {hash = <optimized out>, key = {<optimized out>, <optimized out>, <optimized out>, <optimized out>, <optimized out>}}
        sum = <optimized out>
#1  0x00007fff70c108d9 in process._omp_fn.1(void) ()
    at /home/panda/Downloads/dtcompile/darktable/src/iop/bilateral.cc:226
        pos = {7.03762054, 223182928, 0.287140638, 0.208565861, 5.90258074}
        val = {0.0574281216, 0.041713167, 0.0295129027, 1}
        i = 34
        in = 0x7fff2821b060
        thread = <optimized out>
        index = 1117698
        j = 536798272
        ivoid = <optimized out>
        roi_in = 0x7fffc5fd7830
        ch = <optimized out>
        sigma = {0.206988841, 0.206988841, 5.00000048, 5.00000048, 200}
        lattice = 
          {nData = 1158522, nThreads = 4, scaleFactor = 0x7fffb004b8f0, canonical = 0x7fffb004a220, replay = 0x7ffefca0a010, hashTables = 0x7fffb00bc118}
        lattice = 
          {nData = 1896583536, nThreads = 32767, scaleFactor = 0xffff00009ff1, canonical = 0xb916a4c0b90a1940, replay = 0x3b0149183b77eb76, hashTables = 0x39aa268f39a3f63b}
        data = <optimized out>
        ch = <optimized out>
        sigma = 
          {2.9387583e-39, 0, -3.01532605e+30, 4.59163468e-41, -16.0413265}
        rad = <optimized out>
#2  0x00007ffff74e46c6 in gomp_thread_start (xdata=<optimized out>)
    at /build/gcc/src/gcc/libgomp/team.c:125
        team = 0x7fffb0022ce0
        task = 0x7fffb00234a8
        data = <optimized out>
        pool = 0x7fffb0022c10
        local_fn = 0x7fff70c10740 <process._omp_fn.1(void)>
        local_data = 0x7fffc5fd72e0
#3  0x00007ffff7551259 in start_thread () at /usr/lib/libpthread.so.0
#4  0x00007ffff21f95e3 in clone () at /usr/lib/libc.so.6


ptilopteri commented 3 years ago

Perhaps the issue is related to your "-dirty" build.

Remove the build directory and the executables directory and rebuild.

piratenpanda commented 3 years ago

That does not change the version name, and darktable instantly crashed afterwards, so I'm afraid that's not the problem.

Just had another crash where bilateral.cc was the topmost frame:

#0  0x00007fff70c20908 in process._omp_fn.1(void) ()
    at /home/panda/Downloads/dtcompile/darktable/src/iop/bilateral.cc:224
        pos = {16.9730854, 180.494263, -nan(0x400000), 1.06619692, 33.872673}
        val = {-nan(0x400000), 0.213239372, 0.169363365, 1}
        i = 83
        in = 0xffc000003f049e60
        thread = 3
        index = 18428729679490842625
        j = 456156224
        ivoid = <optimized out>
        roi_in = 0x7fffc5fd6520
        ch = <optimized out>
Python Exception <class 'gdb.MemoryError'>: Cannot access memory at address 0xffc00000ffc00010
#1  0x00007ffff74cf6c6 in gomp_thread_start (xdata=<optimized out>)
    at /build/gcc/src/gcc/libgomp/team.c:125
        team = 0x7fffb001f310
        task = 0x7fffb001fad8
        data = <optimized out>
        pool = 0x7fffb001f240
        local_fn = 0x7fff70c207f0 <process._omp_fn.1(void)>
        local_data = 0x7fffc5fd5fa0
#2  0x00007ffff753c259 in start_thread () at /usr/lib/libpthread.so.0
#3  0x00007ffff21e45e3 in clone () at /usr/lib/libc.so.6

ptilopteri commented 3 years ago

If the version name still says "dirty", nothing was accomplished and you have not confirmed whether the problem remains.

piratenpanda commented 3 years ago

I removed /opt/darktable, did a completely fresh git checkout and submodule init, and changed to the CR3 branch. What else do you suggest to avoid leftovers from older builds? I don't see what else I can do, to be honest.

ptilopteri commented 3 years ago

Something is failing if you still have a "dirty" build.

piratenpanda commented 3 years ago

It also shows "dirty" for https://aur.archlinux.org/packages/darktable-cr3-git, so I don't know where this could come from.

parafin commented 3 years ago

It’s dirty because of the switch to the CR3 rawspeed branch.

github-actions[bot] commented 2 years ago

This issue did not get any activity in the past 60 days and will be closed in 365 days if no update occurs. Please check if the master branch has fixed it and report again or close the issue.

piratenpanda commented 2 years ago

Still happens with a clean compilation from the latest git, and occurs frequently when using the surface blur module.

0x00007fffa5380908 in process._omp_fn.1(void) () at /home/panda/Downloads/dtcompile/darktable/src/iop/bilateral.cc:224
224         float pos[5] = { i * sigma[0], j * sigma[1], in[0] * sigma[2], in[1] * sigma[3], in[2] * sigma[4] };
(gdb) bt full
#0  0x00007fffa5380908 in process._omp_fn.1(void) ()
    at /home/panda/Downloads/dtcompile/darktable/src/iop/bilateral.cc:224
        pos = {223.440048, 37.7170868, nan(0x400000), 16.0017223, 14.6885347}
        val = {nan(0x400000), 0.0800086111, 0.0734426752, 1}
        i = -1077966591
        in = 0x7fc000007fc00010
        thread = 1
        index = 2143289345
        j = 1722680320
        ivoid = <optimized out>
        roi_in = 0x7fffcd7d7900
        ch = <optimized out>
Python Exception <class 'gdb.MemoryError'>: Cannot access memory at address 0x7fc000007fc00010
#1  0x00007ffff73026c6 in gomp_thread_start (xdata=<optimized out>)
    at /build/gcc/src/gcc/libgomp/team.c:125
        team = 0x7fffb801d950
        task = 0x7fffb801df68
        data = <optimized out>
        pool = 0x7fffb801d880
        local_fn = 0x7fffa53807f0 <process._omp_fn.1(void)>
        local_data = 0x7fffcd7d7380
#2  0x00007ffff72ce259 in start_thread () at /usr/lib/libpthread.so.0
#3  0x00007ffff1eed5e3 in clone () at /usr/lib/libc.so.6
paolodepetrillo commented 2 years ago

I am wondering if this might be related to change #6158.

Before that change, omp_get_max_threads() was called here and space for that many threads was allocated within the PermutohedralLattice object. Then in the following for loop omp_get_thread_num() would be called and return a value from 0 to omp_get_max_threads() - 1.

But after the #6158 change, omp_get_num_procs() is called instead, from dt_get_num_threads(). While that value would typically be the same as omp_get_max_threads(), it's not guaranteed. If it returns a value less than omp_get_max_threads(), for whatever reason, then not enough space will be allocated within the PermutohedralLattice object on the stack, and some nearby local variables on the stack will be overwritten, resulting in a crash similar to the ones above.

I can force a crash like this by running darktable with the command line parameter "-t N" to set the max threads. For example, I have a 6-core hyperthreaded CPU, so omp_get_num_procs() and omp_get_max_threads() both return 12. If I run darktable with "-t 13", it crashes when I enable the surface blur module.
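
A minimal OpenMP sketch of the mismatch described above (hypothetical buffer and variable names, not the actual darktable code): per-thread storage is sized from the processor count, while the parallel region indexes it with thread numbers from a team that can be larger.

// Sketch only: storage sized with omp_get_num_procs(), but the team
// size follows omp_get_max_threads(), as after change #6158.
// Build with: g++ -fopenmp mismatch.cc -o mismatch
#include <omp.h>
#include <cstdio>
#include <vector>

int main()
{
  // Storage sized from the processor count...
  const int allocated = omp_get_num_procs();
  std::vector<int> per_thread_state(allocated, 0);

  // ...but the team size follows omp_get_max_threads(), which
  // e.g. "darktable -t 13" can push above the processor count.
  #pragma omp parallel
  {
    const int t = omp_get_thread_num();
    if (t < allocated)
      per_thread_state[t]++; // safe slot
    else
      printf("thread %d has no slot; the real code would write out of "
             "bounds here, clobbering neighbouring stack data\n", t);
  }
  printf("allocated %d slots, max threads %d\n",
         allocated, omp_get_max_threads());
}

Without the guard, the out-of-bounds write lands in whatever sits next to the object on the stack, which would be consistent with the corrupted lattice locals in the traces above.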

piratenpanda commented 2 years ago

Even with this change reverted, dt crashes as soon as I enable the surface blur module, with related/identical errors.

#0  0x00007fff9a28698f in process._omp_fn.1(void) () at /home/panda/Downloads/dtcompile/darktable/src/iop/bilateral.cc:227
        i = 1311
        in = 0x7fff1c927200
        thread = 2
        index = 754460
        j = 2145180196
        ivoid = <optimized out>
        roi_in = 0x7fdcda247fdcda24
        ch = <optimized out>
        sigma = {0.413977683, 0.413977683, 200, 200, 200}
        lattice = 
          {nData = 1160720, nThreads = 4, scaleFactor = 0x7fffb403ee00, canonical = 0x7fffb40416a0, replay = 0x7fff44670010, hashTables = 0x7fffb4043b88}
Python Exception <class 'gdb.MemoryError'>: Cannot access memory at address 0x120
        lattice = 
#1  0x00007ffff73026c6 in gomp_thread_start (xdata=<optimized out>)
    at /build/gcc/src/gcc/libgomp/team.c:125
        team = 0x7fffb401d950
        task = 0x7fffb401e040
        data = <optimized out>
        pool = 0x7fffb401d880
        local_fn = 0x7fff9a286800 <process._omp_fn.1(void)>
        local_data = 0x7fffcdfd8380
#2  0x00007ffff72ce259 in start_thread () at /usr/lib/libpthread.so.0
#3  0x00007ffff1eed5e3 in clone () at /usr/lib/libc.so.6

Could this potentially be an OpenMP issue?

piratenpanda commented 2 years ago

Doesn't seem to happen without OpenCL; at least I haven't managed to get it to crash without it.

piratenpanda commented 2 years ago

As per https://discuss.pixls.us/t/amd-opencl-problems-in-surface-blur-darktable-module/28507/13, it seems that NaNs in RCD, which cause green artifacts for me, make the surface blur module crash. I'll open another issue for that.
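
For reference, a minimal sketch (hypothetical, not darktable code) of how a NaN pixel value can become a wild array index inside the lattice splat, consistent with the -nan(0x400000) values and the garbage rank entries in the traces above:

// Sketch only: a NaN coordinate poisons the integer lattice math.
#include <cmath>
#include <cstdio>

int main()
{
  const float nan_pixel = std::nanf(""); // e.g. a NaN produced by demosaic
  // Rounding NaN yields NaN, and casting NaN to int is undefined
  // behaviour (typically INT_MIN on x86), so the derived rank is garbage.
  const int greedy = static_cast<int>(std::floor(nan_pixel));
  const int rank = greedy; // stand-in for the real rank computation
  // An index like D - rank[i] then points gigabytes away from
  // barycentric[], matching the SIGBUS at Permutohedral.h:485.
  printf("greedy = %d, index D - rank = %lld\n", greedy, 5LL - rank);
}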