darktable-org / darktable

darktable is an open source photography workflow application and raw developer
https://www.darktable.org
GNU General Public License v3.0

NaNs in RCD opencl demosaicing most likely make the surface blur module crash #10778

Closed: piratenpanda closed this issue 2 years ago

piratenpanda commented 2 years ago

I have been getting a lot of crashes when using the surface blur module lately. See https://github.com/darktable-org/darktable/issues/10082 for reference.

While it wasn't obvious where the issue came from, it seems that NaNs cause the module to crash, as per the last stacktrace and comment here: https://discuss.pixls.us/t/amd-opencl-problems-in-surface-blur-darktable-module/28507/12

I suspect the NaNs are coming from RCD (OpenCL version): when using AMaZE, I don't see green blocks at the border regions, and so far I can't reproduce the crashes I had in the surface blur module.

This has been happening for a while now. I will try to bisect the RCD changes to see if a specific commit introduced the bug, as I was able to use surface blur without issues earlier in the 3.8 development cycle. I am happy to take further input on what other information I could provide.

- Since when: in the 3.8 development cycle
- Graphics card: AMD Radeon RX 580 Series (POLARIS10, DRM 3.42.0, 5.15.12-arch1-1, LLVM 13.0.0) running the latest opencl-amd driver available on Arch (OpenCL 2.0 AMD-APP (3354.7))
- Gnome 41.2 on Wayland
- Intel i5-7600K CPU
- Darktable compiled as Release build (also happens in the Arch 3.8.0 build)

piratenpanda commented 2 years ago

Seems the latest RCD related changes were in the 3.6 cycle

jenshannoschwalm commented 2 years ago

Difficult stuff.

ATM I can't reproduce this at all. There have been NaN issues, but as far as I remember they were not related to RCD. I also read through the RCD-related kernels and found no obvious div-by-zero problem.

Please at least check on dt 3.8 master, and tell me your graphics and system memory and the green smoothing setting you are using.

Also, if you compile yourself, you could disable OpenCL for RCD demosaicing and see if this resolves the issue.

In line 5589 in iop/demosaic.c

    case DT_IOP_DEMOSAIC_RCD:
      piece->process_cl_ready = 1;
      break;

set the enable flag (process_cl_ready) to 0.

Also - raw and xmp files might be helpful

piratenpanda commented 2 years ago

So I built the latest master with the change you suggested. With OpenCL off for RCD I can't reproduce the issue at all. With it on, it happens almost immediately for some but not all images.

The green blocks look like this: [image: green_blocks]

CR3 file plus XMP file here: https://pandainthecloud.de/nextcloud/index.php/s/figyiiqcM6b7LSy

Graphics card info:

Extended renderer info (GLX_MESA_query_renderer):
    Vendor: AMD (0x1002)
    Device: AMD Radeon RX 580 Series (POLARIS10, DRM 3.42.0, 5.15.12-arch1-1, LLVM 13.0.0) (0x67df)
    Version: 21.3.3
    Accelerated: yes
    Video memory: 8192MB
    Unified memory: no
    Preferred profile: core (0x1)
    Max core profile version: 4.6
    Max compat profile version: 4.6

and 32 GB of RAM

jenshannoschwalm commented 2 years ago

Only with cr3 files?

piratenpanda commented 2 years ago

I have had this happen with CR2s as well. I'll see if I can find one I can share.

piratenpanda commented 2 years ago

Not a particularly good image but it suffers a lot from the outer border artifacts for the fast demosaicer as well as for RCD: https://pandainthecloud.de/nextcloud/index.php/s/pWkWxFEiaabzs75

Green artifacts: [image: green_again]

Quick demosaicer artifacts: [image: quick]

jenshannoschwalm commented 2 years ago

The images you sent show border artefacts. Yes - RCD and PPG both use basically the same demosaicer for the border regions. (BTW all demosaicers use some special treatment of the border region)

But is this really related to the crash you reported first?

I tried your image and xmp, here on nvidia opencl there is nothing wrong.

piratenpanda commented 2 years ago

I don't know if they are related. I thought NaNs cause those artifacts but obviously I was wrong.

So it might be a driver issue after all? Do we have a place to report to AMD? It's crashing very consistently for me

jenshannoschwalm commented 2 years ago

Those border artefacts are not NaN problems. In PPG you should see artefacts in the outermost 3 pixels, in RCD it might be up to 6 pixels.

I looked through the OpenCL kernel code for RCD but can't find any problematic region at all. We suspected such problems a while ago in the CPU code path, but those problems turned out to be elsewhere.

Yes - it might be a driver issue. Really worrying are the black "v" looking artefacts. Those could be NaN.

Could you

There have been quite a number of AMD driver issues around. I have no idea how to specifically track those ...

piratenpanda commented 2 years ago

This is the stacktrace I got for the latest crash: https://www.toptal.com/developers/hastebin/elagijalow.yaml

Also some more in https://github.com/darktable-org/darktable/issues/10082

About those artifacts I don't know how to log them in a meaningful way.
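
One possible way to log them, as a minimal sketch only: dump the demosaiced buffer and count the pixels that contain a non-finite channel, split into border and interior. The names `nan_report` and `scan_nonfinite` are hypothetical and not part of darktable, and the buffer is assumed to be interleaved float RGBA. A non-zero interior count would point at real NaNs rather than at the border artefacts.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical summary of where non-finite values were found. */
typedef struct
{
  size_t count;        /* pixels with at least one NaN/Inf colour channel */
  size_t border_count; /* how many of those lie within `border` pixels of an edge */
} nan_report;

/* Scan a width*height buffer of interleaved float RGBA pixels (assumed layout).
   Only the three colour channels are checked; the fourth channel is ignored. */
static nan_report scan_nonfinite(const float *buf, int width, int height, int border)
{
  assert(buf != NULL && width > 0 && height > 0 && border >= 0);
  nan_report r = { 0, 0 };
  for(int y = 0; y < height; y++)
  {
    for(int x = 0; x < width; x++)
    {
      const float *px = buf + 4 * ((size_t)y * (size_t)width + (size_t)x);
      if(isfinite(px[0]) && isfinite(px[1]) && isfinite(px[2])) continue;
      r.count++;
      if(x < border || y < border || x >= width - border || y >= height - border)
        r.border_count++;
    }
  }
  return r;
}
```

Calling it with border set to 3 for PPG or 6 for RCD matches the border widths mentioned earlier in this thread.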

jenshannoschwalm commented 2 years ago

I will have a deep look into this next week.

jenshannoschwalm commented 2 years ago

Just to make sure about a very vague idea - Did you switch on dual demosaicing or the details mask before a crash?

BTW in the mentioned amd issue 1654 you mentioned the specific driver version. Is this issue related to that special driver? Have you tried other drivers? (I know there are several around ...)

piratenpanda commented 2 years ago

Just surface blur and a parametric mask set to chromaticity is all I did. Usually I remove shapes from the mask, but this crashes even without shapes. The linked files are in the state which makes them crash after a while of zooming and panning. Sometimes it crashes when just loading the image.

I will test if it crashes without a mask.

I always run the latest opencl-amd package from the AUR. I don't know how to install an older one. I could take a look at how they prepare the PKGBUILD, but it involves a lot of deb files and extracting only the needed bits, so I am not sure if I could install an older one.

sofiageo commented 2 years ago

Hi,

I'm the current maintainer of the opencl-amd package. Every once in a while I Google whether there are issues with the package, and that is how I stumbled upon your issue. I tried your file on my AMD 5700XT with Darktable 3.8: the default demosaic (RCD) creates these green artifacts but doesn't crash. Changing the demosaicing method to AMaZE works fine, with no green artifacts. I don't know how to enable/disable OpenCL in darktable, but I think it's being used anyway.

If you want to see if a previous version of the opencl-amd package works better for your GPU, you can download the PKGBUILD from here, remove the current opencl-amd, and makepkg -s the PKGBUILD you just downloaded.

Edit: by the way, disabling and re-enabling surface blur seems to make the green artifacts disappear, so it works fine with RCD too. I have no idea how to use this software, so if I'm saying anything stupid please correct me :)

piratenpanda commented 2 years ago

Might be a Polaris issue. Thank you very much for the PKGBUILD history, I'll try some older version to see if one works fine.

jenshannoschwalm commented 2 years ago

Oh, thanks for stepping in. I started thinking about this issue just today and I think you are right. The green border artefacts simply shouldn't be there! As described, they seem to be visible whether PPG or RCD is used, and this is important because both demosaicers use the same code for border interpolation :-) So we should track down that code section and see if the crashes go away after a proper fix :-) We will track this down and report back if there is anything that might be of value for you.

jenshannoschwalm commented 2 years ago

@piratenpanda can you confirm the green border issues without any fancy masking stuff?

piratenpanda commented 2 years ago

Yes, completely without any masks:

[image: screenshot from 2022-01-06 19-34-41]

piratenpanda commented 2 years ago

So I downgraded to 19.50, the version just before the note "Updated to latest version, which may or may not work properly on Polaris GPUs", and so far I have not been able to reproduce the crash. All the later versions I tried show the crash.

jenshannoschwalm commented 2 years ago

Here is a first idea you should check. In data/kernels/demosaic_rcd.cl you will find the rcd_border_green and rcd_border_redblue functions. Each writes back to the image buffer at the very end, and we should ensure that only values >= 0 are written. Could you try the following for the green border

  color.y = fmax(0.0f, color.y);
  write_imagef (out, (int2)(x, y), color);

and for the redblue border

  color.x = fmax(0.0f, color.x);
  color.z = fmax(0.0f, color.z);
  write_imagef (out, (int2)(x, y), color);

jenshannoschwalm commented 2 years ago

Aah, didn't see your latest comment. That would be simply a driver issue. What about the green borders?

piratenpanda commented 2 years ago

So far no more color artifacts at the border, but I need to do some more testing. The black PPG blocks still show while zooming, though.

jenshannoschwalm commented 2 years ago

While zooming would be a different algo ...

piratenpanda commented 2 years ago

So it seems in RCD the blocks are also there now but much smaller: [image: screenshot from 2022-01-06 20-03-43]

> While zooming would be a different algo ...

I thought quick demosaicing is PPG? I misunderstood you then, it seems.

jenshannoschwalm commented 2 years ago

No, while zooming in there is more involved.

piratenpanda commented 2 years ago

OK, so the blocks and "V" shapes above are showing while zooming, just to avoid further misunderstandings.

piratenpanda commented 2 years ago

Looks like driver issues to me then to be honest. Different artifacts for different driver versions are strange

sofiageo commented 2 years ago

I don't know if that helps but I can't replicate the green artifacts. I tried to delete the darktable config directory, tried to reboot, deleted the .xmp file, nothing. I only saw the green artifacts the first time I opened the image. I'm not sure if it's related, because my PC is also running other software, but near the time I opened the image this driver error occurred (I've removed some lines to make it more readable):

kernel: ------------[ cut here ]------------
kernel: WARNING: CPU: 6 PID: 23520 at drivers/gpu/drm/ttm/ttm_bo.c:409 ttm_bo_release+0x2da/0x300 [ttm]
kernel: Workqueue: kfd_process_wq kfd_process_wq_release [amdgpu]
kernel: RIP: 0010:ttm_bo_release+0x2da/0x300 [ttm]
kernel: Call Trace:
kernel:  <TASK>
kernel:  amdgpu_bo_unref+0x1a/0x30 [amdgpu a0d05b6b93a668719d5e3cb4650a86f14d07a684]
kernel:  amdgpu_gem_object_free+0x30/0x50 [amdgpu a0d05b6b93a668719d5e3cb4650a86f14d07a684]
kernel:  amdgpu_amdkfd_gpuvm_free_memory_of_gpu+0x364/0x3d0 [amdgpu a0d05b6b93a668719d5e3cb4650a86f14d07a684]
kernel:  kfd_process_device_free_bos+0x9f/0xf0 [amdgpu a0d05b6b93a668719d5e3cb4650a86f14d07a684]
kernel:  kfd_process_wq_release+0x20d/0x2e0 [amdgpu a0d05b6b93a668719d5e3cb4650a86f14d07a684]
kernel:  process_one_work+0x1e8/0x3c0
kernel:  worker_thread+0x50/0x3c0
kernel:  ? process_one_work+0x3c0/0x3c0
kernel:  kthread+0x132/0x160
kernel:  ? set_kthread_struct+0x50/0x50
kernel:  ret_from_fork+0x22/0x30
kernel:  </TASK>
kernel: ---[ end trace 34b8ec6dd0e109ea ]---

I also see that the .xmp file that is being modified is much different. Yours has 17 history steps but mine 11. But I don't think it's related. You could try to find similar stack traces with journalctl -b0 -k

jenshannoschwalm commented 2 years ago

That would hint at some sort of memory management problem.

piratenpanda commented 2 years ago

I tried the latest version again to check. I also see:

kernel: WARNING: CPU: 2 PID: 21104 at drivers/gpu/drm/ttm/ttm_bo.c:409 ttm_bo_release+0x2da/0x300 [ttm]
kernel: Modules linked in: snd_seq_dummy snd_seq uas usb_storage nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs uv>
kernel:  x_tables hid_logitech_hidpp hid_logitech_dj usbhid crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd xhci_pci xhci_pci>
kernel: CPU: 2 PID: 21104 Comm: kworker/2:1 Not tainted 5.15.13-arch1-1 #1 51d00698bfdb139ecff7a73f09034830de5a04f4
kernel: Hardware name: Gigabyte Technology Co., Ltd. H270M-DS3H/H270M-DS3H-CF, BIOS F8d 03/09/2018
kernel: Workqueue: kfd_process_wq kfd_process_wq_release [amdgpu]
kernel: RIP: 0010:ttm_bo_release+0x2da/0x300 [ttm]
kernel: Code: e8 9b 6c 7d f6 e9 c1 fd ff ff 49 8b 7e 98 b9 28 23 00 00 31 d2 be 01 00 00 00 e8 f1 8f 7d f6 49 8b 46 e8 eb 9e 48 89 e8 eb 99 <0f> 0b e9 47 fd f>
kernel: RSP: 0018:ffffbe6f02a03cc0 EFLAGS: 00010202
kernel: RAX: 0000000000000001 RBX: ffffbe6f02a03d08 RCX: 0000000000000000
kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff9675dc2329b8
kernel: RBP: ffff967392e05270 R08: ffff9675dc2329b8 R09: 0000000000000000
kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff9674ba95da30
kernel: R13: ffff9675dc232858 R14: ffff9675dc2329b8 R15: dead000000000100
kernel: FS:  0000000000000000(0000) GS:ffff967a8ed00000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 000055b83804c4e8 CR3: 00000005d2e10006 CR4: 00000000003706e0
kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
kernel: Call Trace:
kernel:  <TASK>
kernel:  amdgpu_bo_unref+0x1a/0x30 [amdgpu 431cbfe15e135bf6dbdd5236fab7a1247d7dce5b]
kernel:  amdgpu_gem_object_free+0x30/0x50 [amdgpu 431cbfe15e135bf6dbdd5236fab7a1247d7dce5b]
kernel:  amdgpu_amdkfd_gpuvm_free_memory_of_gpu+0x364/0x3d0 [amdgpu 431cbfe15e135bf6dbdd5236fab7a1247d7dce5b]
kernel:  kfd_process_device_free_bos+0x9f/0xf0 [amdgpu 431cbfe15e135bf6dbdd5236fab7a1247d7dce5b]
kernel:  kfd_process_wq_release+0x20d/0x2e0 [amdgpu 431cbfe15e135bf6dbdd5236fab7a1247d7dce5b]
kernel:  process_one_work+0x1e8/0x3c0
kernel:  worker_thread+0x50/0x3c0
kernel:  ? process_one_work+0x3c0/0x3c0
kernel:  kthread+0x132/0x160
kernel:  ? set_kthread_struct+0x50/0x50
kernel:  ret_from_fork+0x22/0x30
kernel:  </TASK>
kernel: ---[ end trace 502f44bede71cd07 ]---

jenshannoschwalm commented 2 years ago

Hi, maybe we can continue via email and report back here after we have found the real problem? (hanno@schwalm-bremen.de)

Anyway, atm my best guess for the problem would be the RCD border handling. Would you please try the following in src/iop/demosaic.c and report back your findings:

  1. In line 3431 change myborder to 6
  2. In line 3434 change &dev_tmp to &dev_aux
  3. Comment out the code between lines 3442 and 3483

If my hypothesis holds, this should give you slightly worse output in the outermost 6 lines but should behave stably.

TurboGit commented 2 years ago

@piratenpanda : Can you test https://github.com/darktable-org/darktable/pull/10841 and report back? If all goes well, I'll then merge it for 3.8.1. TIA.

piratenpanda commented 2 years ago

The PR was created after I had already tested the changes. Sorry for not being clearer. I could not reproduce the bug anymore, so I think it's fine to merge.

TurboGit commented 2 years ago

@piratenpanda : Thanks for the feedback, you were probably clear, but I do too many things I suppose and can get confused at some point.