Closed: piratenpanda closed this issue 2 years ago
Seems the latest RCD related changes were in the 3.6 cycle
Difficult stuff.
At the moment I can't reproduce this at all. There have been NaN issues, but as far as I remember they were not related to RCD. I also read through the RCD-related kernels and found no obvious division-by-zero problem.
Please at least check on dt 3.8 master, and tell us your graphics and system memory and the green smoothing setting you are using.
Also if you compile yourself you could disable opencl for rcd demosaicing and see if this resolves the issue.
In line 5589 in iop/demosaic.c
case DT_IOP_DEMOSAIC_RCD:
piece->process_cl_ready = 1;
break;
set the enable flag to 0.
Also - raw and xmp files might be helpful
So I built latest master with the change you suggested. With opencl off for RCD I can't reproduce the issue at all, with it on it almost immediately happens for some but not all images.
The green blocks look like this:
CR3 file plus XMP file here: https://pandainthecloud.de/nextcloud/index.php/s/figyiiqcM6b7LSy
Graphics card info:
Extended renderer info (GLX_MESA_query_renderer):
Vendor: AMD (0x1002)
Device: AMD Radeon RX 580 Series (POLARIS10, DRM 3.42.0, 5.15.12-arch1-1, LLVM 13.0.0) (0x67df)
Version: 21.3.3
Accelerated: yes
Video memory: 8192MB
Unified memory: no
Preferred profile: core (0x1)
Max core profile version: 4.6
Max compat profile version: 4.6
and 32 GB of RAM
Only with cr3 files?
Had this happen with cr2s also. I can take a look if I can find one I can share
Not a particularly good image but it suffers a lot from the outer border artifacts for the fast demosaicer as well as for RCD: https://pandainthecloud.de/nextcloud/index.php/s/pWkWxFEiaabzs75
green artifacts:
Quick demosaicer artifacts:
The images you sent show border artefacts. Yes - RCD and PPG both use basically the same demosaicer for the border regions. (BTW all demosaicers use some special treatment of the border region)
But is this really related to the crash you reported first?
I tried your image and xmp, here on nvidia opencl there is nothing wrong.
I don't know if they are related. I thought NaNs cause those artifacts but obviously I was wrong.
So it might be a driver issue after all? Do we have a place to report to AMD? It's crashing very consistently for me
Those border artefacts are no NaN problems. In PPG you should see artefacts in the outermost 3 pixels, in RCD it might be up to 6 pixels.
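To make the border widths above concrete, here is a minimal sketch of a test for whether a pixel falls into the region handled by the border interpolation. The widths 3 (PPG) and 6 (RCD) come from the comment above; the function and constant names are hypothetical and are not taken from darktable's source.

```c
#include <assert.h>
#include <stdbool.h>

/* Border widths as quoted in the thread; the names are made up for this sketch. */
enum { PPG_BORDER = 3, RCD_BORDER = 6 };

/* Returns true if pixel (x, y) lies within `border` pixels of any image edge. */
bool in_border(int x, int y, int width, int height, int border)
{
  assert(width > 0 && height > 0 && border >= 0);
  return x < border || y < border
      || x >= width - border || y >= height - border;
}
```

With this, an artifact at column 5 of a 100 pixel wide image would count as a border artifact for RCD but not for PPG, which may help when deciding whether a visible block is a border effect or something else.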
I looked through the OpenCL kernel code for RCD but can't find any problematic region at all. We suspected such problems a while ago in the CPU code path, but those problems turned out to be elsewhere.
Yes - it might be a driver issue. Really worrying are the black "v" looking artefacts. Those could be NaN.
Could you
There have been quite a number of AMD driver issues around. I have no idea how to specifically track those ...
This is the stacktrace I got for the latest crash: https://www.toptal.com/developers/hastebin/elagijalow.yaml
Also some more in https://github.com/darktable-org/darktable/issues/10082
About those artifacts I don't know how to log them in a meaningful way.
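One possible way to log such artifacts in a meaningful way is to scan a float buffer for NaNs and negative values and report where the first one sits. The sketch below assumes a plain array of floats that has already been copied back from the device; the type and function names are hypothetical and this is not darktable's API.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Summary of suspicious values found in a float buffer (hypothetical type). */
typedef struct
{
  size_t nan_count;      /* number of NaN entries */
  size_t negative_count; /* number of entries below zero */
  size_t first_bad;      /* index of first NaN or negative entry, n if none */
} buffer_report;

/* Walks the buffer once and counts NaNs and negative values. */
buffer_report scan_buffer(const float *buf, size_t n)
{
  assert(buf != NULL || n == 0);
  buffer_report r = { 0, 0, n };
  for(size_t i = 0; i < n; i++)
  {
    const int is_nan = isnan(buf[i]);
    const int is_neg = !is_nan && buf[i] < 0.0f;
    if(is_nan) r.nan_count++;
    if(is_neg) r.negative_count++;
    if((is_nan || is_neg) && r.first_bad == n) r.first_bad = i;
  }
  return r;
}
```

If the buffer holds 4 floats per pixel in row-major order (an assumption), the pixel position of the first bad value would be `(first_bad / 4) % width` for x and `(first_bad / 4) / width` for y.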
I will have a deep look into this next week.
Just to make sure about a very vague idea - Did you switch on dual demosaicing or the details mask before a crash?
BTW in the mentioned amd issue 1654 you mentioned the specific driver version. Is this issue related to that special driver? Have you tried other drivers? (I know there are several around ...)
Just surface blur and a parametric mask set to chromaticity is all I did. Usually I remove shapes from the mask but this crashes even without shapes. The linked files are at the state which makes them crash after a while of zooming and panning. Sometimes it crashes when just loading the image.
I will test if it crashes without a mask.
I always run the latest opencl-amd package from AUR. I don't know how to install an older one. I could take a look at how they prepare the PKGBUILD, but it involves a lot of deb files and extracting only the needed bits, so I am not sure if I could install an older one.
Hi,
I'm the current maintainer of the opencl-amd package. I try to Google every once in a while if there are issues with the package, and I stumbled upon your issue. I tried your file on my AMD 5700XT and Darktable 3.8: the default demosaic (RCD) creates these green artifacts but doesn't crash. Changing the demosaicing to the AMaZE method works fine, no green artifacts. I don't know how to enable/disable OpenCL in darktable, but I think it's being used anyway.
If you want to see if a previous version of the opencl-amd package works better for your GPU, you can download the PKGBUILD from here, remove the current opencl-amd, and makepkg -s the PKGBUILD you just downloaded.
edit: by the way, disabling / enabling surface blur seems to make the green artifacts disappear, so it works fine with RCD too. I have no idea how to use this software so if I'm saying anything stupid please correct me :)
Might be a Polaris issue. Thank you very much for the PKGBUILD history, I'll try some older version to see if one works fine.
Oh, thanks for stepping in. I started thinking about this issue just today and I think you are right. The green border artefacts simply shouldn't be there! As described, they seem to be visible whether PPG or RCD is used, and this is important as both demosaicers use the same code for border interpolation :-) So we should track that code section and see if the crashes go away after a proper fix :-) We will track this down and report back if there is anything that might be of value for you.
@piratenpanda can you confirm the green border issues without any fancy masking stuff?
yes, completely without any masks:
So I downgraded to 19.50, the one before the note "Updated to latest version, which may or may not work properly on Polaris GPUs" and so far I have not been able to reproduce the crash. All versions above I tried show the crash.
Here is a first idea you should check. In data/kernels/demosaic_rcd.cl you will find the rcd_border_green and rcd_border_redblue functions. Each writes back to the image buffer at the very end, and we should ensure that only values >= 0 are written. Could you try this for the green border
color.y = fmax(0.0f, color.y);
write_imagef (out, (int2)(x, y), color);
and for the redblue border
color.x = fmax(0.0f, color.x);
color.z = fmax(0.0f, color.z);
write_imagef (out, (int2)(x, y), color);
Aah, didn't see your latest comment. That would be simply a driver issue. What about the green borders?
So far no more color artifacts at the border but I need to do some more testing. The black PPG blocks still show though while zooming
While zooming would be a different algo ...
So it seems in RCD the blocks are also there now but much smaller:
I thought quick demosaicing is PPG? I misunderstood you then, it seems
No, while zooming in there is more involved
ok so then the blocks and "V" shapes above are showing while zooming just to avoid further misunderstandings
Looks like driver issues to me then to be honest. Different artifacts for different driver versions are strange
I don't know if that helps but I can't replicate the green artifacts. I tried to delete the darktable config directory, tried to reboot, deleted the .xmp file, nothing. I only saw the green artifacts the first time I opened the image. I'm not sure if it's related, because my PC is also running other software, but near the time I opened the image this driver error occurred (I've removed some lines to make it more readable):
kernel: ------------[ cut here ]------------
kernel: WARNING: CPU: 6 PID: 23520 at drivers/gpu/drm/ttm/ttm_bo.c:409 ttm_bo_release+0x2da/0x300 [ttm]
kernel: Workqueue: kfd_process_wq kfd_process_wq_release [amdgpu]
kernel: RIP: 0010:ttm_bo_release+0x2da/0x300 [ttm]
kernel: Call Trace:
kernel: <TASK>
kernel: amdgpu_bo_unref+0x1a/0x30 [amdgpu a0d05b6b93a668719d5e3cb4650a86f14d07a684]
kernel: amdgpu_gem_object_free+0x30/0x50 [amdgpu a0d05b6b93a668719d5e3cb4650a86f14d07a684]
kernel: amdgpu_amdkfd_gpuvm_free_memory_of_gpu+0x364/0x3d0 [amdgpu a0d05b6b93a668719d5e3cb4650a86f14d07a684]
kernel: kfd_process_device_free_bos+0x9f/0xf0 [amdgpu a0d05b6b93a668719d5e3cb4650a86f14d07a684]
kernel: kfd_process_wq_release+0x20d/0x2e0 [amdgpu a0d05b6b93a668719d5e3cb4650a86f14d07a684]
kernel: process_one_work+0x1e8/0x3c0
kernel: worker_thread+0x50/0x3c0
kernel: ? process_one_work+0x3c0/0x3c0
kernel: kthread+0x132/0x160
kernel: ? set_kthread_struct+0x50/0x50
kernel: ret_from_fork+0x22/0x30
kernel: </TASK>
kernel: ---[ end trace 34b8ec6dd0e109ea ]---
I also see that the .xmp file that is being modified is much different. Yours has 17 history steps but mine 11. But I don't think it's related. You could try to find similar stack traces with journalctl -b0 -k
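When comparing kernel logs across driver versions, it can help to filter for just the warning line seen above. The following is a small sketch of such a filter; the function names are hypothetical, and the two search strings are taken from the trace quoted in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* True if a log line is a kernel WARNING that mentions ttm_bo_release. */
bool is_ttm_warning(const char *line)
{
  assert(line != NULL);
  return strstr(line, "WARNING:") != NULL
      && strstr(line, "ttm_bo_release") != NULL;
}

/* Counts matching lines in a stream; lines longer than the buffer are
   examined in chunks, which is good enough for a rough count. */
int count_ttm_warnings(FILE *in)
{
  assert(in != NULL);
  char line[4096];
  int hits = 0;
  while(fgets(line, sizeof line, in) != NULL)
    if(is_ttm_warning(line)) hits++;
  return hits;
}
```

A small `main` that calls `count_ttm_warnings(stdin)` could then be fed the output of `journalctl -b0 -k` through a pipe.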
That would hint at some sort of memory management problem.
I tried the latest version again to check. I also see:
kernel: WARNING: CPU: 2 PID: 21104 at drivers/gpu/drm/ttm/ttm_bo.c:409 ttm_bo_release+0x2da/0x300 [ttm]
kernel: Modules linked in: snd_seq_dummy snd_seq uas usb_storage nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs uv>
kernel: x_tables hid_logitech_hidpp hid_logitech_dj usbhid crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd xhci_pci xhci_pci>
kernel: CPU: 2 PID: 21104 Comm: kworker/2:1 Not tainted 5.15.13-arch1-1 #1 51d00698bfdb139ecff7a73f09034830de5a04f4
kernel: Hardware name: Gigabyte Technology Co., Ltd. H270M-DS3H/H270M-DS3H-CF, BIOS F8d 03/09/2018
kernel: Workqueue: kfd_process_wq kfd_process_wq_release [amdgpu]
kernel: RIP: 0010:ttm_bo_release+0x2da/0x300 [ttm]
kernel: Code: e8 9b 6c 7d f6 e9 c1 fd ff ff 49 8b 7e 98 b9 28 23 00 00 31 d2 be 01 00 00 00 e8 f1 8f 7d f6 49 8b 46 e8 eb 9e 48 89 e8 eb 99 <0f> 0b e9 47 fd f>
kernel: RSP: 0018:ffffbe6f02a03cc0 EFLAGS: 00010202
kernel: RAX: 0000000000000001 RBX: ffffbe6f02a03d08 RCX: 0000000000000000
kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff9675dc2329b8
kernel: RBP: ffff967392e05270 R08: ffff9675dc2329b8 R09: 0000000000000000
kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff9674ba95da30
kernel: R13: ffff9675dc232858 R14: ffff9675dc2329b8 R15: dead000000000100
kernel: FS: 0000000000000000(0000) GS:ffff967a8ed00000(0000) knlGS:0000000000000000
kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 000055b83804c4e8 CR3: 00000005d2e10006 CR4: 00000000003706e0
kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
kernel: Call Trace:
kernel: <TASK>
kernel: amdgpu_bo_unref+0x1a/0x30 [amdgpu 431cbfe15e135bf6dbdd5236fab7a1247d7dce5b]
kernel: amdgpu_gem_object_free+0x30/0x50 [amdgpu 431cbfe15e135bf6dbdd5236fab7a1247d7dce5b]
kernel: amdgpu_amdkfd_gpuvm_free_memory_of_gpu+0x364/0x3d0 [amdgpu 431cbfe15e135bf6dbdd5236fab7a1247d7dce5b]
kernel: kfd_process_device_free_bos+0x9f/0xf0 [amdgpu 431cbfe15e135bf6dbdd5236fab7a1247d7dce5b]
kernel: kfd_process_wq_release+0x20d/0x2e0 [amdgpu 431cbfe15e135bf6dbdd5236fab7a1247d7dce5b]
kernel: process_one_work+0x1e8/0x3c0
kernel: worker_thread+0x50/0x3c0
kernel: ? process_one_work+0x3c0/0x3c0
kernel: kthread+0x132/0x160
kernel: ? set_kthread_struct+0x50/0x50
kernel: ret_from_fork+0x22/0x30
kernel: </TASK>
kernel: ---[ end trace 502f44bede71cd07 ]---
Hi, maybe continue via email and report here back after we found out the real problem? (hanno@schwalm-bremen.de)
Anyway, at the moment my best guess for the problem would be the RCD border handling. Would you please try and report back your findings, see src/iop/demosaic.c
If my hypothesis holds, this should give you slightly worse output in the outermost 6 lines but should behave stably.
@piratenpanda : Can you test https://github.com/darktable-org/darktable/pull/10841? And report if all goes well, I'll then merge for 3.8.1. TIA.
The PR was created with me having tested the changes already. Sorry for not being more clear. I could not reproduce the bug anymore so I think it's fine to merge
@piratenpanda : Thanks for the feedback, you were probably clear, but I do too many things I suppose and can get confused at some point.
I was suffering a lot of crashes when using the surface blur module lately. See https://github.com/darktable-org/darktable/issues/10082 for reference.
While it wasn't quite obvious where the issue came from it seems that NaNs cause the module to crash as per the last stacktrace and comment here: https://discuss.pixls.us/t/amd-opencl-problems-in-surface-blur-darktable-module/28507/12
I suspect the NaNs are coming from RCD (OpenCL version): when using AMaZE, I don't see green blocks at the border regions, and so far I also can't reproduce the crashes I had in the surface blur module.
This has been happening for a while now. I will try to bisect RCD changes to see if a specific commit introduced the bug, as I was able to use surface blur without issues earlier in the 3.8 development cycle. I am happy to receive further input on how to provide more helpful information.
Since when: In the 3.8 development cycle
Graphics card: AMD Radeon RX 580 Series (POLARIS10, DRM 3.42.0, 5.15.12-arch1-1, LLVM 13.0.0) running the latest opencl-amd driver available on Arch (OpenCL 2.0 AMD-APP (3354.7))
Gnome 41.2 on Wayland
Intel i5-7600K CPU
Darktable compiled as Release build (also happens in the Arch 3.8.0 build)