darktable-org / darktable

darktable is an open source photography workflow application and raw developer
https://www.darktable.org
GNU General Public License v3.0

Random crash when using darktable with OpenCL enabled with AMD 6750XT GPU #15980

Closed talmholt closed 10 months ago

talmholt commented 10 months ago

Describe the bug

I have an HP840 with dual Xeon E5-2667 v3 CPUs and an RX 6750 XT, and I spent the majority of a day getting the drivers to work. I had to reinstall Ubuntu multiple times: the "--no-dkms" option is extremely important to include, because leaving it out will break your system.


I use Ubuntu 23.10 (downloaded December 2023). I used the guide below:

https://amdgpu-install.readthedocs.io/en/latest/install-installing.html

This script:

https://repo.radeon.com/amdgpu-install/5.7.2/ubuntu/jammy/

These options:

amdgpu-install -y --accept-eula --usecase=rocmdev,opencl --opencl=rocr --opengl=mesa --vulkan=amdvlk --no-dkms

In general it works well and I get a very good speedup, but it randomly freezes, and on some occasions the entire screen freezes (the computer actually keeps running, but all monitors are frozen). Other times it just errors out with a segmentation fault.

The best way to replicate this is to have darktable regenerate a large cache of previews using "darktable-generate-cache". My installation will crash after 5-10K images have been worked on. However, if I start again, the images that caused the fault will render just fine.

Last time it crashed it provided a lot of debug information on the terminal and I have copied it into a text file for you to refer to. (please see attached txt file)

Regards, Thomas

Steps to reproduce

The best way to replicate this is to have darktable regenerate a large cache of previews using "darktable-generate-cache". My installation will crash after 5-10K images have been worked on. However, if I start again, the images that caused the fault will render just fine. So this is not a repeatable fault.
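A minimal sketch of how the cache regeneration could be split into image-id batches with one log per batch, so that a crash can at least be tied to a range of images. The batch size, the log file names, and the GENERATOR override are assumptions for illustration; --min-imgid, --max-imgid and --core are options of darktable-generate-cache, and "-d opencl" is passed through to the darktable core.

```shell
#!/bin/sh
# Sketch only: batch size, log names and the GENERATOR override are assumptions.

# Print "first last" image-id pairs covering 1..total in steps of batch_size.
batch_ranges() {
    total=$1
    batch_size=$2
    first=1
    while [ "$first" -le "$total" ]; do
        last=$((first + batch_size - 1))
        if [ "$last" -gt "$total" ]; then
            last=$total
        fi
        echo "$first $last"
        first=$((last + 1))
    done
}

# Regenerate the cache for one id range with OpenCL debug output enabled.
# Set GENERATOR=echo for a dry run that only prints the arguments.
run_batch() {
    ${GENERATOR:-darktable-generate-cache} \
        --min-imgid "$1" --max-imgid "$2" --core -d opencl
}

# Example loop (commented out so that sourcing this file runs nothing):
# batch_ranges 10000 1000 | while read -r first last; do
#     run_batch "$first" "$last" > "cache_${first}_${last}.log" 2>&1 \
#         || echo "batch $first-$last failed"
# done
```

Since the fault described here is not tied to specific images, the per-batch logs are mainly useful for seeing what the OpenCL pipeline was doing just before the crash.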

I also have to admit that this could be an AMD ROCm issue; I have not been able to make ROCm 6.0 work yet.

Expected behavior

No response

Logfile | Screenshot | Screencast

No response

Commit

No response

Where did you obtain darktable from?

downloaded from www.darktable.org

darktable version

4.7.0+66~gff894e7060

What OS are you using?

Linux

What is the version of your OS?

Ubuntu 23.10 (Fresh install with ROCM added)

Describe your system?

This is a dual-socket Xeon machine with two E5-2667 v3 CPUs at stock frequencies and an AMD 6750 XT GPU installed.

Are you using OpenCL GPU in darktable?

None

If yes, what is the GPU card and driver?

AMD 6750 XT, 12GB, using kernel drivers and with ROCm 5.7.2 installed (https://repo.radeon.com/amdgpu-install/5.7.2/ubuntu/jammy/)

Please provide additional context if applicable. You can attach files too, but might need to rename to .txt or .zip

This GDB supports auto.txt

talmholt commented 10 months ago

I had one more crash after a few more hours of work.

darktable_bt_449OG2.txt

jenshannoschwalm commented 10 months ago

In your later log it seems to be crashing inside GTK. It is pretty difficult to guess what happens here. Can you confirm that it happens only with OpenCL on?

You might also generate a more informative log with "darktable -d pipe -d expose"...
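A small sketch for capturing that debug output in a file that can be attached to the issue. The helper names, the DARKTABLE_BIN override, and the log path are assumptions; only the "-d pipe -d expose" flags come from the suggestion above.

```shell
#!/bin/sh
# Sketch only: helper names and the DARKTABLE_BIN override are assumptions.

# Turn a list of debug subsystems into repeated "-d <name>" arguments.
dt_debug_args() {
    args=""
    for subsystem in "$@"; do
        args="$args -d $subsystem"
    done
    # Strip the leading space left by the loop.
    echo "${args# }"
}

# Start darktable with the given debug subsystems, showing the output on the
# terminal and keeping a copy in the given log file.
dt_run_logged() {
    logfile=$1
    shift
    # The unquoted substitution is intentional: each flag must become
    # its own argument.
    ${DARKTABLE_BIN:-darktable} $(dt_debug_args "$@") 2>&1 | tee "$logfile"
}

# Example (commented out so that sourcing this file starts nothing):
# dt_run_logged "$HOME/darktable-debug.log" pipe expose
```

Keeping stderr in the same file matters here, because a segmentation fault message would otherwise be lost from the log.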

talmholt commented 10 months ago

Yes, if I disable OpenCL inside darktable (the option is available in the processing tab), then I do not have any crashes when doing the exact same tasks. darktable is really good at using all the cores of the Xeons, so it is almost as fast as with OpenCL enabled.

I have seen the crash when working with the retouch module or other modules that involve significant amounts of processing, so maybe it is some timing-related bug; I am not sure.

BTW: I am a huge fan of darktable and the work you guys do; my hope is to help find small bugs and improve the overall product.

I will also try to upgrade to ROCm 6.0 and see if that fixes the issues, but with it I was actually having trouble getting the PC to boot (Ubuntu would crash on startup). I will also buy a newer-generation AMD GPU (my plan is an RX 7800) and see if the combination of a new-generation GPU and ROCm does the trick.

jenshannoschwalm commented 10 months ago

Anything new to report?

talmholt commented 10 months ago

No, let's close the ticket until I get the new hardware; I will report back then.

talmholt commented 9 months ago

I have an update and good news...

Before, I was using Ubuntu 23.10, which is not officially supported, so I went ahead and tried 22.04.3 LTS, which is the latest version of Ubuntu officially supported by AMD. I let Ubuntu update to the latest kernel version, 6.5.15 (which has native support for the AMD GPUs).

Then I installed the brand-new version 6.0.0 of the ROCm enablement:

https://repo.radeon.com/amdgpu-install/6.0.0/ubuntu/jammy/

I used the command below. The "--no-dkms" option matters because it stops the installation of the AMD driver into the kernel; the driver is already there in the new 6.5.15 kernel, so installing it again via DKMS actually does something bad.

amdgpu-install -y --accept-eula --usecase=rocmdev,opencl --opencl=rocr --opengl=mesa --no-dkms

Remember to add your user to the video and render groups:

sudo usermod -a -G video $LOGNAME
sudo usermod -a -G render $LOGNAME
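A short sketch for checking afterwards that both group memberships are in place. The function name is hypothetical; it only inspects a space-separated group list such as the output of "id -nG". Note that new group memberships only show up in a session started after the change, for example after logging in again or rebooting.

```shell
#!/bin/sh
# Sketch only: missing_groups is a hypothetical helper name.

# Print which of the groups "video" and "render" are absent from the
# space-separated group list passed as the first argument.
missing_groups() {
    have=" $1 "
    missing=""
    for group in video render; do
        case "$have" in
            *" $group "*) ;;                  # group is present
            *) missing="$missing $group" ;;   # group is absent
        esac
    done
    # Strip the leading space left by the loop.
    echo "${missing# }"
}

# Example (commented out so that sourcing this file prints nothing):
# missing_groups "$(id -nG)"
```

An empty result means the current session already has both groups.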

Rebooted the system.

Installed a brand-new version of darktable compiled from source (latest pull from yesterday).

And the machine does not crash... Sorry for raising the bug... but now you can point AMD users to this ticket and they can get it to work.

This ticket can be considered resolved.

jenshannoschwalm commented 9 months ago

Thanks a lot for the feedback!