darktable-org / darktable

darktable is an open source photography workflow application and raw developer
https://www.darktable.org
GNU General Public License v3.0
9.66k stars 1.13k forks source link

Diffuse or Sharpen takes forever when using OpenCL #12411

Closed luator closed 2 years ago

luator commented 2 years ago

Did you buy darktable from an application store ? No.

Describe the bug/issue

When OpenCL is enabled, exporting an image seems to hang when the diffuse and sharpen module is used. It does not hang when disabling OpenCL. I get this behaviour since updating to version 4.0. I built darktable myself using the release-4.0.0 tag but observe the same issue when using the flatpak version.

To Reproduce For testing I used the RAW file from here with this xmp file (using the local contrast preset of diffuse and sharpen): lofoten_20210905-151506_01.arw.xmp.gz I also have the issue with other files, though, so I don't think it is specific to the file.

Without OpenCL:

/opt/darktable/bin/darktable-cli lofoten_20210905-151506.arw lofoten_20210905-151506_01.arw.xmp output.jpg --core -d perf --disable-opencl

Output:

1.400492 [dev] took 0.056 secs (0.116 CPU) to load the image.
1.479494 [export] creating pixelpipe took 0.074 secs (0.214 CPU)
1.485349 [dev_pixelpipe] took 0.006 secs (0.010 CPU) initing base buffer [export]
1.499755 [dev_pixelpipe] took 0.014 secs (0.040 CPU) processed `raw black/white point' on CPU, blended on CPU [export]
1.515087 [dev_pixelpipe] took 0.015 secs (0.039 CPU) processed `white balance' on CPU, blended on CPU [export]
1.529453 [dev_pixelpipe] took 0.014 secs (0.044 CPU) processed `highlight reconstruction' on CPU, blended on CPU [export]
1.766501 [dev_pixelpipe] took 0.237 secs (0.687 CPU) processed `demosaic' on CPU, blended on CPU [export]
1.819236 [dev_pixelpipe] took 0.053 secs (0.100 CPU) processed `exposure' on CPU, blended on CPU [export]
1.866179 [dev_pixelpipe] took 0.047 secs (0.160 CPU) processed `input color profile' on CPU, blended on CPU [export]
2.970447 [dev_pixelpipe] took 1.104 secs (4.364 CPU) processed `color calibration' on CPU, blended on CPU [export]
101.659930 [dev_pixelpipe] took 98.689 secs (387.800 CPU) processed `diffuse or sharpen' on CPU with tiling, blended on CPU [export]
103.584965 [dev_pixelpipe] took 1.925 secs (7.458 CPU) processed `filmic rgb' on CPU, blended on CPU [export]
104.844575 [dev_pixelpipe] took 1.260 secs (4.972 CPU) processed `output color profile' on CPU, blended on CPU [export]
104.921585 [dev_pixelpipe] took 0.077 secs (0.273 CPU) processed `display encoding' on CPU, blended on CPU [export]
104.921723 [dev_process_export] pixel pipeline processing took 103.442 secs (405.955 CPU)

With OpenCL:

/opt/darktable/bin/darktable-cli lofoten_20210905-151506.arw lofoten_20210905-151506_01.arw.xmp output.jpg --core -d perf

Output:

1.463741 [dev] took 0.057 secs (0.113 CPU) to load the image.
1.543896 [export] creating pixelpipe took 0.076 secs (0.214 CPU)
1.549727 [dev_pixelpipe] took 0.006 secs (0.015 CPU) initing base buffer [export]
1.560472 [dev_pixelpipe] took 0.011 secs (0.017 CPU) processed `raw black/white point' on GPU, blended on GPU [export]
1.563526 [dev_pixelpipe] took 0.003 secs (0.002 CPU) processed `white balance' on GPU, blended on GPU [export]
1.566132 [dev_pixelpipe] took 0.003 secs (0.000 CPU) processed `highlight reconstruction' on GPU, blended on GPU [export]
1.816522 [dev_pixelpipe] took 0.250 secs (0.188 CPU) processed `demosaic' on GPU with tiling, blended on CPU [export]
1.883753 [dev_pixelpipe] took 0.067 secs (0.067 CPU) processed `exposure' on GPU, blended on GPU [export]
1.896418 [dev_pixelpipe] took 0.013 secs (0.008 CPU) processed `input color profile' on GPU, blended on GPU [export]
1.931907 [dev_pixelpipe] took 0.035 secs (0.031 CPU) processed `color calibration' on GPU, blended on GPU [export]

Manually killed here after 10min.

In case it helps, here is the output of nvidia-smi while it was running:

Sun Aug  7 08:55:44 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 N/A |                  N/A |
| 40%   65C    P0    N/A /  N/A |   1286MiB /  1998MiB |     N/A      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Expected behavior It should not hang when using OpenCL.

Which commit introduced the error I did not yet have time to bisect the issue but can try to do so if needed.

Platform

Additional context

TurboGit commented 2 years ago

Our pixelpipe expert will probably add some more information.

To me this is maybe expected. Note that when you export you are handling lot more pixels than when viewing the image in the darkroom. If the card does not have enough memory to store the whole image and all the intermediate buffer needed by the diffuse module tiling will be used and this can be very slow indeed.

Do you confirm that you are exporting at max dimension? A test would be to export the image at say 1920x1920 and see if the speed is better.

luator commented 2 years ago

I see. I was indeed using max dimensions. Setting the dimensions to 1920x1920 in the export module didn't seem to help (but does this even scale down before the processing?), however using a smaller jpg as base image helped indeed (diffuse and sharpen finished in ~6s).

So it seems my (admittedly not very new) GPU simply has too little memory to be useful here :(. I'm just wondering, was there some change in 4.0 that made diffuse and sharpen so much more computationally expensive? With 3.8 it already takes 9-10 minutes on that picture but with 4.0 I once waited for around 45 minutes and it still didn't finish.

jenshannoschwalm commented 2 years ago
  1. You should first set the resource level to large
  2. To get more ideas what might go wrong use "-d tiling - d opencl" as parameters on the command line.

Probably there is very heavy tiling thus taking very long but it should finish.

luator commented 2 years ago
  1. You should first set the resource level to large

I'm not sure what you mean by that. Where/how can I do this?

I noticed this is an older Nvidia card, the GTX760. Is it 2G memory?

Yes.

jenshannoschwalm commented 2 years ago

You will find the setting in preferences->peocessing

luator commented 2 years ago

Thanks, I missed it when first checking there. Setting the resource level to large made a huge difference. Diffuse or sharpen still took a while but finished after ~15min (which is at least in a similar order of magnitude to what I measured with 3.8).

Here is the debug output:

130.730521 [dev] took 0.000 secs (0.000 CPU) to load the image.
130.835756 [export] creating pixelpipe took 0.082 secs (0.167 CPU)
130.835775 [opencl_update_enabled] enabled flag set to ON
130.835789 [pixelpipe_process] [export] using device 0
130.835800 [dt_opencl_check_device_available] use 1405MB (tunemem=OFF, pinning=OFF) on device `NVIDIA GeForce GTX 760' id=0
130.842520 [dev_pixelpipe] took 0.006 secs (0.009 CPU) initing base buffer [export]
130.858288 [dev_pixelpipe] took 0.016 secs (0.012 CPU) processed `raw black/white point' on GPU, blended on GPU [export]
130.861906 [dev_pixelpipe] took 0.004 secs (0.003 CPU) processed `white balance' on GPU, blended on GPU [export]
130.864612 [dev_pixelpipe] took 0.003 secs (0.003 CPU) processed `highlight reconstruction' on GPU, blended on GPU [export]
130.899969 [default_process_tiling_cl_roi] **** tiling module 'demosaic' for image with input size 5504x3672 --> 5496x3672
130.899987 [default_process_tiling_cl_roi] (2x1) tiles with max input dimensions 3576x3672, pinned=OFF, good 2752x3676
130.899991 [default_process_tiling_cl_roi] process tile (0,0) size 2768x3672 at origin [0,0]
131.015401 [default_process_tiling_cl_roi] process tile (1,0) size 2760x3672 at origin [2736,0]
131.121758 [dev_pixelpipe] took 0.257 secs (0.182 CPU) processed `demosaic' on GPU with tiling, blended on CPU [export]
131.189442 [dev_pixelpipe] took 0.068 secs (0.067 CPU) processed `exposure' on GPU, blended on GPU [export]
131.202063 [dev_pixelpipe] took 0.013 secs (0.012 CPU) processed `input color profile' on GPU, blended on GPU [export]
image colorspace transform Lab-->RGB took 0.011 secs (0.011 GPU) [channelmixerrgb ]
131.239222 [dev_pixelpipe] took 0.037 secs (0.031 CPU) processed `color calibration' on GPU, blended on GPU [export]
131.353431 [default_process_tiling_cl_ptp] **** tiling module 'diffuse' for image with size 5496x3672 --> 5496x3672
131.353447 [default_process_tiling_cl_ptp] (17x12) tiles with max dimensions 2376x2379, pinned=OFF, good 328x331 and overlap 1024
131.353451 [default_process_tiling_cl_ptp] tile (0,0) size 2376x2379 at origin [0,0]
147.307974 [default_process_tiling_cl_ptp] tile (0,1) size 2376x2379 at origin [0,331]
163.230850 [default_process_tiling_cl_ptp] tile (0,2) size 2376x2379 at origin [0,662]
179.155741 [default_process_tiling_cl_ptp] tile (0,3) size 2376x2379 at origin [0,993]
195.078801 [default_process_tiling_cl_ptp] tile (0,4) size 2376x2348 at origin [0,1324]
210.796637 [default_process_tiling_cl_ptp] tile (1,0) size 2376x2379 at origin [328,0]
226.719534 [default_process_tiling_cl_ptp] tile (1,1) size 2376x2379 at origin [328,331]
242.640080 [default_process_tiling_cl_ptp] tile (1,2) size 2376x2379 at origin [328,662]
258.558224 [default_process_tiling_cl_ptp] tile (1,3) size 2376x2379 at origin [328,993]
274.477347 [default_process_tiling_cl_ptp] tile (1,4) size 2376x2348 at origin [328,1324]
290.192486 [default_process_tiling_cl_ptp] tile (2,0) size 2376x2379 at origin [656,0]
306.115454 [default_process_tiling_cl_ptp] tile (2,1) size 2376x2379 at origin [656,331]
322.036501 [default_process_tiling_cl_ptp] tile (2,2) size 2376x2379 at origin [656,662]
337.957782 [default_process_tiling_cl_ptp] tile (2,3) size 2376x2379 at origin [656,993]
353.880283 [default_process_tiling_cl_ptp] tile (2,4) size 2376x2348 at origin [656,1324]
369.594907 [default_process_tiling_cl_ptp] tile (3,0) size 2376x2379 at origin [984,0]
385.517028 [default_process_tiling_cl_ptp] tile (3,1) size 2376x2379 at origin [984,331]
401.441439 [default_process_tiling_cl_ptp] tile (3,2) size 2376x2379 at origin [984,662]
417.364542 [default_process_tiling_cl_ptp] tile (3,3) size 2376x2379 at origin [984,993]
433.286730 [default_process_tiling_cl_ptp] tile (3,4) size 2376x2348 at origin [984,1324]
449.000854 [default_process_tiling_cl_ptp] tile (4,0) size 2376x2379 at origin [1312,0]
464.925717 [default_process_tiling_cl_ptp] tile (4,1) size 2376x2379 at origin [1312,331]
480.843204 [default_process_tiling_cl_ptp] tile (4,2) size 2376x2379 at origin [1312,662]
496.767652 [default_process_tiling_cl_ptp] tile (4,3) size 2376x2379 at origin [1312,993]
512.685088 [default_process_tiling_cl_ptp] tile (4,4) size 2376x2348 at origin [1312,1324]
528.399114 [default_process_tiling_cl_ptp] tile (5,0) size 2376x2379 at origin [1640,0]
544.322133 [default_process_tiling_cl_ptp] tile (5,1) size 2376x2379 at origin [1640,331]
560.241583 [default_process_tiling_cl_ptp] tile (5,2) size 2376x2379 at origin [1640,662]
576.161611 [default_process_tiling_cl_ptp] tile (5,3) size 2376x2379 at origin [1640,993]
592.079757 [default_process_tiling_cl_ptp] tile (5,4) size 2376x2348 at origin [1640,1324]
607.793011 [default_process_tiling_cl_ptp] tile (6,0) size 2376x2379 at origin [1968,0]
623.715992 [default_process_tiling_cl_ptp] tile (6,1) size 2376x2379 at origin [1968,331]
639.635342 [default_process_tiling_cl_ptp] tile (6,2) size 2376x2379 at origin [1968,662]
655.556533 [default_process_tiling_cl_ptp] tile (6,3) size 2376x2379 at origin [1968,993]
671.477504 [default_process_tiling_cl_ptp] tile (6,4) size 2376x2348 at origin [1968,1324]
687.193854 [default_process_tiling_cl_ptp] tile (7,0) size 2376x2379 at origin [2296,0]
703.116036 [default_process_tiling_cl_ptp] tile (7,1) size 2376x2379 at origin [2296,331]
719.039459 [default_process_tiling_cl_ptp] tile (7,2) size 2376x2379 at origin [2296,662]
734.963102 [default_process_tiling_cl_ptp] tile (7,3) size 2376x2379 at origin [2296,993]
750.883624 [default_process_tiling_cl_ptp] tile (7,4) size 2376x2348 at origin [2296,1324]
766.597446 [default_process_tiling_cl_ptp] tile (8,0) size 2376x2379 at origin [2624,0]
782.522746 [default_process_tiling_cl_ptp] tile (8,1) size 2376x2379 at origin [2624,331]
798.442095 [default_process_tiling_cl_ptp] tile (8,2) size 2376x2379 at origin [2624,662]
814.366831 [default_process_tiling_cl_ptp] tile (8,3) size 2376x2379 at origin [2624,993]
830.289975 [default_process_tiling_cl_ptp] tile (8,4) size 2376x2348 at origin [2624,1324]
846.002453 [default_process_tiling_cl_ptp] tile (9,0) size 2376x2379 at origin [2952,0]
861.926427 [default_process_tiling_cl_ptp] tile (9,1) size 2376x2379 at origin [2952,331]
877.849028 [default_process_tiling_cl_ptp] tile (9,2) size 2376x2379 at origin [2952,662]
893.768761 [default_process_tiling_cl_ptp] tile (9,3) size 2376x2379 at origin [2952,993]
909.690316 [default_process_tiling_cl_ptp] tile (9,4) size 2376x2348 at origin [2952,1324]
925.401721 [default_process_tiling_cl_ptp] tile (10,0) size 2216x2379 at origin [3280,0]
940.905129 [default_process_tiling_cl_ptp] tile (10,1) size 2216x2379 at origin [3280,331]
956.403453 [default_process_tiling_cl_ptp] tile (10,2) size 2216x2379 at origin [3280,662]
971.906654 [default_process_tiling_cl_ptp] tile (10,3) size 2216x2379 at origin [3280,993]
987.407746 [default_process_tiling_cl_ptp] tile (10,4) size 2216x2348 at origin [3280,1324]
1002.706208 [dev_pixelpipe] took 871.467 secs (873.504 CPU) processed `diffuse or sharpen' on GPU with tiling, blended on CPU [export]
1002.706543 [default_process_tiling_cl_ptp] **** tiling module 'filmicrgb' for image with size 5496x3672 --> 5496x3672
1002.706549 [default_process_tiling_cl_ptp] (4x1) tiles with max dimensions 2784x3672, pinned=OFF, good 1760x2648 and overlap 512
1002.706552 [default_process_tiling_cl_ptp] tile (0,0) size 2784x3672 at origin [0,0]
1003.665744 [default_process_tiling_cl_ptp] tile (1,0) size 2784x3672 at origin [1760,0]
1003.729648 [default_process_tiling_cl_ptp] tile (2,0) size 1976x3672 at origin [3520,0]
1003.775227 [dev_pixelpipe] took 1.069 secs (1.063 CPU) processed `filmic rgb' on GPU with tiling, blended on CPU [export]
image colorspace transform RGB-->Lab took 0.051 secs (0.192 CPU) [colorout ]
1005.028212 [dev_pixelpipe] took 1.253 secs (4.978 CPU) processed `output color profile' on CPU, blended on CPU [export]
1005.103720 [dev_pixelpipe] took 0.075 secs (0.266 CPU) processed `display encoding' on CPU, blended on CPU [export]
1005.106340 [opencl_profiling] profiling device 0 ('NVIDIA GeForce GTX 760'):
1005.106345 [opencl_profiling] spent  1.0045 seconds in [Write Image (from host to device)]
1005.106348 [opencl_profiling] spent  0.0022 seconds in rawprepare_1f
1005.106350 [opencl_profiling] spent  0.0023 seconds in whitebalance_1f
1005.106352 [opencl_profiling] spent  0.0021 seconds in highlights_1f_clip
1005.106354 [opencl_profiling] spent  0.6698 seconds in [Read Image (from device to host)]
1005.106356 [opencl_profiling] spent  0.0014 seconds in border_interpolate
1005.106358 [opencl_profiling] spent  0.0046 seconds in rcd_border_green
1005.106362 [opencl_profiling] spent  0.0048 seconds in rcd_border_redblue
1005.106365 [opencl_profiling] spent  0.0024 seconds in rcd_populate
1005.106377 [opencl_profiling] spent  0.0077 seconds in rcd_step_1_1
1005.106379 [opencl_profiling] spent  0.0036 seconds in rcd_step_1_2
1005.106381 [opencl_profiling] spent  0.0042 seconds in rcd_step_2_1
1005.106383 [opencl_profiling] spent  0.0126 seconds in rcd_step_3_1
1005.106399 [opencl_profiling] spent  0.0068 seconds in rcd_step_4_1
1005.106401 [opencl_profiling] spent  0.0017 seconds in rcd_step_4_2
1005.106404 [opencl_profiling] spent  0.0103 seconds in rcd_step_5_1
1005.106406 [opencl_profiling] spent  0.0149 seconds in rcd_step_5_2
1005.106409 [opencl_profiling] spent  0.0050 seconds in rcd_write_output
1005.106411 [opencl_profiling] spent  0.0105 seconds in exposure
1005.106413 [opencl_profiling] spent  0.0106 seconds in colorin_unbound
1005.106416 [opencl_profiling] spent  0.0889 seconds in [Copy Image (on device)]
1005.106419 [opencl_profiling] spent  0.0117 seconds in colorspaces_transform_lab_to_rgb_matrix
1005.106421 [opencl_profiling] spent  0.0109 seconds in channelmixerrgb_CAT16
1005.106424 [opencl_profiling] spent  9.6774 seconds in blur_2D_Bspline_horizontal
1005.106435 [opencl_profiling] spent 42.5157 seconds in blur_2D_Bspline_vertical
1005.106438 [opencl_profiling] spent 20.4500 seconds in wavelets_detail_level
1005.106441 [opencl_profiling] spent 797.8551 seconds in diffuse_pde
1005.106454 [opencl_profiling] spent  0.0039 seconds in filmic_mask_clipped_pixels
1005.106456 [opencl_profiling] spent  0.0047 seconds in filmic_inpaint_noise
1005.106459 [opencl_profiling] spent  0.0088 seconds in init_reconstruct
1005.106462 [opencl_profiling] spent  0.2225 seconds in wavelets_reconstruct
1005.106464 [opencl_profiling] spent  0.0039 seconds in compute_ratios
1005.106466 [opencl_profiling] spent  0.0037 seconds in restore_ratios
1005.106469 [opencl_profiling] spent  0.0223 seconds in filmicrgb_chroma
1005.106471 [opencl_profiling] spent 872.6617 seconds totally in command queue (with 0 events missing)
1005.106587 [dev_process_export] pixel pipeline processing took 874.271 secs (880.134 CPU)
[export_job] exported to `/home/felix/Bilder/darktable_issues/diffuse_and_sharpen_slow/jpg/lofoten_20210905-151506__with_lc_03.jpg'

Out of curiosity I checked again with the "default" resource level and there it used (115x72) tiles (instead of 17x12). I guess this explains the difference.

I do not fully understand the output, though. As it said "(17x12) tiles", I assumed it would go on until it reaches these numbers in the following. However, it already finished after tile (10,4). Is there some documentation somewhere, where this is explained?


Summary:

Does this sound right to you? If yes, I think this issue can be closed. Thanks a lot for your help!

jenshannoschwalm commented 2 years ago

Your summary analysis looks OK.

The hint for tuning from @gi-man might also further help, especially with such a lot of tiling both settings might improve performance Individually.

2gb of graphics ram is very much on the low side.

luator commented 2 years ago

I did a comparison:

no tuning (from above)

(17x12) tiles with max dimensions 2376x2379, pinned=OFF, good 328x331 and overlap 1024 ... took 871.467 secs (873.504 CPU) processed `diffuse or sharpen' on GPU with tiling, blended on CPU [export]

memory size

(12x8) tiles with max dimensions 2536x2538, pinned=OFF, good 488x490 and overlap 1024 ... took 529.090 secs (529.970 CPU) processed `diffuse or sharpen' on GPU with tiling, blended on CPU [export]

memory transfer

(29x19) tiles with max dimensions 2244x2246, pinned=ON, good 196x198 and overlap 1024 (I did not wait for this to finish)

memory size and transfer

(16x11) tiles with max dimensions 2392x2395, pinned=ON, good 344x347 and overlap 1024 ... took 850.150 secs (853.016 CPU) processed `diffuse or sharpen' on GPU with tiling, blended on CPU [export]

So tuning for memory size brings the most improvement. However, it is still much slower than using the CPU (where no tiling is needed):

disabling OpenCL

took 57.010 secs (221.693 CPU) processed `diffuse or sharpen' on CPU, blended on CPU [export]

jenshannoschwalm commented 2 years ago

Your results are completely conclusive, unfortunately we can't predict the performance for modules in a reliable way so an automatic disabling of CL for a specific module is not possible, at least not now.

So I would suggest to close this issue for now.

luator commented 2 years ago

Thanks, I'm closing it. I think a manual option to disable OpenCL for the module would already be enough. But I saw that there is already a feature request for this (#5669).