darktable-org / darktable

darktable is an open source photography workflow application and raw developer
https://www.darktable.org
GNU General Public License v3.0
9.52k stars 1.12k forks source link

Performance regression with atrous module and newer kernels (e.g. 6.7.12 or 6.10.6) compared to 6.6.47 #17397

Open mabod opened 2 weeks ago

mabod commented 2 weeks ago

Describe the bug

The pixel pipeline processing with opencl disabled takes considerably longer with kernel 6.10.6 compared to 6.6.47.

With kernel 6.6.47:

     4,2765 [dev_process_export] pixel pipeline processing took 3,811 secs (81,883 CPU)

With kernel 6.10.6:

     5,2151 [dev_process_export] pixel pipeline processing took 4,773 secs (102,452 CPU)

The performance regression does not happen with opencl enabled.

The issue is repoducible with stock arch kernels

linux 6.10.6.arch1-1
linux-lts 6.6.47-1

and with linux-tkg kernels

linux610-tkg-eevdf-generic_v3 6.10.7-273.1
linux66-tkg-eevdf-generic_v3 6.6.48-273.1

I can not tell if it started before the 6.10 series.

Steps to reproduce

darktable-cli bench.SRW /tmp/aaa.jpg --core --disable-opencl --configdir /tmp -d all

resp.

darktable-cli bench.SRW /tmp/aaa.jpg --core --disable-opencl --configdir /tmp -d perf

bench.SRW and xmp file can be found here:

https://drive.google.com/drive/folders/1cfV2b893JuobVwGiZXcaNv5-yszH6j-N?usp=sharing

Expected behavior

darktable should show same performance for 6.10.6 and 6.6.47 kernel.

Logfile | Screenshot | Screencast

Full logs with -d all:

6.6.47-full-log.txt 6.10.6-full-log.txt

short logs with -d perf: kernel 6.6.47 with -d perf:

darktable 4.8.1
Copyright (C) 2012-2024 Johannes Hanika and other contributors.

Compile options:
  Bit depth              -> 64 bit
  Debug                  -> DISABLED
  SSE2 optimizations     -> ENABLED
  OpenMP                 -> ENABLED
  OpenCL                 -> ENABLED
  Lua                    -> ENABLED  - API version 9.3.0
  Colord                 -> ENABLED
  gPhoto2                -> ENABLED
  GMIC                   -> ENABLED  - Compressed LUTs are supported
  GraphicsMagick         -> ENABLED
  ImageMagick            -> DISABLED
  libavif                -> ENABLED
  libheif                -> ENABLED
  libjxl                 -> ENABLED
  OpenJPEG               -> ENABLED
  OpenEXR                -> ENABLED
  WebP                   -> ENABLED

See https://www.darktable.org/resources/ for detailed documentation.
See https://github.com/darktable-org/darktable/issues/new/choose to report bugs.

     0,0145 [dt_get_sysresource_level] switched to 1 as `default'
     0,0145   total mem:       64218MB
     0,0145   mipmap cache:    8027MB
     0,0145   available mem:   32109MB
     0,0145   singlebuff:      501MB
     0.0344 [opencl_init] opencl disabled due to explicit user request
     0.0344 [opencl_init] FINALLY: opencl PREFERENCE=ON is NOT AVAILABLE and NOT ENABLED.
     0,4141 [dt_dev_load_raw] loading the image. took 0,103 secs (0,141 CPU)
     0,4654 [export] creating pixelpipe took 0,048 secs (0,635 CPU)
     0,4710 [dev_pixelpipe] took 0,005 secs (0,060 CPU) initing base buffer [export]
     0,4778 [dev_pixelpipe] took 0,007 secs (0,107 CPU) [export] processed `rawprepare' on CPU, blended on CPU
     0,4850 [dev_pixelpipe] took 0,007 secs (0,103 CPU) [export] processed `temperature' on CPU, blended on CPU
     0,5038 [dev_pixelpipe] took 0,019 secs (0,476 CPU) [export] processed `highlights' on CPU, blended on CPU
     0,6445 [dev_pixelpipe] took 0,141 secs (2,964 CPU) [export] processed `demosaic' on CPU, blended on CPU
     0,9898 [dev_pixelpipe] took 0,345 secs (4,012 CPU) [export] processed `tonemap' on CPU, blended on CPU
     1,1660 [dev_pixelpipe] took 0,176 secs (3,233 CPU) [export] processed `lens' on CPU, blended on CPU
     1,1923 [dev_pixelpipe] took 0,026 secs (0,585 CPU) [export] processed `basecurve' on CPU, blended on CPU
     1,2111 [dev_pixelpipe] took 0,019 secs (0,476 CPU) [export] processed `colorin' on CPU, blended on CPU
     1,2631 [dev_pixelpipe] took 0,052 secs (1,228 CPU) [export] processed `colorreconstruct' on CPU, blended on CPU
     3,2125 [dev_pixelpipe] took 1,949 secs (44,624 CPU) [export] processed `nlmeans' on CPU, blended on CPU
     3,2898 [dev_pixelpipe] took 0,077 secs (1,829 CPU) [export] processed `globaltonemap' on CPU, blended on CPU
     3,4201 [dev_pixelpipe] took 0,130 secs (2,310 CPU) [export] processed `shadhi' on CPU, blended on CPU
     4,0548 [dev_pixelpipe] took 0,635 secs (14,597 CPU) [export] processed `atrous' on CPU, blended on CPU
     4,0920 [dev_pixelpipe] took 0,037 secs (0,860 CPU) [export] processed `bilat' on CPU, blended on CPU
     4,1425 [dev_pixelpipe] took 0,051 secs (1,199 CPU) [export] processed `colorzones' on CPU, blended on CPU
     4,1682 [dev_pixelpipe] took 0,026 secs (0,616 CPU) [export] processed `levels' on CPU, blended on CPU
     4,1981 [dev_pixelpipe] took 0,030 secs (0,703 CPU) [export] processed `sharpen' on CPU, blended on CPU
     4,2157 [dev_pixelpipe] took 0,018 secs (0,395 CPU) [export] processed `colorcontrast' on CPU, blended on CPU
     4,2495 [dev_pixelpipe] took 0,034 secs (0,794 CPU) [export] processed `colorout' on CPU, blended on CPU
     4,2764 [resample_plain] took 0,027 secs (0,711 CPU) 1:1 copy/crop of 5490x3660 pixels
     4,2765 [dev_pixelpipe] took 0,027 secs (0,712 CPU) [export] processed `finalscale' on CPU, blended on CPU
     4,2765 [dev_process_export] pixel pipeline processing took 3,811 secs (81,883 CPU)
     4,4968 [export_job] exported to `/tmp/aaa_03.jpg'

kernel.6.10.6 with -d perf

darktable 4.8.1
Copyright (C) 2012-2024 Johannes Hanika and other contributors.

Compile options:
  Bit depth              -> 64 bit
  Debug                  -> DISABLED
  SSE2 optimizations     -> ENABLED
  OpenMP                 -> ENABLED
  OpenCL                 -> ENABLED
  Lua                    -> ENABLED  - API version 9.3.0
  Colord                 -> ENABLED
  gPhoto2                -> ENABLED
  GMIC                   -> ENABLED  - Compressed LUTs are supported
  GraphicsMagick         -> ENABLED
  ImageMagick            -> DISABLED
  libavif                -> ENABLED
  libheif                -> ENABLED
  libjxl                 -> ENABLED
  OpenJPEG               -> ENABLED
  OpenEXR                -> ENABLED
  WebP                   -> ENABLED

See https://www.darktable.org/resources/ for detailed documentation.
See https://github.com/darktable-org/darktable/issues/new/choose to report bugs.

     0,0140 [dt_get_sysresource_level] switched to 1 as `default'
     0,0141   total mem:       64217MB
     0,0141   mipmap cache:    8027MB
     0,0141   available mem:   32108MB
     0,0141   singlebuff:      501MB
     0.0353 [opencl_init] opencl disabled due to explicit user request
     0.0353 [opencl_init] FINALLY: opencl PREFERENCE=ON is NOT AVAILABLE and NOT ENABLED.
     0,3930 [dt_dev_load_raw] loading the image. took 0,099 secs (0,101 CPU)
     0,4419 [export] creating pixelpipe took 0,045 secs (0,650 CPU)
     0,4475 [dev_pixelpipe] took 0,005 secs (0,013 CPU) initing base buffer [export]
     0,4543 [dev_pixelpipe] took 0,007 secs (0,112 CPU) [export] processed `rawprepare' on CPU, blended on CPU
     0,4614 [dev_pixelpipe] took 0,007 secs (0,136 CPU) [export] processed `temperature' on CPU, blended on CPU
     0,4800 [dev_pixelpipe] took 0,019 secs (0,399 CPU) [export] processed `highlights' on CPU, blended on CPU
     0,6225 [dev_pixelpipe] took 0,142 secs (3,020 CPU) [export] processed `demosaic' on CPU, blended on CPU
     0,9465 [dev_pixelpipe] took 0,324 secs (4,037 CPU) [export] processed `tonemap' on CPU, blended on CPU
     1,1165 [dev_pixelpipe] took 0,170 secs (3,260 CPU) [export] processed `lens' on CPU, blended on CPU
     1,1451 [dev_pixelpipe] took 0,029 secs (0,668 CPU) [export] processed `basecurve' on CPU, blended on CPU
     1,1884 [dev_pixelpipe] took 0,043 secs (0,940 CPU) [export] processed `colorin' on CPU, blended on CPU
     1,2633 [dev_pixelpipe] took 0,075 secs (1,606 CPU) [export] processed `colorreconstruct' on CPU, blended on CPU
     3,2736 [dev_pixelpipe] took 2,010 secs (44,790 CPU) [export] processed `nlmeans' on CPU, blended on CPU
     3,3438 [dev_pixelpipe] took 0,070 secs (1,616 CPU) [export] processed `globaltonemap' on CPU, blended on CPU
     3,4752 [dev_pixelpipe] took 0,131 secs (2,284 CPU) [export] processed `shadhi' on CPU, blended on CPU
     4,9645 [dev_pixelpipe] took 1,489 secs (33,736 CPU) [export] processed `atrous' on CPU, blended on CPU
     5,0024 [dev_pixelpipe] took 0,038 secs (0,874 CPU) [export] processed `bilat' on CPU, blended on CPU
     5,0576 [dev_pixelpipe] took 0,055 secs (1,321 CPU) [export] processed `colorzones' on CPU, blended on CPU
     5,0830 [dev_pixelpipe] took 0,025 secs (0,543 CPU) [export] processed `levels' on CPU, blended on CPU
     5,1188 [dev_pixelpipe] took 0,036 secs (0,796 CPU) [export] processed `sharpen' on CPU, blended on CPU
     5,1385 [dev_pixelpipe] took 0,020 secs (0,464 CPU) [export] processed `colorcontrast' on CPU, blended on CPU
     5,1879 [dev_pixelpipe] took 0,049 secs (1,196 CPU) [export] processed `colorout' on CPU, blended on CPU
     5,2151 [resample_plain] took 0,027 secs (0,637 CPU) 1:1 copy/crop of 5490x3660 pixels
     5,2151 [dev_pixelpipe] took 0,027 secs (0,637 CPU) [export] processed `finalscale' on CPU, blended on CPU
     5,2151 [dev_process_export] pixel pipeline processing took 4,773 secs (102,452 CPU)
     5,4228 [export_job] exported to `/tmp/aaa_05.jpg'

Commit

No response

Where did you obtain darktable from?

distro packaging

darktable version

4.8.1

What OS are you using?

Linux

What is the version of your OS?

Endeavouros (arch)

Describe your system?

  Info: 12-core model: AMD Ryzen 9 5900X bits: 64 type: MT MCP cache:  L2: 6 MiB
  Speed (MHz): avg: 1187 min/max: 550/4951 cores: 1: 550 2: 550 3: 550
  Device-1: AMD Navi 23 [Radeon RX 6650 XT / 6700S 6800S] driver: amdgpu

Are you using OpenCL GPU in darktable?

No

If yes, what is the GPU card and driver?

No response

Please provide additional context if applicable. You can attach files too, but might need to rename to .txt or .zip

No response

mabod commented 2 weeks ago

Seems to me that 'atrous' on CPU has an issue with kernel 6.10

6.6.47:

4,0548 [dev_pixelpipe] took 0,635 secs (14,597 CPU) [export] processed 'atrous' on CPU, blended on CPU

6.10.6:

4,9645 [dev_pixelpipe] took 1,489 secs (33,736 CPU) [export] processed 'atrous' on CPU, blended on CPU

mabod commented 2 weeks ago

I do not know if the relevant change came in 6.10 or 6.9 or 6.8 or 6.7. I only see that 'atrous' on CPU takes 2 times longer with kernel 6.10 than it takes with kernel 6.6. I do not know what to do more other than reporting this finding.

And I am honestly surprised that the kernel has so much influence on the performance anyways. Converting bench.SRW takes 3.8 s with kernel 6.6 but 4.7 s with kernel 6.10. That is significant.

I do not know how to debug this further.

mabod commented 2 weeks ago

I do not have a stock kernel 6.7.12. But I compiled 6.7.12 with linux-tkg and it has the same performance issue like kernel 6.10.6. The linux-tkg kernel 6.6.48 does not have that performance issue and also not stock arch LTS kernel 6.6.47.

Looks like the relevant change was introduce with kernel series 6.7

parafin commented 2 weeks ago

Maybe some new CPU-bug mitigation was introduced? Compare the output of this command with old and new kernel:

cat /sys/devices/system/cpu/vulnerabilities/*
mabod commented 2 weeks ago

Maybe some new CPU-bug mitigation was introduced? Compare the output of this command with old and new kernel:

cat /sys/devices/system/cpu/vulnerabilities/*

No change. 6.7.12 and 6.6.47 are the same.

They could change some task scheduling

That is a possibility. I am currently checking the change logs etc. to find evidence. So far I found:

https://lore.kernel.org/lkml/ZTz9RpZxfxysYCmt@gmail.com/

But on the other hand, if it is a scheduler thing, why does it only affect 'atrous' on CPU? It should affect all modules in the pixelpipe. But all the other modules take more or less the same time. Just atrous has double execution time.

mabod commented 2 weeks ago

I did a test with a changed scheduler but it makes no difference. I tested

amd_pstate=active with governor "powersave" and energy_performance_preference=balance_performance

and

amd_pstate=passive with governor "ondemand"

Both give the same result and confirm that starting with kernel series 6.7 the 'atrous' module consumes double the amount of time in the pixelpipe. All other modules are not affected by the kernel version change.

mabod commented 1 week ago

In the meantime I also tested kernel 6.5.13. The result is as expected. darktable is as fast as with kernel 6.6. The initial finding is confirmed:

Starting with kernel series 6.7 darktable looses ca. 23 % performance when exporting pictures. Exporting takes ca. 3.8 s per picture with kernel 6.6 and earlier, but 4.7 s per picture with kernel 6.7 and later. And the "atrous" module seems to be the biggest contributor to this issue.

For me it is really a surprise that the kernel has such a significant impact on darktable performance. May be this is memory management related. I dont know. Can somebody please look into this? May be there is a workaround or a simple cure for this performance drop.