AUTOMATIC1111 / stable-diffusion-webui

Stable Diffusion web UI
GNU Affero General Public License v3.0

[Feature Request]: Add System-level optimization for CPU inference to wiki #10514

Open LynxPDA opened 1 year ago

LynxPDA commented 1 year ago

Is there an existing issue for this?

What would your feature do?

Work on the CPU can be quite slow. Using some system optimizations borrowed from HuggingFace, I managed to increase the speed by 1.25x to 1.5x.

For my inference:

Proposed workflow

I added the following lines to the end of the webui-user.sh file:

export OMP_NUM_THREADS=16
export MKL_NUM_THREADS=16
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so:$LD_PRELOAD
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:60000,muzzy_decay_ms:60000"
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libiomp5.so:$LD_PRELOAD

Having previously installed
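A sketch of the install commands for the two preloaded libraries, as they are quoted later in this thread (Ubuntu/Debian package names; other distributions differ):

```shell
# jemalloc (provides libjemalloc.so)
sudo apt-get install -y libjemalloc-dev
# Intel MKL (provides libiomp5.so); note this is a large download
sudo apt-get install intel-mkl
```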

Additional information

Other system information:

COMMANDLINE_ARGS="--precision autocast --use-cpu all --no-half --opt-channelslast --skip-torch-cuda-test --enable-insecure-extension-access"

python: 3.10.6  •  torch: 2.1.0.dev20230506+cpu  •  xformers: N/A  •  gradio: 3.28.1  •  commit: 5ab7f213  •  checkpoint: b4391b7978

OS Ubuntu 22.04

catboxanon commented 1 year ago

The wiki is editable by anyone. https://github.com/AUTOMATIC1111/stable-diffusion-webui/discussions/10180

devingfx commented 1 year ago

Hi @LynxPDA !

Could you please explain a bit what the options do and how to tweak them for other CPU capabilities?

Thx

LynxPDA commented 1 year ago

Hi @LynxPDA !

Could you please explain a bit what the options do and how to tweak them for other CPU capabilities?

Thx

Hi @devingfx, certainly.


jemalloc is a memory allocator. When running on the CPU, the model weights live in RAM, so a little allocator tuning can speed up the work.

background_thread

Enabling jemalloc background threads generally improves the tail latency for application threads, since unused memory purging is shifted to the dedicated background threads. In addition, unintended purging delay caused by application inactivity is avoided with background threads.

Suggested: background_thread:true when jemalloc managed threads can be allowed.

metadata_thp

Allowing jemalloc to utilize transparent huge pages for its internal metadata usually reduces TLB misses significantly, especially for programs with large memory footprint and frequent allocation / deallocation activities. Metadata memory usage may increase due to the use of huge pages.

Suggested for allocation intensive programs: metadata_thp:auto or metadata_thp:always, which is expected to improve CPU utilization at a small memory cost.

dirty_decay_ms and muzzy_decay_ms

Decay time determines how fast jemalloc returns unused pages back to the operating system, and therefore provides a fairly straightforward trade-off between CPU and memory usage. A shorter decay time purges unused pages faster to reduce memory usage (usually at the cost of more CPU cycles spent on purging), and vice versa.

Suggested: tune the values based on the desired trade-offs.

More details on tuning and each of the parameters can be found at https://github.com/jemalloc/jemalloc/blob/dev/TUNING.md
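As a hedged illustration of that trade-off, two example MALLOC_CONF variants (the numbers are illustrative, not recommendations; tune per TUNING.md):

```shell
# Memory-lean variant: purge unused pages after 10 s (more purge CPU work)
export MALLOC_CONF="background_thread:true,dirty_decay_ms:10000,muzzy_decay_ms:10000"

# CPU-lean variant: keep pages around for 60 s, as in the first post
export MALLOC_CONF="background_thread:true,dirty_decay_ms:60000,muzzy_decay_ms:60000"
```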

And it is this setting that gives the maximum performance increase on the CPU.


As for libiomp5.so, it optimizes parallel processing on the CPU. This optimization also gives a small gain, but much less than jemalloc does.

I think these same settings will work for most other processors. The only requirements are probably having jemalloc installed and running on Linux.
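For other CPUs, the thread counts can be derived from the machine itself rather than hard-coded at 16; this is a hedged heuristic of mine, not part of the original recipe:

```shell
# Match OMP/MKL thread counts to the logical CPU count of this machine.
# nproc reports the number of processing units available to the process.
NCPU=$(nproc)
export OMP_NUM_THREADS=$NCPU
export MKL_NUM_THREADS=$NCPU
echo "Using $OMP_NUM_THREADS threads"
```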

devingfx commented 1 year ago

Hi, thx for this quick reply!

After more in depth reading, there is a typo on line ending:

dirty_decay_ms: 60000,muzzy_decay_ms:>

... looks like a nano copy/paste with line truncated ^^;


I meant also for NUM_THREADS part... I don't know how to figure out my potato PC capabilities... I know I don't have a GPU...

System:    Kernel: 5.15.0-58-generic x86_64 bits: 64 compiler: N/A Desktop: Cinnamon 5.2.7 
           wm: muffin dm: LightDM Distro: Linux Mint 20.3 Una base: Ubuntu 20.04 focal 
Machine:   Type: Desktop Mobo: Gigabyte model: B450M S2H serial: <filter> 
           UEFI: American Megatrends LLC. v: F62d date: 10/13/2021 
CPU:       Topology: 6-Core model: AMD Ryzen 5 1600 bits: 64 type: MT MCP arch: Zen rev: 1 
           L2 cache: 3072 KiB 
           flags: avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm bogomips: 76654 
           Speed: 1481 MHz min/max: 1550/3200 MHz Core speeds (MHz): 1: 1594 2: 1326 3: 1398 
           4: 1375 5: 1352 6: 1374 7: 1382 8: 1374 9: 1742 10: 1339 11: 1461 12: 1378 
Graphics:  Device-1: NVIDIA GK208B [GeForce GT 710] vendor: ASUSTeK GT710-SL-1GD5 driver: nvidia
           v: 470.161.03 bus ID: 0a:00.0 chip ID: 10de:128b 
           Display: x11 server: X.Org 1.20.13 driver: nvidia 
           unloaded: fbdev,modesetting,nouveau,vesa resolution: 1920x1080~60Hz, 1920x1080~60Hz 
           OpenGL: renderer: NVIDIA GeForce GT 710/PCIe/SSE2 v: 4.6.0 NVIDIA 470.161.03 
           direct render: Yes 
devingfx commented 1 year ago

Is the intel-mkl package needed for non-intel CPU?

devingfx commented 1 year ago

I get :

################################################################
ERROR: ld.so: object '/usr/lib/x86_64-linux-gnu/libiomp5.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
<jemalloc>: Invalid conf value: muzzy_decay_ms:>

I tried with muzzy_decay_ms:60000, same errors...

LynxPDA commented 1 year ago

Hi, thx for this quick reply!

After more in depth reading, there is a typo on line ending:

dirty_decay_ms: 60000,muzzy_decay_ms:>

... looks like a nano copy/paste with line truncated ^^;

I meant also for NUM_THREADS part... I don't know how to figure out my potato PC capabilities... I know I don't have a GPU...

System:    Kernel: 5.15.0-58-generic x86_64 bits: 64 compiler: N/A Desktop: Cinnamon 5.2.7 
           wm: muffin dm: LightDM Distro: Linux Mint 20.3 Una base: Ubuntu 20.04 focal 
Machine:   Type: Desktop Mobo: Gigabyte model: B450M S2H serial: <filter> 
           UEFI: American Megatrends LLC. v: F62d date: 10/13/2021 
CPU:       Topology: 6-Core model: AMD Ryzen 5 1600 bits: 64 type: MT MCP arch: Zen rev: 1 
           L2 cache: 3072 KiB 
           flags: avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm bogomips: 76654 
           Speed: 1481 MHz min/max: 1550/3200 MHz Core speeds (MHz): 1: 1594 2: 1326 3: 1398 
           4: 1375 5: 1352 6: 1374 7: 1382 8: 1374 9: 1742 10: 1339 11: 1461 12: 1378 
Graphics:  Device-1: NVIDIA GK208B [GeForce GT 710] vendor: ASUSTeK GT710-SL-1GD5 driver: nvidia
           v: 470.161.03 bus ID: 0a:00.0 chip ID: 10de:128b 
           Display: x11 server: X.Org 1.20.13 driver: nvidia 
           unloaded: fbdev,modesetting,nouveau,vesa resolution: 1920x1080~60Hz, 1920x1080~60Hz 
           OpenGL: renderer: NVIDIA GeForce GT 710/PCIe/SSE2 v: 4.6.0 NVIDIA 470.161.03 
           direct render: Yes 

I'm sorry, you're absolutely right, the line is truncated at the end when copying from the nano editor.

Corrected. The correct option is below.

export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:60000,muzzy_decay_ms:60000"

LynxPDA commented 1 year ago

Is the intel-mkl package needed for non-intel CPU?

Yes, intel-mkl can work with an AMD processor too. However, it is quite large. You can try installing the OpenMP runtime separately. If I'm not mistaken, it also ships as part of LLVM/Clang.

sudo apt-get install libomp-dev

devingfx commented 1 year ago

I get :

################################################################
ERROR: ld.so: object '/usr/lib/x86_64-linux-gnu/libiomp5.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
<jemalloc>: Invalid conf value: muzzy_decay_ms:>

I tried with muzzy_decay_ms:60000, same errors...

You didn't answer this...

Looks like setting muzzy/dirty to 30000 bypasses the Invalid conf error, but I still have the ld.so error above... I haven't (yet) installed the Intel package; is it related?

devingfx commented 1 year ago

Another subject:

Did you get ControlNet to work on CPU? Looks like preprocessors are using GPU (I have a cuda out of memory error) despite the --use-cpu all param...

LynxPDA commented 1 year ago

Another subject:

Did you get ControlNet to work on CPU? Looks like preprocessors are using GPU (I have a cuda out of memory error) despite the --use-cpu all param...

Yes, ControlNet worked for me on the CPU without problems, with the command line arguments: --use-cpu all --no-half

If you are not using CUDA at all, you can install a CPU-only nightly build of PyTorch in the venv. You're right, with the command line argument --use-cpu all it should work.

LynxPDA commented 1 year ago

ERROR: ld.so: object

Regarding: ERROR: ld.so: object '/usr/lib/x86_64-linux-gnu/libiomp5.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored. Yes, please make sure that the library is available in the specified path, otherwise you need to install it, as I wrote earlier.

As I wrote earlier, you can try to install the missing library separately, without installing intel-mkl using the command below

sudo apt-get install libomp-dev
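To avoid the "cannot be preloaded" warning entirely, the preload can be made conditional on the library actually existing; a small sketch (the path is the Ubuntu default from the first post):

```shell
# Only add libiomp5.so to LD_PRELOAD if it exists, so ld.so does not
# print the "cannot be preloaded ... ignored" warning at startup.
LIBIOMP=/usr/lib/x86_64-linux-gnu/libiomp5.so
if [ -f "$LIBIOMP" ]; then
    export LD_PRELOAD="$LIBIOMP:$LD_PRELOAD"
else
    echo "libiomp5.so not found; skipping preload" >&2
fi
```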

hollowshiroyuki commented 1 year ago

Hi, thanks for your optimizations! I went from 30 s/it to about 13 s/it on my i7-1260P. But I have the feeling there is still room for improvement on my setup: btop tells me python is using about 40% CPU and I still have about 4 GB of RAM free. Could you give me advice on how I should tweak your settings, or what I could add to get even better performance?

devingfx commented 1 year ago

Hi all!

On my side I could not get everything working on CPU. I get CUDA out of memory errors quite often (even though it should not use CUDA in CPU-only mode, right?) on image generation (if I restart, it works the second time), and I cannot use (from what I analysed) the "side models" like RealESRGAN, ControlNet preprocessing or faceswap, for example...

I installed A1111 with the default config the first time, then tweaked webui-user.sh afterward (notably with --use-cpu all), so my question: is there a special install process/config for all the venv stuff (like PyTorch) to be used on CPU only?

Also, maybe an issue with my hardware: I do have a GPU, but with only 2 GB of RAM, so I want to use the CPU; maybe there is some automatic detection that found a GPU and wants to use it?

PS: also, after a recent click on the "update all but you don't know what's going on" button in the extensions tab, image generation stopped working :( I plan to do a fresh reinstall, which is why I asked whether I should follow a special process this time.

LynxPDA commented 1 year ago

Hi, thanks for your optimizations! I went from 30 s/it to about 13 s/it on my i7-1260P. But I have the feeling there is still room for improvement on my setup: btop tells me python is using about 40% CPU and I still have about 4 GB of RAM free. Could you give me advice on how I should tweak your settings, or what I could add to get even better performance?

Please clarify with what parameters the results of 30 s/it and 13 s/it were obtained: sampler, steps, resolution, etc.

The amount of free RAM affects the maximum resolution you can generate more than the generation speed.

As part of the optimization, you can try the following actions:

  1. Install a nightly build of PyTorch for CPU only; for me this gave an additional speed gain.
  2. Try binding the process to only the performance cores using numactl. For example, llama.cpp had an issue where excluding the efficiency cores improved performance severalfold, presumably because the performance cores have to wait for the efficiency cores to finish. https://github.com/ggerganov/llama.cpp/discussions/572
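A hypothetical numactl invocation for that second point might look like this (the core numbering 0-7 is an assumption; check which cores are the performance cores on your machine with lscpu):

```shell
# Hypothetical: pin the webui process to CPUs 0-7 (assumed to be the
# performance cores on a hybrid CPU) and allocate memory locally.
numactl --physcpubind=0-7 --localalloc ./webui.sh
```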
LynxPDA commented 1 year ago

Hi all!

On my side I could not get everything working on CPU. I get CUDA out of memory errors quite often (even though it should not use CUDA in CPU-only mode, right?) on image generation (if I restart, it works the second time), and I cannot use (from what I analysed) the "side models" like RealESRGAN, ControlNet preprocessing or faceswap, for example...

I installed A1111 with the default config the first time, then tweaked webui-user.sh afterward (notably with --use-cpu all), so my question: is there a special install process/config for all the venv stuff (like PyTorch) to be used on CPU only?

Also, maybe an issue with my hardware: I do have a GPU, but with only 2 GB of RAM, so I want to use the CPU; maybe there is some automatic detection that found a GPU and wants to use it?

PS: also, after a recent click on the "update all but you don't know what's going on" button in the extensions tab, image generation stopped working :( I plan to do a fresh reinstall, which is why I asked whether I should follow a special process this time.

As an option, I can suggest installing a CPU-only nightly build of PyTorch in your virtual environment.

For example, step by step:

  1. Copy the repository to a new folder git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui
  2. Go to the repository and install a CPU-only nightly build of Pytorch into your virtualenv:
    • python3 -m venv venv - create the installer env
    • source venv/bin/activate - activate installer env
    • pip3 install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cpu - install Pytorch
    • deactivate - deactivate the python venv
  3. Uncomment the line export COMMANDLINE_ARGS= in the webui-user.sh file and enter export COMMANDLINE_ARGS="--api --precision autocast --use-cpu all --no-half --opt-channelslast --skip-torch-cuda-test --enable-insecure-extension-access"
  4. Add optional optimizations mentioned in the first post
  5. Add models and run the webui.sh file as usual.
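The numbered steps above can be sketched as a single shell session (repository URL and nightly index URL as given in this thread):

```shell
#!/bin/sh
set -e  # stop on the first error

# 1. Clone the repository into a fresh folder
git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui
cd stable-diffusion-webui

# 2. Create the venv and install a CPU-only nightly build of PyTorch
python3 -m venv venv
. venv/bin/activate
pip3 install --pre torch torchvision \
    --index-url https://download.pytorch.org/whl/nightly/cpu
deactivate

# 3. Put the CPU launch flags into COMMANDLINE_ARGS in webui-user.sh
# 4./5. Add the optional exports from the first post, drop in models,
#       then start with ./webui.sh as usual
```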
devingfx commented 1 year ago

Hi! Thx a lot for helping newbies, you rocks!

For information: There is a typo on --skip -torch-cuda-test it is --skip-torch-cuda-test (no space)


MalfreCryvertia commented 1 year ago

export OMP_NUM_THREADS=16
export MKL_NUM_THREADS=16
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so:$LD_PRELOAD
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:60000,muzzy_decay_ms:60000"
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libiomp5.so:$LD_PRELOAD

So far I can follow this on Windows (not Linux, as most directions assume), but how do I install jemalloc, libomp, MKL, etc. for optimization like in the first post?

sangenan commented 7 months ago

@LynxPDA When I updated this repository to version 1.7, the method failed and could not accelerate normally. Do you have this problem?

LynxPDA commented 7 months ago

@LynxPDA When I updated this repository to version 1.7, the method failed and could not accelerate normally. Do you have this problem?

No, unfortunately there is no way to check this now. What error does the console fail with? You can try commenting out one line at a time and see at what stage the error appears.

These parameters affect memory management more than the program itself; perhaps there is a problem with the multithreading library lines:

export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libiomp5.so:$LD_PRELOAD
export OMP_NUM_THREADS=16
export MKL_NUM_THREADS=16

Disabling them will slightly reduce the acceleration.

VeeDel commented 4 months ago

Hey, can anyone help me? I am using Stable Diffusion on a virtual GPU (vast.ai), and I am not getting all of the API endpoints like txt2img, even though I set the command line arg --api in the .bat file and exported COMMANDLINE_ARGS with --api in webui-user.sh. The file manager this website provides is a Jupyter notebook. I am very distressed by this issue; please, someone help me.

lalala-233 commented 4 months ago

It doesn't work on my computer.

lalala@lalala-arch
OS: Arch Linux x86_64
Kernel: 6.8.7-arch1-1
Uptime: 7 days, 54 mins
Packages: 2421 (pacman)
Shell: zsh 5.9
Resolution: 1920x1080
DE: Plasma 6.0.4
WM: KWin
Theme: Breeze [GTK2/3]
Icons: breeze [GTK2/3]
Terminal: yakuake
CPU: Intel Xeon E3-1240L v5 (8) @ 3.200GHz
GPU: NVIDIA GeForce GT 730
Memory: 24980MiB / 32047MiB
./webui.sh --use-cpu all --skip-torch-cuda-test --no-half --precision full --opt-split-attention --listen --no-hashing  --enable-insecure-extension-access
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [01:21<00:00, 10.22s/it]
export LD_PRELOAD=/usr/lib/libjemalloc.so:$LD_PRELOAD
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms: 60000,muzzy_decay_ms:60000"
./webui.sh --use-cpu all --skip-torch-cuda-test --no-half --precision autocast --opt-split-attention     --listen --no-hashing --enable-insecure-extension-access
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [01:21<00:00, 10.20s/it]
export OMP_NUM_THREADS=8
export MKL_NUM_THREADS=8
export LD_PRELOAD=/usr/lib/libjemalloc.so:$LD_PRELOAD
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms: 60000,muzzy_decay_ms:60000"
export LD_PRELOAD=/opt/intel/oneapi/lib/intel64/libiomp5.so:$LD_PRELOAD
./webui.sh --use-cpu all --skip-torch-cuda-test --no-half --precision autocast --opt-split-attention     --listen --no-hashing --enable-insecure-extension-access
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [01:23<00:00, 10.40s/it]
1girl,flat chest,castle,<lora:LCM_LoRA_Weights_SD15:1>
Steps: 8, Sampler: LCM, Schedule type: Automatic, CFG scale: 1.5, Seed: 3250536676, Size: 512x512, Model: AnythingV5Ink_v5PrtRE_2, Hypertile U-Net: True, Hypertile VAE: True, Version: v1.9.3
LynxPDA commented 4 months ago

@lalala-233 Please check that libjemalloc.so and libiomp5.so are installed and located at the specified addresses. Perhaps in Arch Linux their location is different from Ubuntu.
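A portable way to check this, instead of guessing distribution-specific paths, is to ask the dynamic linker and search the usual install prefixes; a small sketch:

```shell
# Find where the distribution actually installed the libraries.
# ldconfig -p lists every shared library known to the dynamic linker.
JEMALLOC_PATH=$(ldconfig -p 2>/dev/null | awk '/libjemalloc\.so/ {print $NF; exit}')
IOMP_PATH=$(find /usr/lib /opt/intel -name 'libiomp5.so' 2>/dev/null | head -n 1)
echo "jemalloc: ${JEMALLOC_PATH:-not found}"
echo "libiomp5: ${IOMP_PATH:-not found}"
```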

lalala-233 commented 4 months ago

Yeah, there are many differences between Arch Linux and Ubuntu.

% pacman -Ss jemalloc
extra/jemalloc 1:5.3.0-3 [installed]
    General-purpose scalable concurrent malloc implementation

% pacman -Ss intel-mkl
extra/intel-oneapi-mkl 2023.2.0_49495-2 [installed]
    Intel oneAPI Math Kernel Library

After I installed these packages, I find their location different from Ubuntu.

% locate libjemalloc.so
/usr/lib/libjemalloc.so
/usr/lib/libjemalloc.so.2

% locate libiomp5.so   
/opt/intel/oneapi/compiler/2023.2.0/linux/compiler/lib/intel64_lin/libiomp5.so
/opt/intel/oneapi/lib/intel64/libiomp5.so

I think I installed the true packages, but...

export OMP_NUM_THREADS=8
export MKL_NUM_THREADS=8
export LD_PRELOAD=/usr/lib/libjemalloc.so:$LD_PRELOAD
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms: 60000,muzzy_decay_ms:60000"
export LD_PRELOAD=/opt/intel/oneapi/lib/intel64/libiomp5.so:$LD_PRELOAD
./webui.sh --use-cpu all --skip-torch-cuda-test --no-half --precision autocast --opt-split-attention     --listen --no-hashing --enable-insecure-extension-access
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [01:23<00:00, 10.40s/it]

This barely changed on my computer. Perhaps it will be more effective when generating larger resolutions.

I read the link you mentioned.

Work on the CPU can be quite slow. Using some system optimizations borrowed from HuggingFace, I managed to increase the speed by 1.25x to 1.5x.

jemalloc and tcmalloc are equally interesting. Here, I'm installing jemalloc as my tests give it a slight performance edge. It can also be tweaked for a particular workload, for example to maximize CPU utilization.

However, webui itself uses tcmalloc as the memory allocator, so the gain from switching to jemalloc is likely to be limited.

LynxPDA commented 4 months ago

@lalala-233 You can also try changing the launch code:

export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
export LD_PRELOAD="/usr/lib/libjemalloc.so /opt/intel/oneapi/lib/intel64/libiomp5.so"
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:60000,muzzy_decay_ms:60000"
echo $LD_PRELOAD
ldd ./webui.sh --use-cpu all --skip-torch-cuda-test --no-half --precision autocast --opt-split-attention     --listen --no-hashing --enable-insecure-extension-access
  1. Xeon E3-1240L seems to have only 4 cores, changed the number in NUM_THREADS, although I don’t think this will affect the speed.
  2. Slightly changed the command for exporting preloaded libraries.
  3. echo $LD_PRELOAD and ldd should show that loading of both libraries is successful.

p.s. Yes, tcmalloc is used, but it's all about specific memory management settings.
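To confirm that jemalloc is actually being picked up (rather than silently ignored, as with the earlier ld.so warning), jemalloc can be asked to dump its statistics when a process exits; the path below is the Arch location mentioned above:

```shell
# With MALLOC_CONF="stats_print:true", a preloaded jemalloc prints its
# statistics to stderr at process exit; no output means it did not load.
JEMALLOC=/usr/lib/libjemalloc.so
if [ -f "$JEMALLOC" ]; then
    LD_PRELOAD=$JEMALLOC MALLOC_CONF="stats_print:true" /bin/true 2>&1 | head -n 2
else
    echo "jemalloc not installed at $JEMALLOC"
fi
```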

VeeDel commented 4 months ago

will anyone address my issue..?

LynxPDA commented 4 months ago

@VeeDel This issue is about optimization for CPU inference. Perhaps you should find a similar issue or create a new one.

lalala-233 commented 4 months ago

@LynxPDA

  1. 8 -> 4 doesn't affect the speed.
  2. There is little change in performance.
  3. ldd ./webui.sh gives the error not a dynamic executable

If I add NO_TCMALLOC="True", the performance is slightly better, but still worse than not adding any parameters. I think it's probably because my CPU is already fully utilized by default.

hemant446 commented 4 months ago

I need help installing on AMD. Can you please tell me how I should run sudo apt-get install -y libjemalloc-dev and sudo apt-get install intel-mkl on Windows with an AMD CPU? Do I need to install them in the stable-diffusion folder as well? I already did the venv and the rest.