Model loads to RAM and infers on CPU despite enabling Vulkan backend

LSXAxeller commented 1 week ago

I am attempting to use the Vulkan backend for inference. I have explicitly enabled the Vulkan backend (Backends.VulkanBackend.IsEnabled = true;), disabled all other backends (CPU and CUDA), and installed the required Vulkan backend NuGet package (StableDiffusion.NET.Backend.Vulkan), but the model still loads into RAM and inference runs on the CPU.

I have verified that my GPU (RX580 4GB) and Vulkan drivers are functioning correctly by successfully running the latest stable-diffusion.cpp binary with Vulkan, which loads the model onto the GPU as expected.

Model Loading Code:

Backends.CpuBackend.IsEnabled = false; 
Backends.CudaBackend.IsEnabled = false; 
Backends.VulkanBackend.IsEnabled = true; 
await Task.Run(() =>
{
    Model = ModelBuilder.StableDiffusion(modelPath)
        .WithVae(vaePath)
        .WithControlNet(controlNetPath)
        .WithLoraSupport(loraPath)
        .WithMultithreading(Environment.ProcessorCount)
        .Build();

    return Task.CompletedTask;
});

System Information:

GPU: RX580 4GB
RAM: 16GB
CPU: I5-11400F
OS: Windows 11

Expected Behavior:

The Stable Diffusion model should load onto the GPU and inference should be performed using Vulkan, resulting in faster inference times.

Actual Behavior:

The model loads into RAM and inference is performed on the CPU, leading to slower performance.

DarthAffe commented 1 week ago

I'm not able to reproduce this (using an NVIDIA GPU); it is definitely using the Vulkan backend and behaves exactly the same as the stable-diffusion.cpp executable. If you check the sd.cpp logs (for example with StableDiffusionCpp.Log += (_, args) => Console.WriteLine($"LOG [{args.Level}]: {args.Text}");), one of the first lines should be: LOG [Debug]: stable-diffusion.cpp:166 - Using Vulkan backend

A second thing would be to check Backends.AvailableBackends to see whether the Vulkan backend could be loaded.
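The log check suggested above can be automated. The following is a minimal sketch, not part of StableDiffusion.NET: UsesVulkanBackend is a hypothetical helper, and the sample lines stand in for text that would be collected from the Log event.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not part of StableDiffusion.NET): returns true when any
// collected log line reports that sd.cpp selected the Vulkan backend.
static bool UsesVulkanBackend(IEnumerable<string> logLines) =>
    logLines.Any(line => line.Contains("Using Vulkan backend", StringComparison.Ordinal));

// In a real application the lines would be collected from the Log event, e.g.
//   StableDiffusionCpp.Log += (_, args) => lines.Add(args.Text);
// Here two lines from the logs in this thread are used as sample input.
var lines = new List<string>
{
    "stable-diffusion.cpp:166  - Using Vulkan backend",
    "stable-diffusion.cpp:235  - Version: SD 1.x"
};

Console.WriteLine(UsesVulkanBackend(lines)
    ? "Vulkan backend active"
    : "Vulkan backend NOT reported");
```

If the expected line never shows up, the backend was probably selected before the Vulkan switch was set, or the Vulkan native library could not be loaded.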

LSXAxeller commented 1 week ago

Looks like the native library gets loaded at the first call to StableDiffusionCpp. I am calling StableDiffusionCpp.Progress += OnProgressChanged; in my View and only later setting the active backends in the model-loading service. Can this be avoided, for example by not loading the native library until model loading actually starts?

Backends.CpuBackend.IsEnabled = false; 
Backends.CudaBackend.IsEnabled = false; 
Backends.VulkanBackend.IsEnabled = true; 
await Task.Run(() =>
{
    Model = ModelBuilder.StableDiffusion(modelPath)
        .WithMultithreading(Environment.ProcessorCount)
        .Build();

    return Task.CompletedTask;
});
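The ordering problem described above can be illustrated without the library. This is only a simulation under the assumption that the backend is chosen once, at first access: SelectedBackend and the Lazy<string> stand-in are hypothetical and do not reflect how StableDiffusion.NET is implemented internally.

```csharp
using System;

// Simulates a native library whose backend is chosen once, on first access.
// touchBeforeConfigure = true models subscribing to an event in the View
// before the backend switches are set in the model-loading service.
static string SelectedBackend(bool touchBeforeConfigure)
{
    bool vulkanEnabled = false;

    // Stand-in for the native library: the choice is made lazily and cached.
    var native = new Lazy<string>(() => vulkanEnabled ? "Vulkan" : "CPU");

    if (touchBeforeConfigure)
        _ = native.Value;      // first touch happens too early

    vulkanEnabled = true;      // backend switch set afterwards
    return native.Value;       // returns the cached choice
}

Console.WriteLine($"event subscribed first: {SelectedBackend(true)}");
Console.WriteLine($"backends set first:     {SelectedBackend(false)}");
```

Under that assumption, the practical workaround is to set the Backends switches at application start, before any code touches StableDiffusionCpp (including event subscriptions).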

Now that the Vulkan backend is actually enabled and loaded, I get this with all models:

System.Runtime.InteropServices.SEHException (0x80004005): External component has thrown an exception.
   at StableDiffusion.NET.Native.<new_sd_ctx>g____PInvoke|8_0(Byte* __model_path_native, Byte* __clip_l_path_native, Byte* __t5xxl_path_native, Byte* __diffusion_model_path_native, Byte* __vae_path_native, Byte* __taesd_path_native, Byte* __control_net_path_c_str_native, Byte* __lora_model_dir_native, Byte* __embed_dir_c_str_native, Byte* __stacked_id_embed_dir_c_str_native, SByte __vae_decode_only_native, SByte __vae_tiling_native, SByte __free_params_immediately_native, Int32 __n_threads_native, Quantization __wtype_native, RngType __rng_type_native, Schedule __s_native, SByte __keep_clip_on_cpu_native, SByte __keep_control_net_cpu_native, SByte __keep_vae_on_cpu_native)
   at StableDiffusion.NET.Native.new_sd_ctx(String model_path, String clip_l_path, String t5xxl_path, String diffusion_model_path, String vae_path, String taesd_path, String control_net_path_c_str, String lora_model_dir, String embed_dir_c_str, String stacked_id_embed_dir_c_str, Boolean vae_decode_only, Boolean vae_tiling, Boolean free_params_immediately, Int32 n_threads, Quantization wtype, RngType rng_type, Schedule s, Boolean keep_clip_on_cpu, Boolean keep_control_net_cpu, Boolean keep_vae_on_cpu)
   at StableDiffusion.NET.DiffusionModel.Initialize()
   at StableDiffusion.NET.DiffusionModel..ctor(DiffusionModelParameter modelParameter)
   at StableDiffusion.NET.StableDiffusionModelBuilder.Build()
   at EGOIST.Application.Services.Image.ImageModelCoreService.<>c__DisplayClass25_1.<Switch>b__0() in E:\EGOIST\EGOIST.Application\Services\Image\ImageModelCoreService.cs:line 62
   at System.Threading.Tasks.Task`1.InnerInvoke()
   at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, ExecutionContext executionContext, ContextCallback callback, Object state)

with this log:

LOG [Debug]: stable-diffusion.cpp:166  - Using Vulkan backend

ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: Radeon RX 580 Series (AMD proprietary driver) | uma: 0 | fp16: 0 | warp size: 64
LOG [Info]: stable-diffusion.cpp:195  - loading model from 'C:\External\Models\Image\AbyssOrangeMix3\AbyssOrangeMix3-AOM3A1B-FP16.safetensors'

LOG [Info]: model.cpp:793  - load C:\External\Models\Image\AbyssOrangeMix3\AbyssOrangeMix3-AOM3A1B-FP16.safetensors using safetensors format

LOG [Debug]: model.cpp:861  - init from 'C:\External\Models\Image\AbyssOrangeMix3\AbyssOrangeMix3-AOM3A1B-FP16.safetensors'

LOG [Info]: stable-diffusion.cpp:235  - Version: SD 1.x 

LOG [Info]: stable-diffusion.cpp:266  - Weight type:                 f16

LOG [Info]: stable-diffusion.cpp:267  - Conditioner weight type:     f16

LOG [Info]: stable-diffusion.cpp:268  - Diffusion model weight type: f16

LOG [Info]: stable-diffusion.cpp:269  - VAE weight type:             f16

LOG [Debug]: stable-diffusion.cpp:271  - ggml tensor size = 400 bytes

LOG [Debug]: clip.hpp:171  - vocab size: 49408

LOG [Debug]: clip.hpp:182  -  trigger word img already in vocab

LOG [Debug]: ggml_extend.hpp:1050 - clip params backend buffer size =  235.06 MB(VRAM) (196 tensors)

LOG [Debug]: ggml_extend.hpp:1050 - unet params backend buffer size =  1640.25 MB(VRAM) (686 tensors)

LOG [Debug]: ggml_extend.hpp:1050 - vae params backend buffer size =  159.68 MB(VRAM) (248 tensors)

LOG [Debug]: stable-diffusion.cpp:398  - loading weights

LOG [Debug]: model.cpp:1530 - loading tensors from C:\External\Models\Image\AbyssOrangeMix3\AbyssOrangeMix3-AOM3A1B-FP16.safetensors

[15:28:23 ERR] An error occurred while switching generation model.

DarthAffe commented 1 week ago

Yeah, having the native libraries loaded as soon as the log event is subscribed is not ideal; I'll look into that.
That exception can be anything (often hardware or Windows update issues), but it could also be a memory issue, I guess.

I'm not sure whether the example application from sd.cpp falls back to different defaults when a low-VRAM card is detected.

Could you try running it as simply as possible, without loading anything extra?

Backends.VulkanBackend.IsEnabled = true;
Backends.VulkanBackend.Priority = 1000;

StableDiffusionCpp.Log += (_, args) => Console.WriteLine($"LOG [{args.Level}]: {args.Text}");
StableDiffusionCpp.Progress += (_, args) => Console.WriteLine($"PROGRESS {args.Step} / {args.Steps} ({(args.Progress * 100):N2} %) {args.IterationsPerSecond:N2} it/s ({args.Time})");

DiffusionModel model = ModelBuilder.StableDiffusion(@"C:\External\Models\Image\AbyssOrangeMix3\AbyssOrangeMix3-AOM3A1B-FP16.safetensors")
                                   .KeepClipNetOnCpu()
                                   .KeepVaeOnCpu()
                                   .WithMultithreading()
                                   .Build();
IImage<ColorRGB> image = model.TextToImage("masterpiece, a nice tree", DiffusionParameter.SD1Default);

LSXAxeller commented 1 week ago

EDIT: I am going to try your sample first; it looks like my reply was posted a few seconds after yours.

The System.Runtime.InteropServices.SEHException (0x80004005): External component has thrown an exception. was an IDE-related issue; I just restarted it and the model loaded successfully. Now it crashes during txt2img generation, when finishing the last step, at return __PInvoke(sd_ctx, ptr, ptr2, clip_skip, cfg_scale, guidance, width, height, sample_method, sample_steps, seed, batch_count, control_cond, control_strength, style_strength, b, ptr3); in Native.cs

[LibraryImport("stable-diffusion", EntryPoint = "txt2img")]
[GeneratedCode("Microsoft.Interop.LibraryImportGenerator", "8.0.10.36612")]
[SkipLocalsInit]
internal unsafe static sd_image_t* txt2img(sd_ctx_t* sd_ctx, [MarshalAs(UnmanagedType.LPStr)] string prompt, [MarshalAs(UnmanagedType.LPStr)] string negative_prompt, int clip_skip, float cfg_scale, float guidance, int width, int height, Sampler sample_method, int sample_steps, long seed, int batch_count, sd_image_t* control_cond, float control_strength, float style_strength, [MarshalAs(UnmanagedType.I1)] bool normalize_input, [MarshalAs(UnmanagedType.LPStr)] string input_id_images_path)
{
    byte* ptr = default(byte*);
    byte* ptr2 = default(byte*);
    sbyte b = 0;
    byte* ptr3 = default(byte*);
    sd_image_t* ptr4 = default(sd_image_t*);
    AnsiStringMarshaller.ManagedToUnmanagedIn managedToUnmanagedIn = default(AnsiStringMarshaller.ManagedToUnmanagedIn);
    AnsiStringMarshaller.ManagedToUnmanagedIn managedToUnmanagedIn2 = default(AnsiStringMarshaller.ManagedToUnmanagedIn);
    AnsiStringMarshaller.ManagedToUnmanagedIn managedToUnmanagedIn3 = default(AnsiStringMarshaller.ManagedToUnmanagedIn);
    try
    {
        string managed = input_id_images_path;
        Span<byte> buffer = stackalloc byte[AnsiStringMarshaller.ManagedToUnmanagedIn.BufferSize];
        managedToUnmanagedIn.FromManaged(managed, buffer);
        b = (normalize_input ? ((sbyte)1) : ((sbyte)0));
        managed = negative_prompt;
        buffer = stackalloc byte[AnsiStringMarshaller.ManagedToUnmanagedIn.BufferSize];
        managedToUnmanagedIn2.FromManaged(managed, buffer);
        managed = prompt;
        buffer = stackalloc byte[AnsiStringMarshaller.ManagedToUnmanagedIn.BufferSize];
        managedToUnmanagedIn3.FromManaged(managed, buffer);
        ptr3 = managedToUnmanagedIn.ToUnmanaged();
        ptr2 = managedToUnmanagedIn2.ToUnmanaged();
        ptr = managedToUnmanagedIn3.ToUnmanaged();
        return __PInvoke(sd_ctx, ptr, ptr2, clip_skip, cfg_scale, guidance, width, height, sample_method, sample_steps, seed, batch_count, control_cond, control_strength, style_strength, b, ptr3);
    }
    finally
    {
        managedToUnmanagedIn.Free();
        managedToUnmanagedIn2.Free();
        managedToUnmanagedIn3.Free();
    }
    [DllImport("stable-diffusion", EntryPoint = "txt2img", ExactSpelling = true)]
    static extern unsafe sd_image_t* __PInvoke(sd_ctx_t* __sd_ctx_native, byte* __prompt_native, byte* __negative_prompt_native, int __clip_skip_native, float __cfg_scale_native, float __guidance_native, int __width_native, int __height_native, Sampler __sample_method_native, int __sample_steps_native, long __seed_native, int __batch_count_native, sd_image_t* __control_cond_native, float __control_strength_native, float __style_strength_native, sbyte __normalize_input_native, byte* __input_id_images_path_native);
}

with error

Exception thrown at 0x00007FFEB63E42A4 (stable-diffusion.dll) in EGOIST UI.exe: 0xC0000005: Access violation reading location 0x0000000000000048.
System.AccessViolationException: 'Attempted to read or write protected memory. This is often an indication that other memory is corrupt.'

Log

'EGOIST UI.exe' (Win32): Loaded 'C:\Windows\System32\Windows.StateRepositoryCore.dll'. 
'EGOIST UI.exe' (Win32): Loaded 'C:\Windows\System32\DriverStore\FileRepository\u0399660.inf_amd64_d7fa3539ce499e50\B399655\amdvlk64.dll'. 
'EGOIST UI.exe' (Win32): Loaded 'C:\Windows\System32\amdihk64.dll'. Module was built without symbols.
'EGOIST UI.exe' (Win32): Loaded 'C:\Program Files (x86)\RivaTuner Statistics Server\Vulkan\RTSSVkLayer64.dll'. Module was built without symbols.
'EGOIST UI.exe' (Win32): Loaded 'C:\ProgramData\obs-studio-hook\graphics-hook64.dll'. 
[OBS] OBS_CreateDevice: could not get device address for vkQueuePresentKHR
[OBS] OBS_CreateDevice: could not get device address for vkGetSwapchainImagesKHR
[OBS] graphics-hook.dll loaded against process: EGOIST UI.exe
[OBS] (half life scientist) everything..  seems to be in order
'EGOIST UI.exe' (Win32): Loaded 'C:\Program Files\dotnet\shared\Microsoft.NETCore.App\8.0.3\System.Runtime.CompilerServices.Unsafe.dll'. 
'EGOIST UI.exe' (CoreCLR: clrhost): Loaded 'C:\Program Files\dotnet\shared\Microsoft.NETCore.App\8.0.3\System.Runtime.CompilerServices.Unsafe.dll'. 
Exception thrown at 0x00007FFF6CD8FABC in EGOIST UI.exe: Microsoft C++ exception: vk::OutOfDeviceMemoryError at memory location 0x000000C6B43FDE40.
Exception thrown at 0x00007FFF6CD8FABC in EGOIST UI.exe: Microsoft C++ exception: vk::SystemError at memory location 0x000000C6B43FDF80.
Exception thrown at 0x00007FFF6CD8FABC in EGOIST UI.exe: Microsoft C++ exception: vk::SystemError at memory location 0x000000C6B43FE280.
Exception thrown at 0x00007FFEB63E42A4 (stable-diffusion.dll) in EGOIST UI.exe: 0xC0000005: Access violation reading location 0x0000000000000048.

The thread '[Thread Destroyed]' (3836) has exited with code 0 (0x0).
An unhandled exception of type 'System.AccessViolationException' occurred in StableDiffusion.NET.dll
Attempted to read or write protected memory. This is often an indication that other memory is corrupt.

The thread '.NET Finalizer' (3740) has exited with code 3221225477 (0xc0000005).
The thread 14744 has exited with code 3221225477 (0xc0000005).
The program '[8456] EGOIST UI.exe' has exited with code 3221225477 (0xc0000005) 'Access violation'.

This happens while I still have 1GB of memory free during generation, and the stable-diffusion binary never triggers an out-of-memory error, even at high step counts.

DarthAffe commented 1 week ago

The access violation is at least somewhat expected; I get that too with some models when I do not run the VAE on the CPU (and the same happens with the sd.cpp application). There seems to be an issue in the backend.

LSXAxeller commented 1 week ago

It actually worked, although I don't know why an out-of-memory error was raised while decoding 1 latent when there was still free memory. After moving the VAE and CLIP to the CPU it worked fine. Thanks for all your efforts.

LOG [Debug]: stable-diffusion.cpp:166  - Using Vulkan backend

ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: Radeon RX 580 Series (AMD proprietary driver) | uma: 0 | fp16: 0 | warp size: 64
LOG [Info]: stable-diffusion.cpp:195  - loading model from 'C:\External\Models\Image\AbyssOrangeMix3\AbyssOrangeMix3-AOM3A1B-FP16.safetensors'

LOG [Info]: model.cpp:793  - load C:\External\Models\Image\AbyssOrangeMix3\AbyssOrangeMix3-AOM3A1B-FP16.safetensors using safetensors format

LOG [Debug]: model.cpp:861  - init from 'C:\External\Models\Image\AbyssOrangeMix3\AbyssOrangeMix3-AOM3A1B-FP16.safetensors'

LOG [Info]: stable-diffusion.cpp:235  - Version: SD 1.x 

LOG [Info]: stable-diffusion.cpp:266  - Weight type:                 f16

LOG [Info]: stable-diffusion.cpp:267  - Conditioner weight type:     f16

LOG [Info]: stable-diffusion.cpp:268  - Diffusion model weight type: f16

LOG [Info]: stable-diffusion.cpp:269  - VAE weight type:             f16

LOG [Debug]: stable-diffusion.cpp:271  - ggml tensor size = 400 bytes

LOG [Info]: stable-diffusion.cpp:313  - CLIP: Using CPU backend

LOG [Debug]: clip.hpp:171  - vocab size: 49408

LOG [Debug]: clip.hpp:182  -  trigger word img already in vocab

LOG [Debug]: ggml_extend.hpp:1050 - clip params backend buffer size =  235.06 MB(RAM) (196 tensors)

LOG [Debug]: ggml_extend.hpp:1050 - unet params backend buffer size =  1640.25 MB(VRAM) (686 tensors)

LOG [Info]: stable-diffusion.cpp:334  - VAE Autoencoder: Using CPU backend

LOG [Debug]: ggml_extend.hpp:1050 - vae params backend buffer size =  159.68 MB(RAM) (248 tensors)

LOG [Debug]: stable-diffusion.cpp:398  - loading weights

LOG [Debug]: model.cpp:1530 - loading tensors from C:\External\Models\Image\AbyssOrangeMix3\AbyssOrangeMix3-AOM3A1B-FP16.safetensors

LOG [Info]: stable-diffusion.cpp:497  - total params memory size = 2035.00MB (VRAM 1640.25MB, RAM 394.75MB): clip 235.06MB(RAM), unet 1640.25MB(VRAM), vae 159.68MB(RAM), controlnet 0.00MB(VRAM), pmid 0.00MB(RAM)

LOG [Info]: stable-diffusion.cpp:501  - loading model from 'C:\External\Models\Image\AbyssOrangeMix3\AbyssOrangeMix3-AOM3A1B-FP16.safetensors' completed, taking 4.89s

LOG [Info]: stable-diffusion.cpp:528  - running in eps-prediction mode

LOG [Debug]: stable-diffusion.cpp:572  - finished loaded file

LOG [Debug]: stable-diffusion.cpp:1378 - txt2img 512x512

LOG [Debug]: stable-diffusion.cpp:1127 - prompt after extract and remove lora: "(best quality, masterpiece, colorful, highest detailed) upper body photo, fashion photography of cute (Rei Ayanami), intense-red-eyes, in high detailed textured Evangelion white plugsuit, (light smile:0.3), red moonlight passing through deep black hair, (night beautiful background:1.3), (intricate details), (dynamic angle)"

LOG [Info]: stable-diffusion.cpp:655  - Attempting to apply 0 LoRAs

LOG [Info]: stable-diffusion.cpp:1132 - apply_loras completed, taking 0.00s

LOG [Debug]: conditioner.hpp:325  - parse '(best quality, masterpiece, colorful, highest detailed) upper body photo, fashion photography of cute (Rei Ayanami), intense-red-eyes, in high detailed textured Evangelion white plugsuit, (light smile:0.3), red moonlight passing through deep black hair, (night beautiful background:1.3), (intricate details), (dynamic angle)' to [['best quality, masterpiece, colorful, highest detailed', 1.1], [' upper body photo, fashion photography of cute ', 1], ['Rei Ayanami', 1.1], [', intense-red-eyes, in high detailed textured Evangelion white plugsuit, ', 1], ['light smile', 0.3], [', red moonlight passing through deep black hair, ', 1], ['night beautiful background', 1.3], [', ', 1], ['intricate details', 1.1], [', ', 1], ['dynamic angle', 1.1], ]

LOG [Debug]: clip.hpp:311  - token length: 77

LOG [Debug]: ggml_extend.hpp:1001 - clip compute buffer size: 1.40 MB(RAM)

LOG [Debug]: conditioner.hpp:453  - computing condition graph completed, taking 140 ms

LOG [Debug]: conditioner.hpp:325  - parse 'CyberRealistic_Negative_Anime-neg, moles, (grayscale:1.4), fat, ugly, bad_hands, freckles, wings' to [['CyberRealistic_Negative_Anime-neg, moles, ', 1], ['grayscale', 1.4], [', fat, ugly, bad_hands, freckles, wings', 1], ]

LOG [Debug]: clip.hpp:311  - token length: 77

LOG [Debug]: ggml_extend.hpp:1001 - clip compute buffer size: 1.40 MB(RAM)

LOG [Debug]: conditioner.hpp:453  - computing condition graph completed, taking 113 ms

LOG [Info]: stable-diffusion.cpp:1256 - get_learned_condition completed, taking 290 ms

LOG [Info]: stable-diffusion.cpp:1279 - sampling using Euler method

LOG [Info]: stable-diffusion.cpp:1283 - generating image: 1/1 - seed 299270658612420022

LOG [Debug]: ggml_extend.hpp:1001 - unet compute buffer size: 559.90 MB(VRAM)

LOG [Info]: stable-diffusion.cpp:1315 - sampling completed, taking 45.61s

LOG [Info]: stable-diffusion.cpp:1323 - generating 1 latent images completed, taking 45.61s

LOG [Info]: stable-diffusion.cpp:1326 - decoding 1 latents

LOG [Debug]: ggml_extend.hpp:1001 - vae compute buffer size: 1664.00 MB(RAM)

LOG [Debug]: stable-diffusion.cpp:987  - computing vae [mode: DECODE] graph completed, taking 27.53s

LOG [Info]: stable-diffusion.cpp:1336 - latent 1 decoded, taking 27.53s

LOG [Info]: stable-diffusion.cpp:1340 - decode_first_stage completed, taking 27.53s

LOG [Info]: stable-diffusion.cpp:1449 - txt2img completed in 73.43s
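As a rough cross-check of the buffer sizes reported in the logs, the sketch below adds them up for a 4GB card. It rests on two assumptions the thread does not confirm: that the VAE compute buffer would be about the same size in VRAM as the 1664.00 MB reported in RAM, and that all buffers are resident at the same time. Under those assumptions the all-on-GPU total comes to about 4259 MB, above the nominal 4096 MB, while keeping VAE and CLIP on the CPU leaves about 2200 MB in VRAM. This is only one possible reading of the vk::OutOfDeviceMemoryError in the debugger output.

```csharp
using System;
using System.Globalization;

// Sizes in MB, copied from the logs in this thread.
const double ClipWeights = 235.06;
const double UnetWeights = 1640.25;
const double VaeWeights = 159.68;
const double UnetCompute = 559.90;
const double VaeCompute = 1664.00;  // reported in RAM above; assumed similar in VRAM
const double CardVram = 4096.0;     // nominal 4GB, ignoring the desktop and other apps

// Adds up a set of buffers and returns the total in MB.
static double TotalMb(params double[] buffers)
{
    double total = 0;
    foreach (double size in buffers) total += size;
    return total;
}

// Worst case: every weight and compute buffer lives in VRAM at once.
double everythingOnGpu = TotalMb(ClipWeights, UnetWeights, VaeWeights, UnetCompute, VaeCompute);

// Configuration that worked: only the unet stays on the GPU.
double vaeAndClipOnCpu = TotalMb(UnetWeights, UnetCompute);

string Format(double mb) => mb.ToString("F2", CultureInfo.InvariantCulture);
Console.WriteLine($"all on GPU:      {Format(everythingOnGpu)} MB of {Format(CardVram)} MB");
Console.WriteLine($"VAE/CLIP on CPU: {Format(vaeAndClipOnCpu)} MB of {Format(CardVram)} MB");
```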

Just one last question, not related to this issue but out of curiosity: which ControlNet models are supported by the core library, and how do I load ControlNets, LoRAs, and a custom VAE? Should I provide a direct path to the desired model as a parameter, or just the model's parent directory and then specify the model name in the generation prompt? I used the AUTOMATIC1111 Web UI a year ago and this library is my first attempt at using Stable Diffusion since then, so I got confused when I found parameters called xxxDirectory and xxxPath.

DarthAffe commented 1 week ago

it actually worked, although I don't know why Out of Memory was raised on decoding 1 latent step while there was still free memory, but after moving the VAE and clip to CPU it worked fine,

This is not an out-of-memory exception; it's an access violation (a pointer pointing to an invalid memory location). This needs to be fixed in stable-diffusion.cpp.

As for the question about ControlNets etc.: that depends on the type. It's best to check the docs over at stable-diffusion.cpp. In short: VAEs and ControlNets are loaded as models (by providing a path to the model file); for LoRAs and embeddings you need to provide the directory where they are located and then load them in the prompt (this should work the same as with AUTOMATIC1111).
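To make the path-versus-directory distinction concrete, here is a small sketch of building a prompt with a LoRA tag. WithLora is a hypothetical helper and "myStyle" a made-up file name; the <lora:name:weight> tag form is the one stable-diffusion.cpp documents for prompts, and the builder calls in the comments are the ones shown earlier in this thread.

```csharp
using System;
using System.Globalization;

// Builder calls seen earlier in this thread, with what each argument points at
// according to the answer above:
//   .WithVae(vaePath)               -> path to a model file
//   .WithControlNet(controlNetPath) -> path to a model file
//   .WithLoraSupport(loraPath)      -> directory containing the LoRA files

// Hypothetical helper: appends a LoRA tag in the <lora:name:weight> form.
// The name is the LoRA file name without its extension.
static string WithLora(string prompt, string loraName, double weight)
{
    string weightText = weight.ToString("0.##", CultureInfo.InvariantCulture);
    return $"{prompt} <lora:{loraName}:{weightText}>";
}

// "myStyle" stands for a hypothetical file myStyle.safetensors in the LoRA directory.
string prompt = WithLora("masterpiece, a nice tree", "myStyle", 0.8);
Console.WriteLine(prompt); // masterpiece, a nice tree <lora:myStyle:0.8>
```

The resulting string would then be passed as the prompt to TextToImage, as in the sample earlier in the thread.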

LSXAxeller commented 1 week ago

Thanks for the help, I am going to close the issue now.

DarthAffe / StableDiffusion.NET

Model loads to RAM and infers on CPU despite enabling Vulkan backend #30