LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Not able to run 32k context on v1.72. #1054

Closed BrunoCelestino5 closed 2 months ago

BrunoCelestino5 commented 3 months ago

Version 1.71 used to work perfectly for me with Llama 3.1 8B at 32k context and 10 GPU layers, but now, right after updating, it doesn't work even with 1 layer. I went back and tested the same model on version 1.71, and it was indeed still working perfectly.

I'm running it on Arch Linux with an RX 5600 XT and Vulkan.

LostRuins commented 3 months ago

What error are you getting?

BrunoCelestino5 commented 3 months ago

```
Automatic RoPE Scaling: Using (scale:1.000, base:6315084.5).
llama_new_context_with_model: n_ctx = 32864
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 6315084.5
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: AMD Radeon RX 5600 XT KV buffer size = 128.38 MiB
ggml_vulkan: Failed to allocate pinned memory.
ggml_vulkan: Requested buffer size exceeds device memory allocation limit: ErrorOutOfDeviceMemory
llama_kv_cache_init: CPU KV buffer size = 3979.63 MiB
llama_new_context_with_model: KV self size = 4108.00 MiB, K (f16): 2054.00 MiB, V (f16): 2054.00 MiB
llama_new_context_with_model: Vulkan_Host output buffer size = 0.49 MiB
ggml_vulkan: Device memory allocation of size 2384005120 failed.
ggml_vulkan: Requested buffer size exceeds device memory allocation limit: ErrorOutOfDeviceMemory
ggml_gallocr_reserve_n: failed to allocate AMD Radeon RX 5600 XT buffer of size 2384005120
llama_new_context_with_model: failed to allocate compute buffers
gpttype_load_model: error: failed to load model '/home/celes/Models/L3-8B-Stheno-v3.2-abliterated.i1-Q5_K_M.gguf'
Load Text Model OK: False
```

LostRuins commented 3 months ago

Can you share the loading console log from 1.71 too? Also the launcher flags that you used.

LostRuins commented 3 months ago

Alright, I see that code was actually added by @0cc4m in https://github.com/0cc4m/ggml/commit/c2c163e3290a797cb2d08106d7cca157d0811850 just recently, so I am guessing it's probably not estimating correctly on your system, considering it worked in the previous version.
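
For readers following along: the limit in that error message maps to Vulkan's per-allocation cap, `maxMemoryAllocationSize` from `VkPhysicalDeviceMaintenance3Properties` (core in Vulkan 1.1 / VK_KHR_maintenance3), and the new check compares each requested buffer against a value of that kind. The snippet below is only a rough sketch of such a check, not the actual ggml code; the helper names `query_max_allocation_size` and `check_buffer_request` are made up for illustration.

```cpp
#include <vulkan/vulkan.h>
#include <cstdio>

// Ask the driver for its per-allocation cap. This is the value vulkaninfo
// reports as maxMemoryAllocationSize (requires a Vulkan >= 1.1 instance).
static VkDeviceSize query_max_allocation_size(VkPhysicalDevice phys) {
    VkPhysicalDeviceMaintenance3Properties maint3 = {};
    maint3.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MAINTENANCE_3_PROPERTIES;

    VkPhysicalDeviceProperties2 props2 = {};
    props2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2;
    props2.pNext = &maint3;

    vkGetPhysicalDeviceProperties2(phys, &props2);
    return maint3.maxMemoryAllocationSize;
}

// Hypothetical helper: reject a buffer request larger than the reported cap.
// Treating this as fatal is what aborts the 32k-context load here; the same
// check could instead just warn and proceed.
static bool check_buffer_request(VkPhysicalDevice phys, VkDeviceSize requested) {
    const VkDeviceSize max_alloc = query_max_allocation_size(phys);
    if (requested > max_alloc) {
        std::fprintf(stderr,
                     "Requested buffer size %llu exceeds device memory allocation limit %llu\n",
                     (unsigned long long)requested, (unsigned long long)max_alloc);
        return false;
    }
    return true;
}
```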

LostRuins commented 3 months ago

Also, could you share the result of running vulkaninfo in the terminal?

BrunoCelestino5 commented 3 months ago

```
========== VULKANINFO

Vulkan Instance Version: 1.3.279

Instance Extensions: count = 24
    VK_EXT_acquire_drm_display             : extension revision 1
    VK_EXT_acquire_xlib_display            : extension revision 1
    VK_EXT_debug_report                    : extension revision 10
    VK_EXT_debug_utils                     : extension revision 2
    VK_EXT_direct_mode_display             : extension revision 1
    VK_EXT_display_surface_counter         : extension revision 1
    VK_EXT_headless_surface                : extension revision 1
    VK_EXT_surface_maintenance1            : extension revision 1
    VK_EXT_swapchain_colorspace            : extension revision 4
    VK_KHR_device_group_creation           : extension revision 1
    VK_KHR_display                         : extension revision 23
    VK_KHR_external_fence_capabilities     : extension revision 1
    VK_KHR_external_memory_capabilities    : extension revision 1
    VK_KHR_external_semaphore_capabilities : extension revision 1
    VK_KHR_get_display_properties2         : extension revision 1
    VK_KHR_get_physical_device_properties2 : extension revision 2
    VK_KHR_get_surface_capabilities2       : extension revision 1
    VK_KHR_portability_enumeration         : extension revision 1
    VK_KHR_surface                         : extension revision 25
    VK_KHR_surface_protected_capabilities  : extension revision 1
    VK_KHR_wayland_surface                 : extension revision 6
    VK_KHR_xcb_surface                     : extension revision 6
    VK_KHR_xlib_surface                    : extension revision 6
    VK_LUNARG_direct_driver_loading        : extension revision 1

Instance Layers: count = 6
    VK_LAYER_AMD_switchable_graphics_32  AMD switchable graphics layer  1.3.287  version 1
    VK_LAYER_AMD_switchable_graphics_64  AMD switchable graphics layer  1.3.287  version 1
    VK_LAYER_VALVE_steam_fossilize_32    Steam Pipeline Caching Layer   1.3.207  version 1
    VK_LAYER_VALVE_steam_fossilize_64    Steam Pipeline Caching Layer   1.3.207  version 1
    VK_LAYER_VALVE_steam_overlay_32      Steam Overlay Layer            1.3.207  version 1
    VK_LAYER_VALVE_steam_overlay_64      Steam Overlay Layer            1.3.207  version 1

Devices:
    GPU0:
        apiVersion         = 1.3.287
        driverVersion      = 2.0.310
        vendorID           = 0x1002
        deviceID           = 0x731f
        deviceType         = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
        deviceName         = AMD Radeon RX 5600 XT
        driverID           = DRIVER_ID_AMD_OPEN_SOURCE
        driverName         = AMD open-source driver
        driverInfo         = 2024.Q2.3 (LLPC)
        conformanceVersion = 1.3.5.2
        deviceUUID         = 00000000-0b00-0000-0000-000000000000
        driverUUID         = 414d442d-4c49-4e55-582d-445256000000
    GPU1:
        apiVersion         = 1.3.280
        driverVersion      = 2.0.302
        vendorID           = 0x1002
        deviceID           = 0x731f
        deviceType         = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
        deviceName         = AMD Radeon RX 5600 XT
        driverID           = DRIVER_ID_AMD_PROPRIETARY
        driverName         = AMD proprietary driver
        driverInfo         = (AMD proprietary shader compiler)
        conformanceVersion = 1.3.5.2
        deviceUUID         = 00000000-0b00-0000-0000-000000000000
        driverUUID         = 414d442d-4c49-4e55-582d-445256000000
```

And just to be sure, this is running on kobold 1.71

```
Automatic RoPE Scaling: Using (scale:1.000, base:6315084.5).
llama_new_context_with_model: n_ctx = 32864
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 6315084.5
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: AMD Radeon RX 5600 XT KV buffer size = 1283.75 MiB
llama_kv_cache_init: Vulkan_Host KV buffer size = 2824.25 MiB
llama_new_context_with_model: KV self size = 4108.00 MiB, K (f16): 2054.00 MiB, V (f16): 2054.00 MiB
llama_new_context_with_model: Vulkan_Host output buffer size = 0.49 MiB
llama_new_context_with_model: AMD Radeon RX 5600 XT compute buffer size = 2273.56 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size = 72.19 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 246
Load Text Model OK: True
Embedded KoboldAI Lite loaded.
Embedded API docs loaded.
Starting Kobold API on port 5001 at http://localhost:5001/api/
Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/
```

henk717 commented 3 months ago

Occam mentioned this is a limitation of the AMDVLK driver; he recommends using RADV instead.

BrunoCelestino5 commented 3 months ago

I actually tried that too, installing the Mesa version, and I tried it again just now, but I get the same error. And why would it affect version 1.72 and not 1.71? Both drivers work perfectly with the old version.

LostRuins commented 3 months ago

An additional check for the max_memory_allocation_size was recently added. I'll change that check from a fatal error to a warning, since it seems that some drivers are not reporting the value correctly.

0cc4m commented 3 months ago

An additional check for the max_memory_allocation_size was recently added. I'll change that check from a fatal error to a warning, since it seems that some drivers are not reporting the value correctly.

It's a fatal error because allocating more than is allowed will otherwise go through, but lead to corruption in the results. That happened in Stable Diffusion and that's why I added the check.

Since this is Linux @BrunoCelestino5 should probably uninstall vulkan-amdgpu-pro and amdvlk and use radv instead, where this issue doesn't occur.

LostRuins commented 3 months ago

@0cc4m based on what vulkaninfo is returning above, would it be possible to detect the driver difference and perhaps apply a scale/replacement for this value? Or is AMDVLK just a bad driver to try to support?
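
For reference, the driverID / driverName fields in the vulkaninfo output above come from `VkPhysicalDeviceDriverProperties` (core in Vulkan 1.2, otherwise VK_KHR_driver_properties), so telling AMDVLK (`DRIVER_ID_AMD_OPEN_SOURCE`) apart from RADV (`DRIVER_ID_MESA_RADV`) at runtime is possible in principle. Below is a minimal standalone sketch of that query; it is not koboldcpp/ggml code and makes no claim about whether special-casing AMDVLK is actually worth doing.

```cpp
#include <vulkan/vulkan.h>
#include <cstdio>
#include <vector>

// List each Vulkan device with its driver identity, the same fields vulkaninfo
// prints (DRIVER_ID_AMD_OPEN_SOURCE = AMDVLK, DRIVER_ID_MESA_RADV = RADV,
// DRIVER_ID_AMD_PROPRIETARY = the AMDGPU-PRO stack).
int main() {
    VkApplicationInfo app = { VK_STRUCTURE_TYPE_APPLICATION_INFO };
    app.apiVersion = VK_API_VERSION_1_2;  // driver properties are core in 1.2

    VkInstanceCreateInfo ci = { VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO };
    ci.pApplicationInfo = &app;

    VkInstance instance = VK_NULL_HANDLE;
    if (vkCreateInstance(&ci, nullptr, &instance) != VK_SUCCESS) {
        std::fprintf(stderr, "failed to create Vulkan instance\n");
        return 1;
    }

    uint32_t count = 0;
    vkEnumeratePhysicalDevices(instance, &count, nullptr);
    std::vector<VkPhysicalDevice> devices(count);
    vkEnumeratePhysicalDevices(instance, &count, devices.data());

    for (VkPhysicalDevice dev : devices) {
        VkPhysicalDeviceDriverProperties drv = {};
        drv.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_DRIVER_PROPERTIES;

        VkPhysicalDeviceProperties2 props2 = {};
        props2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2;
        props2.pNext = &drv;

        vkGetPhysicalDeviceProperties2(dev, &props2);
        std::printf("%s: driverID=%d, driverName=%s, driverInfo=%s\n",
                    props2.properties.deviceName, (int)drv.driverID,
                    drv.driverName, drv.driverInfo);
    }

    vkDestroyInstance(instance, nullptr);
    return 0;
}
```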

LostRuins commented 3 months ago

@BrunoCelestino5 when you previously used 1.71, were all your responses coherent?

BrunoCelestino5 commented 3 months ago

Well, no, I don't think so. Using RADV, I ran a couple of tests here, v1.71 with 32k context and v1.72 with 8k, both with a model designed for 32k. With smaller prompts, v1.71 gave me somewhat worse responses than v1.72; many of them didn't really answer what I asked or complete what I proposed. Both eventually lost coherence, but v1.71 more so.

Bigger prompts on v1.71 with 32k context gave me really bad responses, though I don't know whether that's to be expected from the model or not. I tested the same with a Llama 3 8B model designed for 8k context and the same thing happened, but I guess that would be expected, since I'm giving 32k of context to a model designed for 8k.

Could this be related, then?

LostRuins commented 3 months ago

When @0cc4m talks about memory corruption, they mean total rubbish being returned, not "kinda bad" responses. If the response is a sensible continuation of the prompt in a valid language, that counts as coherent.

BrunoCelestino5 commented 3 months ago

Alright, so no really incoherent answers then

LostRuins commented 2 months ago

Can you give 1.73 a try and see if it works? Also check if the output is valid.

BrunoCelestino5 commented 2 months ago

Alright, it's working fine now on 1.73; I'm just getting the 'Requested buffer size exceeds device memory allocation limit!' warning. Thanks a lot!

LostRuins commented 2 months ago

No problem, though like Occam said, perhaps try swapping your drivers.