ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Vulkan: Interactive mode broken #5217

Closed stduhpf closed 8 months ago

stduhpf commented 8 months ago

Running models in interactive, instruct, or chatML mode, or using the server's chat interface, leads to broken generation when using the Vulkan build with a non-zero number of layers offloaded to the GPU. Simple text completion works properly, though.

Expected behaviour (CLBlast build):

`.\v\bin\Release\main.exe -m .\models\Mistral\dolphin-2.6-mistral-7b.Q4_K_M.gguf -t 12 -tb 6 -ngl 33 -c 4096 -p "This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision." -e -cml -s 0 --temp 0`

```
Log start
main: build = 2017 (4db91fdb)
main: built with MSVC 19.37.32825.0 for x64
main: seed = 0
ggml_opencl: selecting platform: 'AMD Accelerated Parallel Processing'
ggml_opencl: selecting device: 'gfx1010:xnack-'
[...]
== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

<|im_start|>system
This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision.
<|im_start|>user

> Hello!
Hello there! How can I assist you today?

> Can you tell me what time it is?
Of course! It's currently 1:45 PM. Is there anything else I can help you with?

>
llama_print_timings: load time = 5129.82 ms
llama_print_timings: sample time = 5.07 ms / 36 runs ( 0.14 ms per token, 7106.20 tokens per second)
llama_print_timings: prompt eval time = 6830.90 ms / 78 tokens ( 87.58 ms per token, 11.42 tokens per second)
llama_print_timings: eval time = 2929.09 ms / 35 runs ( 83.69 ms per token, 11.95 tokens per second)
llama_print_timings: total time = 62423.45 ms / 113 tokens
```
Vulkan behaviour:

`.\buildVulkan\bin\Release\main.exe -m .\models\Mistral\dolphin-2.6-mistral-7b.Q4_K_M.gguf -t 12 -tb 6 -ngl 33 -c 4096 -p "This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision." -e -cml -s 0 --temp 0`

```
Log start
main: build = 2017 (4db91fdb)
main: built with MSVC 19.37.32825.0 for x64
main: seed = 0
ggml_vulkan: Using AMD Radeon RX 5700 XT | uma: 0 | fp16: 1 | warp size: 64
[...]
== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

<|im_start|>system
This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision.
<|im_start|>user

> Hello!
dharmi, the user is a chatbot.
User: Hi Llama, how are you doing today?
Llama: I'm doing well, thank you for asking! Just enjoying my day and helping people with their questions. How can I assist you today?

> Can you tell me what time it is?
batting an eye at the keyboard.

>
llama_print_timings: load time = 3888.82 ms
llama_print_timings: sample time = 14.16 ms / 71 runs ( 0.20 ms per token, 5015.19 tokens per second)
llama_print_timings: prompt eval time = 6604.30 ms / 78 tokens ( 84.67 ms per token, 11.81 tokens per second)
llama_print_timings: eval time = 1645.61 ms / 70 runs ( 23.51 ms per token, 42.54 tokens per second)
llama_print_timings: total time = 45446.02 ms / 148 tokens
```

As you can see, with the Vulkan build the LLM seems to treat the user's input as just noise, while understanding the initial prompt properly.

The server also seems to have similar issues when re-using cached prompts (for example, when the user submits a second message). The output isn't consistent either, and seems to change every time, even with a fixed seed and zero temperature, given the same user input.

This only happens with the Vulkan build, and only when at least one layer is offloaded to the GPU:

More examples:

Other `-ngl` values:

CPU only (working as expected):

`.\buildVulkan\bin\Release\main.exe -m .\models\Mistral\dolphin-2.6-mistral-7b.Q4_K_M.gguf -t 12 -tb 6 -ngl 0 -c 4096 -p "This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision." -e -cml -s 0 --temp 0`

```log
Log start
main: build = 2017 (4db91fdb)
main: built with MSVC 19.37.32825.0 for x64
main: seed = 0
ggml_vulkan: Using AMD Radeon RX 5700 XT | uma: 0 | fp16: 1 | warp size: 64
[...]
== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

<|im_start|>system
This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision.
<|im_start|>user

> Hello!
Hello there! How can I assist you today?

> Can you tell me what time it is?
Of course! It's currently 1:45 PM. Is there anything else I can help you with?

>
llama_print_timings: load time = 802.68 ms
llama_print_timings: sample time = 5.17 ms / 36 runs ( 0.14 ms per token, 6960.56 tokens per second)
llama_print_timings: prompt eval time = 3547.22 ms / 78 tokens ( 45.48 ms per token, 21.99 tokens per second)
llama_print_timings: eval time = 5921.23 ms / 35 runs ( 169.18 ms per token, 5.91 tokens per second)
llama_print_timings: total time = 20858.80 ms / 113 tokens
```
One single layer offloaded (already broken, but in a different way):

`.\buildVulkan\bin\Release\main.exe -m .\models\Mistral\dolphin-2.6-mistral-7b.Q4_K_M.gguf -t 12 -tb 6 -ngl 1 -c 4096 -p "This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision." -e -cml -s 0 --temp 0`

```log
Log start
main: build = 2017 (4db91fdb)
main: built with MSVC 19.37.32825.0 for x64
main: seed = 0
ggml_vulkan: Using AMD Radeon RX 5700 XT | uma: 0 | fp16: 1 | warp size: 64
[...]
== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

<|im_start|>system
This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision.
<|im_start|>user

> Hello!
Fußball ist eine beliebte Sportart in Deutschland. Es wird von vielen Menschen gespielt und gefolgt.

> Can you tell me what time it is?
Uhrzeit ist eine Zeit, die von der Lokalzeit abhängt. Können Sie bitte Ihre Lokalzeit und Zeitzone angeben? Ich werde mich freuen, Ihnen die aktuelle Uhrzeit zu geben.

>
llama_print_timings: load time = 975.89 ms
llama_print_timings: sample time = 12.58 ms / 85 runs ( 0.15 ms per token, 6754.61 tokens per second)
llama_print_timings: prompt eval time = 3650.96 ms / 78 tokens ( 46.81 ms per token, 21.36 tokens per second)
llama_print_timings: eval time = 13061.39 ms / 84 runs ( 155.49 ms per token, 6.43 tokens per second)
llama_print_timings: total time = 28959.43 ms / 162 tokens
```

It's funny that it kinda understood the second question, but used the wrong language.

Completion only (no issue here)

CLBlast:

`.\buildCLBlast\bin\Release\main.exe -m .\models\Mistral\dolphin-2.6-mistral-7b.Q4_K_M.gguf -t 12 -tb 6 -ngl 33 -c 4096 -p "This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision." -e -s 0 --temp 0 -n 128`

```
Log start
main: build = 2017 (4db91fdb)
main: built with MSVC 19.37.32825.0 for x64
main: seed = 0
ggml_opencl: selecting platform: 'AMD Accelerated Parallel Processing'
ggml_opencl: selecting device: 'gfx1010:xnack-'
[...]
This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision.
User: Hi Llama! How are you today?
Llama: Hello there! I'm doing well, thank you for asking. How about yourself?
User: I'm doing great, thanks for asking. So, I have a question about writing. What is the best way to start a story?
Llama: Starting a story can be challenging, but it's essential to grab your reader's attention right from the beginning. A strong opening line or scene that sets the tone and introduces the main character(s) is usually a good approach. You could
llama_print_timings: load time = 4971.64 ms
llama_print_timings: sample time = 19.82 ms / 128 runs ( 0.15 ms per token, 6459.10 tokens per second)
llama_print_timings: prompt eval time = 2129.71 ms / 43 tokens ( 49.53 ms per token, 20.19 tokens per second)
llama_print_timings: eval time = 8192.75 ms / 127 runs ( 64.51 ms per token, 15.50 tokens per second)
llama_print_timings: total time = 10364.14 ms / 170 tokens
Log end
```
Vulkan:

`.\buildVulkan\bin\Release\main.exe -m .\models\Mistral\dolphin-2.6-mistral-7b.Q4_K_M.gguf -t 12 -tb 6 -ngl 33 -c 4096 -p "This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision." -e -s 0 --temp 0 -n 128`

```
Log start
main: build = 2017 (4db91fdb)
main: built with MSVC 19.37.32825.0 for x64
main: seed = 0
ggml_vulkan: Using AMD Radeon RX 5700 XT | uma: 0 | fp16: 1 | warp size: 64
[...]
This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision.
User: Hi Llama! How are you today?
Llama: Hello there! I'm doing well, thank you for asking. How about yourself?
User: I'm doing great, thanks for asking. So, I have a question about writing. What is the best way to start a story?
Llama: Starting a story can be challenging, but it's essential to grab your reader's attention right from the beginning. A strong opening line or scene that sets the tone and introduces the main character(s) is usually a good approach. You could
llama_print_timings: load time = 3933.92 ms
llama_print_timings: sample time = 27.70 ms / 128 runs ( 0.22 ms per token, 4620.94 tokens per second)
llama_print_timings: prompt eval time = 598.12 ms / 43 tokens ( 13.91 ms per token, 71.89 tokens per second)
llama_print_timings: eval time = 2923.36 ms / 127 runs ( 23.02 ms per token, 43.44 tokens per second)
llama_print_timings: total time = 3574.34 ms / 170 tokens
Log end
```

In case it's relevant:

vulkaninfo --summary:

```
WARNING: [Loader Message] Code 0 : Layer VK_LAYER_RTSS uses API version 1.1 which is older than the application specified API version of 1.3. May cause issues.
==========
VULKANINFO
==========

Vulkan Instance Version: 1.3.261

Instance Extensions: count = 13
-------------------------------
VK_EXT_debug_report                    : extension revision 10
VK_EXT_debug_utils                     : extension revision 2
VK_EXT_swapchain_colorspace            : extension revision 4
VK_KHR_device_group_creation           : extension revision 1
VK_KHR_external_fence_capabilities     : extension revision 1
VK_KHR_external_memory_capabilities    : extension revision 1
VK_KHR_external_semaphore_capabilities : extension revision 1
VK_KHR_get_physical_device_properties2 : extension revision 2
VK_KHR_get_surface_capabilities2       : extension revision 1
VK_KHR_portability_enumeration         : extension revision 1
VK_KHR_surface                         : extension revision 25
VK_KHR_win32_surface                   : extension revision 6
VK_LUNARG_direct_driver_loading        : extension revision 1

Instance Layers: count = 17
---------------------------
VK_LAYER_AMD_switchable_graphics    AMD switchable graphics layer                 1.3.270  version 1
VK_LAYER_EOS_Overlay                Vulkan overlay layer for Epic Online Services 1.2.136  version 1
VK_LAYER_EOS_Overlay                Vulkan overlay layer for Epic Online Services 1.2.136  version 1
VK_LAYER_KHRONOS_profiles           Khronos Profiles layer                        1.3.275  version 1
VK_LAYER_KHRONOS_shader_object      Khronos Shader object layer                   1.3.275  version 1
VK_LAYER_KHRONOS_synchronization2   Khronos Synchronization2 layer                1.3.275  version 1
VK_LAYER_KHRONOS_validation         Khronos Validation Layer                      1.3.275  version 1
VK_LAYER_LUNARG_api_dump            LunarG API dump layer                         1.3.275  version 2
VK_LAYER_LUNARG_gfxreconstruct      GFXReconstruct Capture Layer Version 1.0.2    1.3.275  version 4194306
VK_LAYER_LUNARG_monitor             Execution Monitoring Layer                    1.3.275  version 1
VK_LAYER_LUNARG_screenshot          LunarG image capture layer                    1.3.275  version 1
VK_LAYER_OBS_HOOK                   Open Broadcaster Software hook                1.3.216  version 1
VK_LAYER_RENDERDOC_Capture          Debugging capture layer for RenderDoc         1.2.131  version 17
VK_LAYER_ROCKSTAR_GAMES_social_club Rockstar Games Social Club Layer              1.0.70   version 1
VK_LAYER_RTSS                       RTSS overlay hook bootstrap                   1.1.73   version 1
VK_LAYER_VALVE_steam_fossilize      Steam Pipeline Caching Layer                  1.3.207  version 1
VK_LAYER_VALVE_steam_overlay        Steam Overlay Layer                           1.3.207  version 1

Devices:
========
GPU0:
    apiVersion         = 1.3.270
    driverVersion      = 2.0.294
    vendorID           = 0x1002
    deviceID           = 0x731f
    deviceType         = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
    deviceName         = AMD Radeon RX 5700 XT
    driverID           = DRIVER_ID_AMD_PROPRIETARY
    driverName         = AMD proprietary driver
    driverInfo         = 24.1.1 (AMD proprietary shader compiler)
    conformanceVersion = 1.3.3.1
    deviceUUID         = 00000000-2800-0000-0000-000000000000
    driverUUID         = 414d442d-5749-4e2d-4452-560000000000
```
0cc4m commented 8 months ago

@stduhpf Let's continue here.

> It doesn't seem to fix #5217. It still behaves pretty much the same.

Really? I tried interactive mode with your commands and it was fine.

> Ah, that's very strange then. Maybe it's a GPU architecture-dependent thing, or something is broken with my hardware...

I just tested it again on master and it's also fine for me there. So I probably didn't fix it, since I couldn't reproduce it in the first place. I'm running Linux, and it works fine on Nvidia and AMD GPUs. Any idea what could be causing this issue for you?

stduhpf commented 8 months ago

@0cc4m I'm running Windows 10 on AMD hardware (RX 5700 XT, latest drivers). I have no idea what the root cause could be, maybe some race condition? It happens consistently, but the way it messes up is different each time, even with the same parameters.

stduhpf commented 8 months ago

Ok, so it was working when I first tried your PR, at commit https://github.com/0cc4m/koboldcpp/commit/a5cca6cd8ce0564de7984ca7ebe82c3d960db99e (I still have the build I made back then). It has somehow broken since then. I'll try building at different commits to bisect which change broke it.

stduhpf commented 8 months ago

~~@0cc4m https://github.com/0cc4m/koboldcpp/commit/0f648573dde61c510560f68244f70ece7e60d8c1 is the last working commit for me. It seems that the merge commit https://github.com/0cc4m/koboldcpp/commit/9c4c15add83b3d97b22968c8bc919fd0f71a168a somehow caused this issue (which is bad news, because this commit changed a lot of things).~~

EDIT: Never mind, https://github.com/0cc4m/koboldcpp/commit/0f648573dde61c510560f68244f70ece7e60d8c1 is not working at all, it just falls back to CPU. (I'm too used to working with rebases instead of merges.)

stduhpf commented 8 months ago

Yeah, so https://github.com/0cc4m/koboldcpp/commit/a5cca6cd8ce0564de7984ca7ebe82c3d960db99e works, and https://github.com/0cc4m/koboldcpp/commit/48ad459efcffaeac20444ea1aa169d52c15641ba does not. So the breaking change should be somewhere in between. @0cc4m

Engininja2 commented 8 months ago

I tried Mistral 7B Instruct, which has an n_vocab of 32000, on my RX 5700 XT on Windows and didn't see any problems. Using the same Dolphin model as in your example, which has n_vocab=32001, I ran into similar nondeterministic nonsense responses.

After changing BK from 8 to 16 on this line I get the expected behaviour.

std::initializer_list<uint32_t> warptile_s = { vk_device.subgroup_size,  32,  32,  16, 32, 32, 2, 2, 2, vk_device.subgroup_size };
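
For illustration, here is a minimal before/after sketch of that change. It assumes, as the quoted line suggests, that the fourth value of the warptile list is BK; the variable names `warptile_s_before`/`warptile_s_after` and the stub `vk_device` struct are made up for this sketch and are not the actual surrounding code in ggml-vulkan.cpp:

```cpp
#include <cstdint>
#include <initializer_list>

// Stand-in for the real device struct in ggml-vulkan.cpp (assumption, for illustration only).
struct { uint32_t subgroup_size = 64; } vk_device;

// Assumption: the fourth element of the warptile list is BK, the K-depth of the
// shared-memory tile used by the matrix-multiplication shader.

// Before the fix: BK = 8 (reportedly broken on the AMD Windows driver and AMDVLK
// with this model's n_vocab of 32001).
std::initializer_list<uint32_t> warptile_s_before = { vk_device.subgroup_size, 32, 32,  8, 32, 32, 2, 2, 2, vk_device.subgroup_size };

// After the fix that went into #5223: BK doubled to 16.
std::initializer_list<uint32_t> warptile_s_after  = { vk_device.subgroup_size, 32, 32, 16, 32, 32, 2, 2, 2, vk_device.subgroup_size };
```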

Instead of that change, doubling the size of buf_a and buf_b in mulmat_body in the shaders worked too, though with worse prompt processing speed. The same goes for replacing both vk_device.subgroup_size values with 32.

Edit: interestingly, on Arch Linux the RADV driver doesn't appear to run into this issue, but AMDVLK does.

0cc4m commented 8 months ago

@stduhpf Thanks for figuring out the source commit! Really helpful.

@Engininja2 Wow, you found it. I was able to reproduce it with amdvlk. I have no clue why the AMD Windows driver and amdvlk fail with that shader when it works on Nvidia and RADV, but changing BK to 16 seems like a simple fix. I added the fix to #5223; it worked for me on amdvlk. Can you try it on Windows?

stduhpf commented 8 months ago

Yep, https://github.com/ggerganov/llama.cpp/pull/5223 fixes it now, thank you!