Closed: jasowang closed this issue 1 month ago
To be fair, I'm not even completely sure that MoltenVK can even run llama.cpp. In any case, have you tried running without the address sanitizer?
MoltenVK can run llama.cpp; one of my colleagues is able to run llama.cpp in a guest through the virtio-gpu transport of Vulkan on a Mac. I've tried without the sanitizer; it crashes like:
vulkan_stream(93430,0x16d1b3000) malloc: Incorrect checksum for freed object 0x152859a00: probably modified after being freed. Corrupt value: 0x2c5d302c3531312c
vulkan_stream(93430,0x16d1b3000) malloc: *** set a breakpoint in malloc_error_break to debug
zsh: abort ./vulkan_stream
Thanks
virtio-gpu transport of vulkan
On QEMU?
I'll look into the problem after I fix a segfault in vulkaninfo (who knows, maybe they're the same issue!)
virtio-gpu transport of vulkan
On QEMU?
Yes, using Venus in the guest plus virglrender + MoltenVK on the host.
I'll look into the problem after I fix a segfault in vulkaninfo (who knows, maybe they're the same issue!)
One note: llama.cpp can get a little further if I switch back to using
commit 439c9f702cad83e5e5d7a50c7a783bff8719d773 (HEAD)
Author: DUOLabs333 <dvdugo333@gmail.com>
Date:   Thu Mar 28 23:38:34 2024 -0400
Update README with more information about the minimum version needed for the Vulkan loader
But then llama.cpp reports:
ggml_backend_alloc_ctx_tensors_from_buft: tensor output_norm.weight is too large to fit in a Vulkan0 buffer (tensor size: 4096, max buffer size: 0)
main: error: failed to load model '/home/devel/git/qwen1_5-0_5b-chat-q2_k.gguf'
which suggests something is wrong when reporting the max buffer size.
Thanks
Yes, using Venus in the guest plus virglrender + MoltenVK on the host.
Interesting --- that was my original plan, but I gave up on it after seeing the amount of changes needed to make it compatible with MacOS. Is this work public anywhere?
which suggests something is wrong when reporting the max buffer size.
Yeah, this seems like a similar issue that was fixed by a later commit (I saw the same error with wezterm).
Yes, using Venus in the guest plus virglrender + MoltenVK on the host.
Interesting --- that was my original plan, but I gave up on it after seeing the amount of changes needed to make it compatible with MacOS. Is this work public anywhere?
See this:
https://sinrega.org/2024-03-06-enabling-containers-gpu-macos/
which suggests something is wrong when reporting the max buffer size.
Yeah, this seems like a similar issue that was fixed by a later commit (I saw the same error with wezterm).
Btw, llama.cpp may try to upload models that can be several GB. In that case, serializing them via JSON and deserializing via simdjson seems slow.
Thanks
Btw, llama.cpp may try to upload models that can be several GB. In that case, serializing them via JSON and deserializing via simdjson seems slow.
This is a fundamental problem with my current approach --- since we are using qemu's vmnet device for communication between host and guest, we are limited to the speed of the network, which isn't great. TAP devices are much better (you can see the difference when running something like vkquake), and gvisor-tap-vsock isn't too bad either (better than vmnet, but worse than tap).
However, none of them are truly fast enough to get good performance (if I recall, tap devices get around ~4GB/s on MacOS, while we should be getting 10GB/s if we want vulkan-stream to really be feasible).
There is some work to port vhost-user-vsock to MacOS, which should enable much faster speeds, but I wouldn't hold my breath (at least for a while) --- vhost-user-vsock currently assumes that it will only run on a Linux host, which means that making it available on MacOS hosts is a non-trivial undertaking.
https://sinrega.org/2024-03-06-enabling-containers-gpu-macos/
Oh, I thought you were referring to something else: this isn't running on QEMU, but on another virtualization project.
Btw, llama.cpp may try to upload models that can be several GB. In that case, serializing them via JSON and deserializing via simdjson seems slow.
This is a fundamental problem with my current approach --- since we are using qemu's vmnet device for communication between host and guest, we are limited to the speed of the network, which isn't great. TAP devices are much better (you can see the difference when running something like vkquake), and gvisor-tap-vsock isn't too bad either (better than vmnet, but worse than tap).
It seems it's not hard to make the server code run on top of Linux, so this project could have broader use cases.
However, none of them are truly fast enough to get good performance (if I recall, tap devices get around ~4GB/s on MacOS, while we should be getting 10GB/s if we want vulkan-stream to really be feasible).
Probably, but my experiment is not classical virtualization; I'm trying to use vulkan-stream to make use of a remote (GP)GPU. I've done a hack to bypass the tensor size check, and I found the server spends a lot of time in simdjson's iterate for a buffer of around 100MB.
The network issue could be addressed by a high-speed network like RDMA or TCP device memory (DMA directly from the NIC to the GPU).
There is some work to port vhost-user-vsock to MacOS, which should enable much faster speeds, but I wouldn't hold my breath (at least for a while) --- vhost-user-vsock currently assumes that it will only run on a Linux host, which means that making it available on MacOS hosts is a non-trivial undertaking.
Or just reuse vhost-user-net.
https://sinrega.org/2024-03-06-enabling-containers-gpu-macos/
Oh, I thought it was something else: this isn't running on QEMU, but on another virtualization project.
True, but technically it could be used by QEMU.
Thanks
It seems it's not hard to make the server code run on top of Linux, so this project could have broader use cases.
That was/is the goal --- allowing users to access GPUs across a network/socket.
Probably, but my experiment is not classical virtualization; I'm trying to use vulkan-stream to make use of a remote (GP)GPU.
That's great --- this is definitely what I'm working towards (if I could get all of the bugs worked out first :))
lot of time in simdjson's iterating for a buffer around 100MB.
Are you sure it's simdjson, and not the networking code around it? simdjson should be quite fast, even for large files (in fact, it's optimized for large files, and less for small ones).
The network issue could be addressed by a high-speed network like RDMA or TCP device memory (DMA directly from the NIC to the GPU).
Are any of these accessible to MacOS? I'll be honest, I don't know much about the SOTA in this area.
vhost-user-net
From what I could find, vhost-user-net is also Linux-specific. Is there a specific project that I should look at?
It seems it's not hard to make the server code run on top of Linux, so this project could have broader use cases.
That was/is the goal --- allowing users to access GPUs across a network/socket.
Great.
Probably, but my experiment is not classical virtualization; I'm trying to use vulkan-stream to make use of a remote (GP)GPU.
That's great --- this is definitely what I'm working towards (if I could get all of the bugs worked out first :))
:)
lot of time in simdjson's iterating for a buffer around 100MB.
Are you sure it's simdjson, and not the networking code around it? simdjson should be quite fast, even for large files (in fact, it's optimized for large files, and less for small ones).
Yes, I can confirm it is simdjson iteration. I can give you some logs on this.
The network issue could be addressed by a high-speed network like RDMA or TCP device memory (DMA directly from the NIC to the GPU).
Are any of these easily accessible to MacOS? I'll be honest, I don't know much about the SOTA in this area.
Nope, those are all for Linux probably.
vhost-user-net
From what I could find, vhost-user-net is also Linux-specific. Is there a specific project that I should look at?
It should not be (at least by design). Another colleague is looking into making vhost-user work on any POSIX system.
Here's the reference:
https://patchew.org/QEMU/20240528103543.145412-1-sgarzare@redhat.com/
Thanks
Yes, I can confirm it is simdjson iteration. I can give you some logs on this.
I would like to see the logs, in that case. It's possible that I'm accessing the fields in a different order than it was sent in, which is a failure mode of simdjson --- that shouldn't happen though.
https://patchew.org/QEMU/20240528103543.145412-1-sgarzare@redhat.com/
Yeah, that's the vhost-user porting work I was referring to earlier. However, even if it's merged, getting vhost-user-net to work on MacOS is still a non-trivial task (Googling also brings up vhost-net, though I'm not completely sure what that is).
As an aside, can you try to run llama.cpp now with HEAD? I got vulkaninfo to work by enabling beta extensions.
To address the speed issue, at least for the virtualization usecase, I'm thinking about sharing a block device between host and guest, so they can transfer information directly, without going through the network. There are still some details that need to be worked out though.
Yes, I can confirm it is simdjson iteration. I can give you some logs on this.
I would like to see the logs, in that case. It's possible that I'm accessing the fields in a different order than it was sent in, which is a failure mode of simdjson --- that shouldn't happen though.
Let me collect some and update here.
https://patchew.org/QEMU/20240528103543.145412-1-sgarzare@redhat.com/
Yeah, that's the vhost-user porting work I was referring to earlier. However, even if it's merged, getting vhost-user-net to work on MacOS is still a non-trivial task (Googling also brings up vhost-net, though I'm not completely sure what that is).
Which part is missing? (I guess you mean eventfd?)
For vhost-net, it seems hard unless virtualization.framework supports that.
Thanks
As an aside, can you try to run llama.cpp now with HEAD? I got vulkaninfo to work by enabling beta extensions.
It still crashes in deserialize_struct() [1].
To address the speed issue, at least for the virtualization usecase, I'm thinking about sharing a block device between host and guest, so they can transfer information directly, without going through the network. There are still some details that need to be worked out though.
I wonder what the advantage is over just reusing virtio-gpu here? Or something like a shared-memory device (virtio-pmem or virtio-mem)? A block device would still be slow because of things like the page cache and block layers.
[1]
=================================================================
==280==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x00010c817f70 at pc 0x000105774c20 bp 0x00016bc2da40 sp 0x00016bc2da38
WRITE of size 1 at 0x00010c817f70 thread T1
#1 0x104c18594 in deserialize_struct(boost::json::object&, VkExtensionProperties&)::$_1760::operator()() const Serialization.cpp:24131
#2 0x104c17fe8 in deserialize_struct(boost::json::object&, VkExtensionProperties&) Serialization.cpp:24128
#3 0x104942c14 in handle_vkEnumerateDeviceExtensionProperties(boost::json::object&)::$_103::operator()() const::'lambda'()::operator()() const Commands.cpp:1556
#4 0x104293a98 in handle_vkEnumerateDeviceExtensionProperties(boost::json::object&)::$_103::operator()() const Commands.cpp:1554
#5 0x104291554 in handle_vkEnumerateDeviceExtensionProperties(boost::json::object&) Commands.cpp:1548
#6 0x104927a38 in handle_command(boost::json::object) Commands.cpp:43334
#7 0x105ff7e74 in handleConnection(asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*) Server.cpp:58
#8 0x10602daf4 in decltype(std::declval<void (*)(asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*)>()(std::declval<asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*>())) std::__1::__invoke[abi:v15006]<void (*)(asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*), asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*>(void (*&&)(asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*), asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*&&) invoke.h:394
#9 0x10602da44 in void std::__1::__thread_execute[abi:v15006]<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void (*)(asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*), asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*, 2ul>(std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void (*)(asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*), asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*>&, std::__1::__tuple_indices<2ul>) thread:290
#10 0x10602d470 in void* std::__1::__thread_proxy[abi:v15006]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void (*)(asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*), asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*>>(void*) thread:301
#11 0x19eabaf90 in _pthread_start+0x84 (libsystem_pthread.dylib:arm64e+0x6f90) (BuildId: 45239f06cc5336d099337776ac7ea2fa32000000200000000100000000040e00)
#12 0x19eab5d30 in thread_start+0x4 (libsystem_pthread.dylib:arm64e+0x1d30) (BuildId: 45239f06cc5336d099337776ac7ea2fa32000000200000000100000000040e00)
0x00010c817f70 is located 0 bytes to the right of 23920-byte region [0x00010c812200,0x00010c817f70)
allocated by thread T1 here:
#1 0x104293694 in handle_vkEnumerateDeviceExtensionProperties(boost::json::object&)::$_103::operator()() const Commands.cpp:1551
#2 0x104291554 in handle_vkEnumerateDeviceExtensionProperties(boost::json::object&) Commands.cpp:1548
#3 0x104927a38 in handle_command(boost::json::object) Commands.cpp:43334
#4 0x105ff7e74 in handleConnection(asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*) Server.cpp:58
#5 0x10602daf4 in decltype(std::declval<void (*)(asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*)>()(std::declval<asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*>())) std::__1::__invoke[abi:v15006]<void (*)(asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*), asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*>(void (*&&)(asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*), asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*&&) invoke.h:394
#6 0x10602da44 in void std::__1::__thread_execute[abi:v15006]<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void (*)(asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*), asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*, 2ul>(std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void (*)(asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*), asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*>&, std::__1::__tuple_indices<2ul>) thread:290
#7 0x10602d470 in void* std::__1::__thread_proxy[abi:v15006]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void (*)(asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*), asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*>>(void*) thread:301
#8 0x19eabaf90 in _pthread_start+0x84 (libsystem_pthread.dylib:arm64e+0x6f90) (BuildId: 45239f06cc5336d099337776ac7ea2fa32000000200000000100000000040e00)
#9 0x19eab5d30 in thread_start+0x4 (libsystem_pthread.dylib:arm64e+0x1d30) (BuildId: 45239f06cc5336d099337776ac7ea2fa32000000200000000100000000040e00)
Thread T1 created by T0 here:
#1 0x105fb3628 in std::__1::__libcpp_thread_create[abi:v15006](_opaque_pthread_t**, void* (*)(void*), void*) __threading_support:376
#2 0x10602d0b0 in std::__1::thread::thread<void (&)(asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*), asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*&, void>(void (&)(asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*), asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*&) thread:317
#3 0x105ffb21c in std::__1::thread::thread<void (&)(asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*), asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*&, void>(void (&)(asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*), asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*&) thread:309
#4 0x105fface0 in startServer() Server.cpp:79
#5 0x104929f74 in main Commands.cpp:127723
#6 0x19e7320dc (<unknown module>)
SUMMARY: AddressSanitizer: heap-buffer-overflow Serialization.cpp:24135 in deserialize_struct(boost::json::object&, VkExtensionProperties&)::$_1760::operator()() const::'lambda'()::operator()() const
Shadow bytes around the buggy address:
  0x007021922f90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x007021922fa0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x007021922fb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x007021922fc0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x007021922fd0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x007021922fe0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00[fa]fa
  0x007021922ff0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x007021923000: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x007021923010: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x007021923020: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x007021923030: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:     fa
  Freed heap region:     fd
  Stack left redzone:    f1
  Stack mid redzone:     f2
  Stack right redzone:   f3
  Stack after return:    f5
  Stack use after scope: f8
  Global redzone:        f9
  Global init order:     f6
  Poisoned by user:      f7
  Container overflow:    fc
  Array cookie:          ac
  Intra object redzone:  bb
  ASan internal:         fe
  Left alloca redzone:   ca
  Right alloca redzone:  cb
==280==ABORTING
zsh: abort ./vulkan_stream
I wonder what the advantage is over just reusing virtio-gpu here? Or something like a shared-memory device (virtio-pmem or virtio-mem)?
A block device will allow for a separate channel of communication that is much easier to reason about than trying to communicate over the RAM of a guest (MacOS hosts do not have access to ivshmem).
Could you compile with DEBUG=1 and upload the dump (along with the stdout)?
Yes, I can confirm it is simdjson iteration. I can give you some logs on this.
I would like to see the logs, in that case. It's possible that I'm accessing the fields in a different order than it was sent in, which is a failure mode of simdjson --- that shouldn't happen though.
Let me collect some and update here.
Here it is. Adding the following lines to Server.cpp:

std::cerr << "simd json iterate!" << std::endl;
auto line = std::string_view(input, input_size);
std::cerr << "input size " << input_size << " line:" << line << std::endl;
auto simdjson_line = simdjson::padded_string(line);
std::time_t result = std::time(nullptr);
std::cerr << "Time before iterate: " << std::asctime(std::localtime(&result));
auto doc = curr->simdparser.iterate(simdjson_line);
result = std::time(nullptr);
std::cerr << "The after iterate: " << std::asctime(std::localtime(&result));
I get:

input size 170168470 line:{"devicememory":5047288640,"mem":5101142016,"hashes":[],"lengths":[127626240],"starts":[0],"buffers":...,"unmap":false,"stream_type":0,"uuid":358172}
Time before iterate: Thu May 30 09:41:10 2024
The after iterate: Thu May 30 09:41:19 2024
The buffer is about 170MB, and it takes about 9 seconds to finish the iterate. Is this expected?
If we can avoid iterating large buffers, it would be much faster.
Thanks
https://patchew.org/QEMU/20240528103543.145412-1-sgarzare@redhat.com/
Yeah, that's the vhost-user porting work I was referring to earlier. However, even if it's merged, getting vhost-user-net to work on MacOS is still a non-trivial task (Googling also brings up vhost-net, though I'm not completely sure what that is).
Which part is missing? (I guess you mean eventfd?)
For vhost-net, it seems hard unless virtualization.framework support that.
Thanks
I wonder what the advantage is over just reusing virtio-gpu here? Or something like a shared-memory device (virtio-pmem or virtio-mem)?
A block device will allow for a separate channel of communication that is much easier to reason about than trying to communicate over the RAM of a guest (MacOS hosts do not have access to ivshmem).
Could you compile with DEBUG=1 and upload the dump (along with the stdout)?
See the attached typescript.txt.
Interesting --- the crash happens at a place where there shouldn't be a crash. I'll have to use a smaller model though for testing.
Interesting --- the crash happens at a place where there shouldn't be a crash. I'll have to use a smaller model though for testing.
I switched to a very tiny model which is 29MB: https://huggingface.co/ChristianAzinn/gte-small-gguf/blob/main/gte-small.Q5_0.gguf
But I get the same crash.
Thanks
Interesting --- on my machine, llama.cpp works fine (I compiled with vulkan support, verified that it's using my driver, and just ran the model). I don't get any crash on my end.
I should note that this isn't the first time a project runs on my side but not for the person who opened the issue (see #1).
However, after some more testing, I can get the same error to happen sometimes on the host (you didn't state whether the original error was on the server or the client) --- it is hard to debug, though, as it is inconsistent between runs.
However, after some more testing, I can get the same error to happen sometimes on the host (you didn't state whether the original error was on the server or the client) --- it is hard to debug, though, as it is inconsistent between runs.
The crash happens on the server side.
One interesting thing is that llama.cpp doesn't crash or report errors if I just run a model. The crashes/errors only happen when I run llama-bench.
Thanks
Interesting --- on my machine, llama.cpp works fine (I compiled with vulkan support, verified that it's using my driver, and just ran the model). I don't get any crash on my end.
Are you testing with llama-bench?
Thanks
I should note that this isn't the first time a project runs on my side but not for the person who opened the issue (see #1).
A possible reason is differing versions of the various layers.
Thanks
With llama-bench, I can successfully get to the first line:

Vulkan0: Apple M1 | uma: 1 | fp16: 1 | warp size: 32
| bert 33M Q5_0 | 26.78 MiB | 33.21 M | Vulkan | 99 | pp512 | 347.42 ± 7.18 |
After that, it hangs --- I can't tell whether this is a problem with the driver, as using llvmpipe exhibits the same behavior.
So the crash seems to be an out-of-bounds access. I hacked deserialize_struct() like this, and llama-bench can now finish with HEAD:
void deserialize_struct(boost::json::object& json, VkExtensionProperties& member){
    auto& extensionName_json = json["extensionName"];
    [&](){
        auto& arr_HbZAuVY = extensionName_json.get_array();
        // Clamp the element count so writes stay within extensionName[]
        int max_elements = arr_HbZAuVY.size();
        if (max_elements > VK_MAX_EXTENSION_NAME_SIZE - 1)
            max_elements = VK_MAX_EXTENSION_NAME_SIZE - 1;
        for(int QYXuVZG = 0; QYXuVZG < max_elements; QYXuVZG++){
            [&](){
Does this ring any bells?
Btw, for llama-bench, it's just that the other tests take a lot of time to run with vulkan-stream. E.g. I can finish llama-bench in about 1.5 hours:
# VK_ICD_FILENAMES=/home/devel/git/vulkan-stream/stream_icd.aarch64.json ./build/bin/llama-bench -m ../gte-small.Q5_0.gguf
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: Apple M1 | uma: 1 | fp16: 1 | warp size: 32
| bert 33M Q5_0 | 26.78 MiB | 33.21 M | Vulkan | 99 | pp512 | 132.39 ± 17.11 |
| bert 33M Q5_0 | 26.78 MiB | 33.21 M | Vulkan | 99 | tg128 | 0.25 ± 0.03 |
build: 271ff3fc (3020)
Thanks
Hmmm, that shouldn't matter --- otherwise, that would mean that MoltenVK is not up to spec.
Btw, I get the following errors from MoltenVK on the server:
[mvk-error] VK_ERROR_MEMORY_MAP_FAILED: Memory is not mapped. Call vkMapMemory() first.
DeviceMemory unmapping in progress...
[mvk-error] VK_ERROR_MEMORY_MAP_FAILED: Memory is not mapped. Call vkMapMemory() first.
[mvk-error] VK_ERROR_MEMORY_MAP_FAILED: Memory is not mapped. Call vkMapMemory() first.
DeviceMemory unmapping in progress...
DeviceMemory unmapping in progress...
Thanks
Yeah, you can ignore those --- I don't check whether the memory is mapped before unmapping.
I think I see what the problem is, but I'll have to check it out when I get home.
In any case, have you tried running without the address sanitizer?
I know you posted this already, but can you post the full log of the crash without the address sanitizer? I just want to verify it's the same bug.
I created a fix --- I'll push it when I have access to Internet again.
Can you confirm that HEAD fixes the issue?
I also fixed the issue involving VK_ERROR_MEMORY_MAP_FAILED: Memory is not mapped.
Can you confirm that HEAD fixes the issue?
Yes, I confirm it fixes the issue.
Thanks
I also fixed the issue involving VK_ERROR_MEMORY_MAP_FAILED: Memory is not mapped.
I can still see this on the server side.
Thanks
Sorry, I was using the wrong map --- try again now.
I'll close this issue now. Make a new one if you are still experiencing the same issue.
The server crashes when I try to run the llama.cpp bench.
Steps to reproduce:
1) git clone https://github.com/ggerganov/llama.cpp.git
2) build llama.cpp with Vulkan support according to https://github.com/ggerganov/llama.cpp
3) download qwen1_5-0_5b-chat-q2_k.gguf from https://huggingface.co/Qwen/Qwen1.5-0.5B-Chat-GGUF
4) enable the address sanitizer on the server side: FLAGS = FLAGS + ["-fsanitize=address"]
5) build the server and client
6) run llama-bench:
VK_ICD_FILENAMES=/home/devel/git/vulkan-stream/stream_icd.aarch64.json ./build/bin/llama-bench -m /home/devel/git/qwen1_5-0_5b-chat-q2_k.gguf -n 1
7) the sanitizer reports the crash on the server: