Closed: jasowang closed this issue 1 month ago
To be fair, I'm not even completely sure that MoltenVK can even run llama.cpp. In any case, have you tried running without the address sanitizer?
MoltenVK can run llama.cpp; one of my colleagues is able to run llama.cpp in a guest through the virtio-gpu transport of Vulkan on a Mac. I've tried without the sanitizer; it crashes like:
vulkan_stream(93430,0x16d1b3000) malloc: Incorrect checksum for freed object 0x152859a00: probably modified after being freed. Corrupt value: 0x2c5d302c3531312c
vulkan_stream(93430,0x16d1b3000) malloc: *** set a breakpoint in malloc_error_break to debug
zsh: abort ./vulkan_stream
Thanks
virtio-gpu transport of vulkan
On QEMU?
I'll look into the problem after I fix a segfault in vulkaninfo (who knows, maybe they're the same issue!)
virtio-gpu transport of vulkan
On QEMU?
Yes, using Venus in the guest plus virglrender + MoltenVK on the host.
I'll look into the problem after I fix a segfault in vulkaninfo (who knows, maybe they're the same issue!)
One note: llama.cpp can get a little further if I switch back to using
commit 439c9f702cad83e5e5d7a50c7a783bff8719d773 (HEAD)
Author: DUOLabs333 <dvdugo333@gmail.com>
Date:   Thu Mar 28 23:38:34 2024 -0400
Update README with more information about the minimum version needed for the Vulkan loader
But then llama.cpp reports:
ggml_backend_alloc_ctx_tensors_from_buft: tensor output_norm.weight is too large to fit in a Vulkan0 buffer (tensor size: 4096, max buffer size: 0)
main: error: failed to load model '/home/devel/git/qwen1_5-0_5b-chat-q2_k.gguf'
which suggests something is wrong when reporting the max buffer size.
Thanks
Yes, using Venus in the guest plus virglrender + MoltenVK on the host.
Interesting --- that was my original plan, but I gave up on it after seeing the amount of changes needed to make it compatible with MacOS. Is this work public anywhere?
which suggests something is wrong when reporting the max buffer size.
Yeah, this seems like a similar issue that was fixed by a later commit (I saw the same error with wezterm).
Yes, using Venus in the guest plus virglrender + MoltenVK on the host.
Interesting --- that was my original plan, but I gave up on it after seeing the amount of changes needed to make it compatible with MacOS. Is this work public anywhere?
See this:
https://sinrega.org/2024-03-06-enabling-containers-gpu-macos/
which suggests something is wrong when reporting the max buffer size.
Yeah, this seems like a similar issue that was fixed by a later commit (I saw the same error with wezterm).
Btw, llama.cpp may try to upload models that can be several GB. In that case, serializing them via JSON and deserializing via simdjson seems slow.
Thanks
Btw, llama.cpp may try to upload models that can be several GB. In that case, serializing them via JSON and deserializing via simdjson seems slow.
This is a fundamental problem with my current approach --- since we are using qemu's vmnet device for communication between host and guest, we are limited to the speed of the network, which isn't great. TAP devices are much better (you can see the difference when running something like vkquake), and gvisor-tap-vsock isn't too bad either (better than vmnet, but worse than tap).
However, none of them are truly fast enough to get good performance (if I recall, tap devices get around ~4GB/s on MacOS, while we should be getting 10GB/s if we want vulkan-stream to really be feasible).
There is some work to port vhost-user-vsock to MacOS, which should enable much faster speeds, but I wouldn't hold my breath (at least for a while) --- vhost-user-vsock currently assumes that it will only run on a Linux host, which means that making it available on MacOS hosts is a non-trivial undertaking.
https://sinrega.org/2024-03-06-enabling-containers-gpu-macos/
Oh, I thought you were referring to something else: this isn't running on QEMU, but on another virtualization project.
Btw, llama.cpp may try to upload models that can be several GB. In that case, serializing them via JSON and deserializing via simdjson seems slow.
This is a fundamental problem with my current approach --- since we are using qemu's vmnet device for communication between host and guest, we are limited to the speed of the network, which isn't great. TAP devices are much better (you can see the difference when running something like vkquake), and gvisor-tap-vsock isn't too bad either (better than vmnet, but worse than tap).
It seems it's not hard to make the server code run on top of Linux, so this project could have broader use cases.
However, none of them are truly fast enough to get good performance (if I recall, tap devices get around ~4GB/s on MacOS, while we should be getting 10GB/s if we want vulkan-stream to really be feasible).
Probably, but my experiment is not classical virtualization; I'm trying to use vulkan-stream to make use of a remote (GP)GPU. I've done a hack to bypass the tensor size check, and I found the server spends a lot of time in simdjson's iterate for a buffer of around 100MB.
The network issue could be addressed by a high-speed network like RDMA or TCP device memory (DMA directly from the NIC to the GPU).
There is some work to port vhost-user-vsock to MacOS, which should enable much faster speeds, but I wouldn't hold my breath (at least for a while) --- vhost-user-vsock currently assumes that it will only run on a Linux host, which means that making it available on MacOS hosts is a non-trivial undertaking.
Or just reuse vhost-user-net.
https://sinrega.org/2024-03-06-enabling-containers-gpu-macos/
Oh, I thought it was something else: this isn't running on QEMU, but on another virtualization project.
True, but technically it could be used by QEMU.
Thanks
It seems it's not hard to make the server code run on top of Linux, so this project could have broader use cases.
That was/is the goal --- allowing users to access GPUs across a network/socket.
Probably, but my experiment is not classical virtualization; I'm trying to use vulkan-stream to make use of a remote (GP)GPU.
That's great --- this is definitely what I'm working towards (if I could get all of the bugs worked out first :))
lot of time in simdjson's iterating for a buffer around 100MB.
Are you sure it's simdjson, and not the networking code around it? simdjson should be quite fast, even for large files (in fact, it's optimized for large files, and less for small ones).
The network issue could be addressed by a high-speed network like RDMA or TCP device memory (DMA directly from the NIC to the GPU).
Are any of these accessible to MacOS? I'll be honest, I don't know much about the SOTA in this area.
vhost-user-net
From what I could find, vhost-user-net is also Linux-specific. Is there a specific project that I should look at?
It seems it's not hard to make the server code run on top of Linux, so this project could have broader use cases.
That was/is the goal --- allowing users to access GPUs across a network/socket.
Great.
Probably, but my experiment is not classical virtualization; I'm trying to use vulkan-stream to make use of a remote (GP)GPU.
That's great --- this is definitely what I'm working towards (if I could get all of the bugs worked out first :))
:)
lot of time in simdjson's iterating for a buffer around 100MB.
Are you sure it's simdjson, and not the networking code around it? simdjson should be quite fast, even for large files (in fact, it's optimized for large files, and less for small ones).
Yes, I can confirm it is simdjson iteration. I can give you some logs on this.
The network issue could be addressed by a high-speed network like RDMA or TCP device memory (DMA directly from the NIC to the GPU).
Are any of these easily accessible to MacOS? I'll be honest, I don't know much about the SOTA in this area.
Nope, those are all for Linux probably.
vhost-user-net
From what I could find, vhost-user-net is also Linux-specific. Is there a specific project that I should look at?
It should not be (at least by design). Another colleague is looking into making vhost-user work on any POSIX system.
Here's the reference:
https://patchew.org/QEMU/20240528103543.145412-1-sgarzare@redhat.com/
Thanks
Yes, I can confirm it is simdjson iteration. I can give you some logs on this.
I would like to see the logs, in that case. It's possible that I'm accessing the fields in a different order than it was sent in, which is a failure mode of simdjson --- that shouldn't happen though.
https://patchew.org/QEMU/20240528103543.145412-1-sgarzare@redhat.com/
Yeah, that's the vhost-user porting work I was referring to earlier. However, even if it's merged, getting vhost-user-net to work on MacOS is still a non-trivial task (Googling also brings up vhost-net, though I'm not completely sure what that is).
As an aside, can you try to run llama.cpp now with HEAD? I got vulkaninfo to work by enabling beta extensions.
To address the speed issue, at least for the virtualization usecase, I'm thinking about sharing a block device between host and guest, so they can transfer information directly, without going through the network. There are still some details that need to be worked out though.
Yes, I can confirm it is simdjson iteration. I can give you some logs on this.
I would like to see the logs, in that case. It's possible that I'm accessing the fields in a different order than it was sent in, which is a failure mode of simdjson --- that shouldn't happen though.
Let me collect some and update here.
https://patchew.org/QEMU/20240528103543.145412-1-sgarzare@redhat.com/
Yeah, that's the vhost-user porting work I was referring to earlier. However, even if it's merged, getting vhost-user-net to work on MacOS is still a non-trivial task (Googling also brings up vhost-net, though I'm not completely sure what that is).
Which part is missing? (I guess you mean eventfd?)
For vhost-net, it seems hard unless virtualization.framework supports that.
Thanks
As an aside, can you try to run llama.cpp now with HEAD? I got vulkaninfo to work by enabling beta extensions.
It still crashes in deserialize_struct() [1].
To address the speed issue, at least for the virtualization usecase, I'm thinking about sharing a block device between host and guest, so they can transfer information directly, without going through the network. There are still some details that need to be worked out though.
I wonder what the advantage is over just reusing virtio-gpu here? Or something like a shared-memory device (virtio-pmem or virtio-mem)? A block device would still be slow because of things like the page cache and block layers.
[1]
=================================================================
==280==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x00010c817f70 at pc 0x000105774c20 bp 0x00016bc2da40 sp 0x00016bc2da38
WRITE of size 1 at 0x00010c817f70 thread T1
#1 0x104c18594 in deserialize_struct(boost::json::object&, VkExtensionProperties&)::$_1760::operator()() const Serialization.cpp:24131
#2 0x104c17fe8 in deserialize_struct(boost::json::object&, VkExtensionProperties&) Serialization.cpp:24128
#3 0x104942c14 in handle_vkEnumerateDeviceExtensionProperties(boost::json::object&)::$_103::operator()() const::'lambda'()::operator()() const Commands.cpp:1556
#4 0x104293a98 in handle_vkEnumerateDeviceExtensionProperties(boost::json::object&)::$_103::operator()() const Commands.cpp:1554
#5 0x104291554 in handle_vkEnumerateDeviceExtensionProperties(boost::json::object&) Commands.cpp:1548
#6 0x104927a38 in handle_command(boost::json::object) Commands.cpp:43334
#7 0x105ff7e74 in handleConnection(asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*) Server.cpp:58
#8 0x10602daf4 in decltype(std::declval<void (*)(asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*)>()(std::declval<asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*>())) std::__1::__invoke[abi:v15006]<void (*)(asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*), asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*>(void (*&&)(asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*), asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*&&) invoke.h:394
#9 0x10602da44 in void std::__1::__thread_execute[abi:v15006]<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void (*)(asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*), asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*, 2ul>(std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void (*)(asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*), asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*>&, std::__1::__tuple_indices<2ul>) thread:290
#10 0x10602d470 in void* std::__1::__thread_proxy[abi:v15006]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void (*)(asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*), asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*>>(void*) thread:301
#11 0x19eabaf90 in _pthread_start+0x84 (libsystem_pthread.dylib:arm64e+0x6f90) (BuildId: 45239f06cc5336d099337776ac7ea2fa32000000200000000100000000040e00)
#12 0x19eab5d30 in thread_start+0x4 (libsystem_pthread.dylib:arm64e+0x1d30) (BuildId: 45239f06cc5336d099337776ac7ea2fa32000000200000000100000000040e00)
0x00010c817f70 is located 0 bytes to the right of 23920-byte region [0x00010c812200,0x00010c817f70)
allocated by thread T1 here:
#1 0x104293694 in handle_vkEnumerateDeviceExtensionProperties(boost::json::object&)::$_103::operator()() const Commands.cpp:1551
#2 0x104291554 in handle_vkEnumerateDeviceExtensionProperties(boost::json::object&) Commands.cpp:1548
#3 0x104927a38 in handle_command(boost::json::object) Commands.cpp:43334
#4 0x105ff7e74 in handleConnection(asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*) Server.cpp:58
#5 0x10602daf4 in decltype(std::declval<void (*)(asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*)>()(std::declval<asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*>())) std::__1::__invoke[abi:v15006]<void (*)(asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*), asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*>(void (*&&)(asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*), asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*&&) invoke.h:394
#6 0x10602da44 in void std::__1::__thread_execute[abi:v15006]<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void (*)(asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*), asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*, 2ul>(std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void (*)(asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*), asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*>&, std::__1::__tuple_indices<2ul>) thread:290
#7 0x10602d470 in void* std::__1::__thread_proxy[abi:v15006]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void (*)(asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*), asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*>>(void*) thread:301
#8 0x19eabaf90 in _pthread_start+0x84 (libsystem_pthread.dylib:arm64e+0x6f90) (BuildId: 45239f06cc5336d099337776ac7ea2fa32000000200000000100000000040e00)
#9 0x19eab5d30 in thread_start+0x4 (libsystem_pthread.dylib:arm64e+0x1d30) (BuildId: 45239f06cc5336d099337776ac7ea2fa32000000200000000100000000040e00)
Thread T1 created by T0 here:
#1 0x105fb3628 in std::__1::__libcpp_thread_create[abi:v15006](_opaque_pthread_t**, void* (*)(void*), void*) __threading_support:376
#2 0x10602d0b0 in std::__1::thread::thread<void (&)(asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*), asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*&, void>(void (&)(asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*), asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*&) thread:317
#3 0x105ffb21c in std::__1::thread::thread<void (&)(asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*), asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*&, void>(void (&)(asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*), asio::basic_stream_socket<asio::ip::tcp, asio::any_io_executor>*&) thread:309
#4 0x105fface0 in startServer() Server.cpp:79
#5 0x104929f74 in main Commands.cpp:127723
#6 0x19e7320dc (<unknown module>)
SUMMARY: AddressSanitizer: heap-buffer-overflow Serialization.cpp:24135 in deserialize_struct(boost::json::object&, VkExtensionProperties&)::$_1760::operator()() const::'lambda'()::operator()() const
Shadow bytes around the buggy address:
  0x007021922f90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x007021922fa0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x007021922fb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x007021922fc0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x007021922fd0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x007021922fe0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00[fa]fa
  0x007021922ff0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x007021923000: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x007021923010: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x007021923020: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x007021923030: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:     fa
  Freed heap region:     fd
  Stack left redzone:    f1
  Stack mid redzone:     f2
  Stack right redzone:   f3
  Stack after return:    f5
  Stack use after scope: f8
  Global redzone:        f9
  Global init order:     f6
  Poisoned by user:      f7
  Container overflow:    fc
  Array cookie:          ac
  Intra object redzone:  bb
  ASan internal:         fe
  Left alloca redzone:   ca
  Right alloca redzone:  cb
==280==ABORTING
zsh: abort ./vulkan_stream
I wonder what the advantage is over just reusing virtio-gpu here? Or something like a shared-memory device (virtio-pmem or virtio-mem)?
A block device will allow for a separate channel of communication that is much easier to reason about than trying to communicate over the RAM of a guest (MacOS hosts do not have access to ivshmem).
Could you compile with DEBUG=1 and upload the dump (along with the stdout)?
Yes, I can confirm it is simdjson iteration. I can give you some logs on this.
I would like to see the logs, in that case. It's possible that I'm accessing the fields in a different order than it was sent in, which is a failure mode of simdjson --- that shouldn't happen though.
Let me collect some and update here.
Here it is. Adding the following lines to Server.cpp:

std::cerr << "simd json iterate!" << std::endl;
auto line = std::string_view(input, input_size);
std::cerr << "input size " << input_size << " line:" << line << std::endl;
auto simdjson_line = simdjson::padded_string(line);
std::time_t result = std::time(nullptr);
std::cerr << "Time before iterate: " << std::asctime(std::localtime(&result));
auto doc = curr->simdparser.iterate(simdjson_line);
result = std::time(nullptr);
std::cerr << "The after iterate: " << std::asctime(std::localtime(&result));
I get:

input size 170168470 line:{"devicememory":5047288640,"mem":5101142016,"hashes":[],"lengths":[127626240],"starts":[0],"buffers":...,"unmap":false,"stream_type":0,"uuid":358172}
Time before iterate: Thu May 30 09:41:10 2024
The after iterate: Thu May 30 09:41:19 2024
The buffer is about 170MB, and it takes about 9 seconds to finish the iterate. Is this expected?
If we can avoid iterating large buffers, it would be much faster.
Thanks
https://patchew.org/QEMU/20240528103543.145412-1-sgarzare@redhat.com/
Yeah, that's the vhost-user porting work I was referring to earlier. However, even if it's merged, getting vhost-user-net to work on MacOS is still a non-trivial task (Googling also brings up vhost-net, though I'm not completely sure what that is).
Which part is missing? (I guess you mean eventfd?)
For vhost-net, it seems hard unless virtualization.framework support that.
Thanks
I wonder what the advantage is over just reusing virtio-gpu here? Or something like a shared-memory device (virtio-pmem or virtio-mem)?
A block device will allow for a separate channel of communication that is much easier to reason about than trying to communicate over the RAM of a guest (MacOS hosts do not have access to ivshmem).
Could you compile with DEBUG=1 and upload the dump (along with the stdout)?
See the attached typescript.txt.
Interesting --- the crash happens at a place where there shouldn't be a crash. I'll have to use a smaller model though for testing.
Interesting --- the crash happens at a place where there shouldn't be a crash. I'll have to use a smaller model though for testing.
I switched to a very tiny model which is 29MB: https://huggingface.co/ChristianAzinn/gte-small-gguf/blob/main/gte-small.Q5_0.gguf
But I get the same crash.
Thanks
Interesting --- on my machine, llama.cpp works fine (I compiled with vulkan support, verified that it's using my driver, and just ran the model). I don't get any crash on my end.
I should note that this isn't the first time a project runs on my side but not for the person who opened the issue (see #1).
However, after some more testing, I can get the same error to happen sometimes on the host (you didn't state whether the original error was on the server or the client) --- it is hard to debug, though, as it is inconsistent between runs.
However, after some more testing, I can get the same error to happen sometimes on the host (you didn't state whether the original error was on the server or the client) --- it is hard to debug, though, as it is inconsistent between runs.
The crash happens on the server side.
One interesting thing is that llama.cpp doesn't crash or report errors if I just run a model. The crashes/errors only happen when I run llama-bench.
Thanks
Interesting --- on my machine, llama.cpp works fine (I compiled with vulkan support, verified that it's using my driver, and just ran the model). I don't get any crash on my end.
Are you testing with llama-bench?
Thanks
I should note that this isn't the first time a project runs on my side but not for the person who opened the issue (see #1).
A possible reason is differing versions of the various layers.
Thanks
With llama-bench, I can successfully get to the first line:

Vulkan0: Apple M1 | uma: 1 | fp16: 1 | warp size: 32
| bert 33M Q5_0 | 26.78 MiB | 33.21 M | Vulkan | 99 | pp512 | 347.42 ± 7.18 |
After that, it hangs --- I can't tell whether this is a problem with the driver, as using llvmpipe exhibits the same behavior.
So the crash seems to be an out-of-bounds access. I hacked deserialize_struct() like this, and llama-bench can now finish with HEAD:
void deserialize_struct(boost::json::object& json, VkExtensionProperties& member){
    auto& extensionName_json = json["extensionName"];
    [&](){
        auto& arr_HbZAuVY = extensionName_json.get_array();
        // Clamp the element count so writes stay within extensionName[]
        int max_elements = arr_HbZAuVY.size();
        if (max_elements > VK_MAX_EXTENSION_NAME_SIZE - 1)
            max_elements = VK_MAX_EXTENSION_NAME_SIZE - 1;
        for(int QYXuVZG = 0; QYXuVZG < max_elements; QYXuVZG++){
            [&](){
Does this ring any bells?
Btw, for llama-bench, it's just that the other tests take a lot of time to run with vulkan-stream. E.g. I can finish llama-bench in about 1.5 hours:
# VK_ICD_FILENAMES=/home/devel/git/vulkan-stream/stream_icd.aarch64.json ./build/bin/llama-bench -m ../gte-small.Q5_0.gguf
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: Apple M1 | uma: 1 | fp16: 1 | warp size: 32
| bert 33M Q5_0 | 26.78 MiB | 33.21 M | Vulkan | 99 | pp512 | 132.39 ± 17.11 |
| bert 33M Q5_0 | 26.78 MiB | 33.21 M | Vulkan | 99 | tg128 | 0.25 ± 0.03 |
build: 271ff3fc (3020)
Thanks
Hmmm, that shouldn't matter --- otherwise, that would mean that MoltenVK is not up to spec.
Btw, I get the following errors from MoltenVK on the server:
[mvk-error] VK_ERROR_MEMORY_MAP_FAILED: Memory is not mapped. Call vkMapMemory() first.
DeviceMemory unmapping in progress...
[mvk-error] VK_ERROR_MEMORY_MAP_FAILED: Memory is not mapped. Call vkMapMemory() first.
[mvk-error] VK_ERROR_MEMORY_MAP_FAILED: Memory is not mapped. Call vkMapMemory() first.
DeviceMemory unmapping in progress...
DeviceMemory unmapping in progress...
Thanks
Yeah, you can ignore those --- I don't check whether the memory is mapped before unmapping.
I think I see what the problem is, but I'll have to check it out when I get home.
In any case, have you tried running without the address sanitizer?
I know you posted this already, but can you post the full log of the crash without the address sanitizer? I just want to verify it's the same bug.
I created a fix --- I'll push it when I have access to Internet again.
Can you confirm that HEAD fixes the issue?
I also fixed the issue involving VK_ERROR_MEMORY_MAP_FAILED: Memory is not mapped.
Can you confirm that HEAD fixes the issue?
Yes, I confirm it fixes the issue.
Thanks
I also fixed the issue involving VK_ERROR_MEMORY_MAP_FAILED: Memory is not mapped.
I can still see this on the server side.
Thanks
Sorry, I was using the wrong map --- try again now.
I'll close this issue now. Make a new one if you are still experiencing the same issue.
The server crashes when I try to run the llama.cpp bench.
Steps to reproduce:
1) git clone https://github.com/ggerganov/llama.cpp.git
2) build llama.cpp with Vulkan support according to https://github.com/ggerganov/llama.cpp
3) download qwen1_5-0_5b-chat-q2_k.gguf from https://huggingface.co/Qwen/Qwen1.5-0.5B-Chat-GGUF
4) enable the address sanitizer on the server side: FLAGS = FLAGS + ["-fsanitize=address"]
5) build the server and client
6) run llama-bench:
VK_ICD_FILENAMES=/home/devel/git/vulkan-stream/stream_icd.aarch64.json ./build/bin/llama-bench -m /home/devel/git/qwen1_5-0_5b-chat-q2_k.gguf -n 1
7) the sanitizer reports the crash on the server: