chengchen666 commented 5 months ago

Need to implement cudaHostAlloc and cudaFreeHost to support vLLM. Test case is in: https://github.com/QuarkContainer/Quark/commit/16bf3d2ec375b54aff6789257754dc2eff27df8c

To build: nvcc -cudart shared test_cudahostalloc.cpp -o test_cudahostalloc -lcuda

To Run: ./test_cudahostalloc 1024 1024

chengchen666 commented 5 months ago

Not high priority. This API is most likely called by NCCL rather than by vLLM directly: I could not find it in the vLLM source code, but it does appear in the NCCL source code. So once we finish the NCCL support, we might not need to support this API for now.

mehryar72 commented 5 months ago

Branch Merge Issue with mab_hostalloc

The mab_hostalloc branch is a merge of hostalloc with multithread and nccl. The throughput test that exercises cudaHostAlloc, test_cudahostalloc, runs successfully the first time. On the second attempt, however, the container crashes.

Log details appearing right after the crash:

[INFO] [0/60932323] unmap ptr is 4000000000, len is 1000
[INFO] [0/60932402] unmap ptr is 4000001000, len is 28f000
[INFO] [0/60932563] unmap ptr is 4000290000, len is 1a000
[INFO] [0/60932585] unmap ptr is 40002aa000, len is 2000
[INFO] [0/60932594] unmap ptr is 40002ac000, len is 3000
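When reading unmap records like the ones above, it can help to check mechanically whether the regions are back to back. The following is a minimal sketch, not part of Quark: the names `Unmap`, `parse_unmaps`, and `contiguous` are hypothetical, and it assumes that both `ptr` and `len` are printed in hexadecimal, which values such as `28f000` suggest.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// One "unmap ptr is <hex>, len is <hex>" record (hypothetical helper type).
struct Unmap {
    std::uint64_t ptr;
    std::uint64_t len;
};

// Extract every unmap record from a log string.
// Assumption: ptr and len are hexadecimal without a "0x" prefix.
std::vector<Unmap> parse_unmaps(const std::string& log) {
    std::vector<Unmap> out;
    const std::string key = "unmap ptr is ";
    std::size_t pos = 0;
    while ((pos = log.find(key, pos)) != std::string::npos) {
        pos += key.size();
        unsigned long long ptr = 0, len = 0;
        if (std::sscanf(log.c_str() + pos, "%llx, len is %llx", &ptr, &len) == 2) {
            out.push_back({ptr, len});
        }
    }
    return out;
}

// True when each region starts exactly where the previous one ended.
bool contiguous(const std::vector<Unmap>& v) {
    for (std::size_t i = 1; i < v.size(); ++i) {
        if (v[i - 1].ptr + v[i - 1].len != v[i].ptr) return false;
    }
    return true;
}
```

Feeding the five records above into `parse_unmaps` and then `contiguous` reports that they form one contiguous range starting at 0x4000000000. That only describes the layout of the unmapped regions; it does not by itself identify the cause of the crash.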

QuarkContainer commented 5 months ago

@mehryar72 Thank you! Could you please provide more detailed repro steps? It would also be great if you could attach the whole quark log.

mehryar72 commented 5 months ago

@QuarkContainer To reproduce: build Quark from the mab_hostalloc branch. Inside a container using the quark runtime, run the cudaHostAlloc throughput test:

LD_PRELOAD=/path_to_libcudaproxy/libcudaproxy.so ./test_cudahostalloc 1024 1024

The first run is successful. The second time, the container gets stuck. The Quark log is attached: quark_log.txt

QuarkContainer commented 5 months ago

@mehryar72 I tried to build the mab_hostalloc branch, but it fails with the following error. It looks like I need to install the NCCL library. Could you please update the steps to do that?

Compiling containerd-shim v0.3.0 (https://github.com/QuarkContainer/rust-extensions.git#b3ac82d9)
Compiling quark v0.6.0 (/home/brad/rust/Quark/qvisor)
error: linking with cc failed: exit status: 1
|
= note: LC_ALL="C" PATH="/home/brad/.rustup/toolchains/nightly-2023-12-11-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/bin:/home/brad/.pyenv/shims:/home/brad/.pyenv/bin:/home/brad/.cargo/bin:/home/brad/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/snap/bin:/usr/local/go/bin" VSLANG="1033" "cc" "-m64" "/tmp/rustc7pziLu/symbols.o" "/home/brad/rust/Quark/qvisor/../target/release/deps/quark-15c31bd88d58b28c.quark.ad02c6ded2946f8b-cgu.0.rcgu.o" "-Wl,--as-needed" "-L" "/home/brad/rust/Quark/qvisor/../target/release/deps" "-L" "/usr/local/cuda/lib64" "-L" "/usr/local/cuda/lib64/stubs" "-L" "/usr/local/cuda/targets/x86_64-linux/lib" "-L" "/usr/local/cuda/targets/x86_64-linux/lib/stubs" "-L" "/usr/local/cuda-12/lib64" "-L" "/usr/local/cuda-12/lib64/stubs" "-L" "/usr/local/cuda-12/targets/x86_64-linux/lib" "-L" "/usr/local/cuda-12/targets/x86_64-linux/lib/stubs" "-L" "/usr/local/cuda-12.3/lib64" "-L" "/usr/local/cuda-12.3/lib64/stubs" "-L" "/usr/local/cuda-12.3/targets/x86_64-linux/lib" "-L" "/usr/local/cuda-12.3/targets/x86_64-linux/lib/stubs" "-L" "/usr/local/cuda/lib64" "-L" "/usr/local/cuda/lib64/stubs" "-L" "/usr/local/cuda/targets/x86_64-linux/lib" "-L" "/usr/local/cuda/targets/x86_64-linux/lib/stubs" "-L" "/usr/local/cuda-12/lib64" "-L" "/usr/local/cuda-12/lib64/stubs" "-L" "/usr/local/cuda-12/targets/x86_64-linux/lib" "-L" "/usr/local/cuda-12/targets/x86_64-linux/lib/stubs" "-L" "/usr/local/cuda-12.3/lib64" "-L" "/usr/local/cuda-12.3/lib64/stubs" "-L" "/usr/local/cuda-12.3/targets/x86_64-linux/lib" "-L" "/usr/local/cuda-12.3/targets/x86_64-linux/lib/stubs" "-L" "/usr/local/cuda/lib64" "-L" "/usr/lib/x86_64-linux-gnu" "-L" "/usr/lib/x86_64-linux-gnu/stubs" "-L" "/home/brad/.rustup/toolchains/nightly-2023-12-11-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib" "-Wl,-Bdynamic" "-lnccl" "-lcuda" "-lcudart" "-lnvidia-ml" "-lcublas" "-lcublasLt" "-Wl,-Bstatic" "/tmp/rustc7pziLu/libcompiler_builtins-8ebeba8f78436673.rlib" "-Wl,-Bdynamic" "-lcuda" "-lcublas" "-lcuda" "-lcublasLt" "-lelf" "-lcudart" "-lc" "-lcap" "-lgcc_s" "-lutil" "-lrt" "-lpthread" "-lm" "-ldl" "-lc" "-Wl,--eh-frame-hdr" "-Wl,-z,noexecstack" "-L" "/home/brad/.rustup/toolchains/nightly-2023-12-11-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib" "-o" "/home/brad/rust/Quark/qvisor/../target/release/deps/quark-15c31bd88d58b28c" "-Wl,--gc-sections" "-pie" "-Wl,-z,relro,-z,now" "-Wl,-O1" "-nodefaultlibs"
= note: /usr/bin/ld: cannot find -lnccl: No such file or directory
collect2: error: ld returned 1 exit status

My test on the hostalloc branch passes, as shown below.

root@brad-MS-7D46:/var/log/quark# rm quark.log; docker run --net=host --cpus=0.8 -P --runtime=quark_d --mount type=bind,source="/home/brad/rust/Quark",target=/Quark --rm -it nvidia/cuda:12.1.0-devel-ubuntu22.04 /bin/bash

==========
== CUDA ==
==========

CUDA Version 12.1.0

This container image and its contents are governed by the NVIDIA Deep Learning Container License. By pulling and using the container, you accept the terms and conditions of this license: https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available. Use the NVIDIA Container Toolkit to start this container with GPU support; see https://docs.nvidia.com/datacenter/cloud-native/ .

DEPRECATION NOTICE!

THIS IMAGE IS DEPRECATED and is scheduled for DELETION. https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/support-policy.md

root@brad-MS-7D46:/# LD_PRELOAD=/Quark/target/release/libcudaproxy.so /Quark/test/c/test_cudahostalloc 1024 1024
Average throughput from host to device (cudaHostAlloc): 22.3543 GB/s
Average throughput from device to host (cudaHostAlloc): 24.2204 GB/s
root@brad-MS-7D46:/# LD_PRELOAD=/Quark/target/release/libcudaproxy.so /Quark/test/c/test_cudahostalloc 1024 1024
Average throughput from host to device (cudaHostAlloc): 22.2447 GB/s
Average throughput from device to host (cudaHostAlloc): 24.2431 GB/s

chengchen666 commented 5 months ago

Maybe we should make NCCL optional when building Quark, because not all CUDA users require NCCL.

QuarkContainer commented 5 months ago

When testing with the latest GPUVirtNew branch, the test code fails at a weird place.

root@brad-MS-7D46:/Quark/target/release# LD_PRELOAD=/Quark/target/release/libcudaproxy.so /Quark/test/c/test_cudahostalloc 1024 1024
failed to replaced dlopen call to libcudaproxy.so
CUDA error at test_cuda.cpp:104 - �ViY

QuarkContainer commented 5 months ago

@mehryar72 @chengchen666 With PR https://github.com/QuarkContainer/Quark/pull/1315, cudaHostAlloc works, as shown below.

root@brad-MS-7D46:/var/log/quark# rm quark.log; docker run --net=host --cpus=0.8 -P --runtime=quark_d --mount type=bind,source="/home/brad/rust/Quark",target=/Quark --rm -it nvidia/cuda:12.1.0-devel-ubuntu22.04 /bin/bash

==========
== CUDA ==
==========

CUDA Version 12.1.0

This container image and its contents are governed by the NVIDIA Deep Learning Container License. By pulling and using the container, you accept the terms and conditions of this license: https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available. Use the NVIDIA Container Toolkit to start this container with GPU support; see https://docs.nvidia.com/datacenter/cloud-native/ .

DEPRECATION NOTICE!

THIS IMAGE IS DEPRECATED and is scheduled for DELETION. https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/support-policy.md

root@brad-MS-7D46:/# LD_PRELOAD=/Quark/target/release/libcudaproxy.so /Quark/test/c/test_cudahostalloc 1024 1024
Average throughput from host to device (cudaHostAlloc): 22.3902 GB/s
Average throughput from device to host (cudaHostAlloc): 23.9117 GB/s
root@brad-MS-7D46:/# LD_PRELOAD=/Quark/target/release/libcudaproxy.so /Quark/test/c/test_cudahostalloc 1024 1024
Average throughput from host to device (cudaHostAlloc): 22.31 GB/s
Average throughput from device to host (cudaHostAlloc): 23.875 GB/s

QuarkContainer / Quark

CUDA API support: cudaHostAlloc and cudaFreeHost #1304
