Open chengchen666 opened 5 months ago
Not high priority. It is very likely that this API is called by NCCL, so once we finish the NCCL support we might not need to support this API separately for now. I say this because I could not find this API in the vLLM source code, but I did find it in the NCCL source code.
Branch Merge Issue with mab_hostalloc
The branch named mab_hostalloc is a merge of hostalloc with multithread and nccl. The throughput test that exercises cudaHostAlloc, test_cudahostalloc, executes successfully on its first run. However, on a second attempt, the container crashes.
Log details appearing right after the crash:
[INFO] [0/60932323] unmap ptr is 4000000000, len is 1000
[INFO] [0/60932402] unmap ptr is 4000001000, len is 28f000
[INFO] [0/60932563] unmap ptr is 4000290000, len is 1a000
[INFO] [0/60932585] unmap ptr is 40002aa000, len is 2000
[INFO] [0/60932594] unmap ptr is 40002ac000, len is 3000
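For readers following the log: if the five entries are read as hexadecimal byte addresses and lengths (an assumption; the log does not state the base, although digits such as "f" and "a" suggest it), each range begins exactly where the previous one ends. The sketch below checks that arithmetic. The type and function names are illustrative only and do not come from the Quark source.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One "unmap ptr is X, len is Y" log entry.
// Assumption: both values are hexadecimal byte counts.
struct UnmapRange {
    std::uint64_t ptr;
    std::uint64_t len;
};

// The five entries copied from the log in this issue.
inline std::vector<UnmapRange> logged_ranges() {
    return {
        {0x4000000000ULL, 0x1000ULL},
        {0x4000001000ULL, 0x28f000ULL},
        {0x4000290000ULL, 0x1a000ULL},
        {0x40002aa000ULL, 0x2000ULL},
        {0x40002ac000ULL, 0x3000ULL},
    };
}

// True when every range starts exactly where the previous one ends.
inline bool is_contiguous(const std::vector<UnmapRange>& ranges) {
    for (std::size_t i = 1; i < ranges.size(); ++i) {
        if (ranges[i - 1].ptr + ranges[i - 1].len != ranges[i].ptr) {
            return false;
        }
    }
    return true;
}

// Sum of all range lengths in bytes.
inline std::uint64_t total_len(const std::vector<UnmapRange>& ranges) {
    std::uint64_t sum = 0;
    for (const auto& r : ranges) {
        sum += r.len;
    }
    return sum;
}
```

Under that reading, is_contiguous(logged_ranges()) holds and total_len(logged_ranges()) is 0x2af000 bytes, so the five unmaps cover one continuous region starting at 0x4000000000. This only describes the log; it does not by itself identify the cause of the failure.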
@mehryar72 Thank you! Would you please provide more detailed repro steps? It would also be great if you could attach the whole Quark log.
@QuarkContainer
How to replicate:
Build Quark from the mab_hostalloc branch.
Inside a container with the quark runtime, run the cudaHostAlloc throughput test:
LD_PRELOAD=/path_to_libcudaproxy/libcudaproxy.so ./test_cudahostalloc 1024 1024
The first run is successful. The second time, the container gets stuck.
The Quark log is attached: quark_log.txt
@mehryar72 I tried to build the mab_hostalloc branch, but it fails with the following error. It looks like I need to install the NCCL library. Could you please update the steps to do that?
Compiling containerd-shim v0.3.0 (https://github.com/QuarkContainer/rust-extensions.git#b3ac82d9)
Compiling quark v0.6.0 (/home/brad/rust/Quark/qvisor)
error: linking with `cc` failed: exit status: 1
|
= note: LC_ALL="C" PATH="/home/brad/.rustup/toolchains/nightly-2023-12-11-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/bin:/home/brad/.pyenv/shims:/home/brad/.pyenv/bin:/home/brad/.cargo/bin:/home/brad/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/snap/bin:/usr/local/go/bin" VSLANG="1033" "cc" "-m64" "/tmp/rustc7pziLu/symbols.o" "/home/brad/rust/Quark/qvisor/../target/release/deps/quark-15c31bd88d58b28c.quark.ad02c6ded2946f8b-cgu.0.rcgu.o" "-Wl,--as-needed" "-L" "/home/brad/rust/Quark/qvisor/../target/release/deps" "-L" "/usr/local/cuda/lib64" "-L" "/usr/local/cuda/lib64/stubs" "-L" "/usr/local/cuda/targets/x86_64-linux/lib" "-L" "/usr/local/cuda/targets/x86_64-linux/lib/stubs" "-L" "/usr/local/cuda-12/lib64" "-L" "/usr/local/cuda-12/lib64/stubs" "-L" "/usr/local/cuda-12/targets/x86_64-linux/lib" "-L" "/usr/local/cuda-12/targets/x86_64-linux/lib/stubs" "-L" "/usr/local/cuda-12.3/lib64" "-L" "/usr/local/cuda-12.3/lib64/stubs" "-L" "/usr/local/cuda-12.3/targets/x86_64-linux/lib" "-L" "/usr/local/cuda-12.3/targets/x86_64-linux/lib/stubs" "-L" "/usr/local/cuda/lib64" "-L" "/usr/local/cuda/lib64/stubs" "-L" "/usr/local/cuda/targets/x86_64-linux/lib" "-L" "/usr/local/cuda/targets/x86_64-linux/lib/stubs" "-L" "/usr/local/cuda-12/lib64" "-L" "/usr/local/cuda-12/lib64/stubs" "-L" "/usr/local/cuda-12/targets/x86_64-linux/lib" "-L" "/usr/local/cuda-12/targets/x86_64-linux/lib/stubs" "-L" "/usr/local/cuda-12.3/lib64" "-L" "/usr/local/cuda-12.3/lib64/stubs" "-L" "/usr/local/cuda-12.3/targets/x86_64-linux/lib" "-L" "/usr/local/cuda-12.3/targets/x86_64-linux/lib/stubs" "-L" "/usr/local/cuda/lib64" "-L" "/usr/lib/x86_64-linux-gnu" "-L" "/usr/lib/x86_64-linux-gnu/stubs" "-L" "/home/brad/.rustup/toolchains/nightly-2023-12-11-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib" "-Wl,-Bdynamic" "-lnccl" "-lcuda" "-lcudart" "-lnvidia-ml" "-lcublas" "-lcublasLt" "-Wl,-Bstatic" 
"/tmp/rustc7pziLu/libcompiler_builtins-8ebeba8f78436673.rlib" "-Wl,-Bdynamic" "-lcuda" "-lcublas" "-lcuda" "-lcublasLt" "-lelf" "-lcudart" "-lc" "-lcap" "-lgcc_s" "-lutil" "-lrt" "-lpthread" "-lm" "-ldl" "-lc" "-Wl,--eh-frame-hdr" "-Wl,-z,noexecstack" "-L" "/home/brad/.rustup/toolchains/nightly-2023-12-11-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib" "-o" "/home/brad/rust/Quark/qvisor/../target/release/deps/quark-15c31bd88d58b28c" "-Wl,--gc-sections" "-pie" "-Wl,-z,relro,-z,now" "-Wl,-O1" "-nodefaultlibs"
= note: /usr/bin/ld: cannot find -lnccl: No such file or directory
collect2: error: ld returned 1 exit status
My test on the hostalloc branch passes, as shown below.
root@brad-MS-7D46:/var/log/quark# rm quark.log; docker run --net=host --cpus=0.8 -P --runtime=quark_d --mount type=bind,source="/home/brad/rust/Quark",target=/Quark --rm -it nvidia/cuda:12.1.0-devel-ubuntu22.04 /bin/bash
CUDA Version 12.1.0
Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License. By pulling and using the container, you accept the terms and conditions of this license: https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available. Use the NVIDIA Container Toolkit to start this container with GPU support; see https://docs.nvidia.com/datacenter/cloud-native/ .
DEPRECATION NOTICE!
THIS IMAGE IS DEPRECATED and is scheduled for DELETION. https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/support-policy.md
root@brad-MS-7D46:/# LD_PRELOAD=/Quark/target/release/libcudaproxy.so /Quark/test/c/test_cudahostalloc 1024 1024
Average throughput from host to device (cudaHostAlloc): 22.3543 GB/s
Average throughput from device to host (cudaHostAlloc): 24.2204 GB/s
root@brad-MS-7D46:/# LD_PRELOAD=/Quark/target/release/libcudaproxy.so /Quark/test/c/test_cudahostalloc 1024 1024
Average throughput from host to device (cudaHostAlloc): 22.2447 GB/s
Average throughput from device to host (cudaHostAlloc): 24.2431 GB/s
Maybe we should make NCCL optional when building Quark, because not all CUDA users require NCCL.
When tested with the latest GPUVirtNew branch, the test code fails at a weird place.
root@brad-MS-7D46:/Quark/target/release# LD_PRELOAD=/Quark/target/release/libcudaproxy.so /Quark/test/c/test_cudahostalloc 1024 1024
failed to replaced dlopen call to libcudaproxy.so
CUDA error at test_cuda.cpp:104 - �ViY
@mehryar72 @chengchen666 With PR https://github.com/QuarkContainer/Quark/pull/1315, cudaHostAlloc works, as shown below.
root@brad-MS-7D46:/var/log/quark# rm quark.log; docker run --net=host --cpus=0.8 -P --runtime=quark_d --mount type=bind,source="/home/brad/rust/Quark",target=/Quark --rm -it nvidia/cuda:12.1.0-devel-ubuntu22.04 /bin/bash
root@brad-MS-7D46:/# LD_PRELOAD=/Quark/target/release/libcudaproxy.so /Quark/test/c/test_cudahostalloc 1024 1024
Average throughput from host to device (cudaHostAlloc): 22.3902 GB/s
Average throughput from device to host (cudaHostAlloc): 23.9117 GB/s
root@brad-MS-7D46:/# LD_PRELOAD=/Quark/target/release/libcudaproxy.so /Quark/test/c/test_cudahostalloc 1024 1024
Average throughput from host to device (cudaHostAlloc): 22.31 GB/s
Average throughput from device to host (cudaHostAlloc): 23.875 GB/s
We need to implement cudaHostAlloc and cudaFreeHost to support vLLM. The test case is in: https://github.com/QuarkContainer/Quark/commit/16bf3d2ec375b54aff6789257754dc2eff27df8c
To build:
nvcc -cudart shared test_cudahostalloc.cpp -o test_cudahostalloc -lcuda
To Run:
./test_cudahostalloc 1024 1024