TabbyML / tabby

Self-hosted AI coding assistant
https://tabby.tabbyml.com/

Can't seem to get Tabby to run on GPU #2811

Open BurnyLlama opened 1 month ago

BurnyLlama commented 1 month ago

Describe the bug

(This might be me doing something wrong, and not a bug!)

I can't seem to get Tabby to run using my GPU (Radeon 6600 XT). Neither with ROCm (which I believe is unsupported for my device) nor with Vulkan, which I believe should be supported?

Whenever I run tabby (serve --model DeepseekCoder-1.3B --chat-model Qwen2-1.5B-Instruct --device vulkan), go into the Web UI, and ask something, I check my CPU and GPU usage (using btop and amdgpu_top respectively): CPU usage spikes while the GPU shows almost no activity. (This happens both when running v0.15.0-r2 and when compiling v0.15.0-r3 myself.)
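As a sanity check before blaming Tabby, the Vulkan loader's view of the card can be confirmed first (a minimal sketch; the grep pattern just filters the summary output):

```shell
# List the devices the Vulkan loader exposes; both the AMDVLK (proprietary)
# and RADV (Mesa) drivers should show the RX 6600 XT here.
vulkaninfo --summary | grep -i 'deviceName'
```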

If I try to use v0.14.0 (same options) I instead get this:

2024-08-08T11:19:34.328360Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: warning: see main README.md for information on enabling GPU BLAS support

Information about your version

See above.

Information about your GPU

From vulkaninfo:

Devices:
========
GPU0:
    apiVersion         = 1.3.270
    driverVersion      = 2.0.294
    vendorID           = 0x1002
    deviceID           = 0x73ff
    deviceType         = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
    deviceName         = AMD Radeon RX 6600 XT
    driverID           = DRIVER_ID_AMD_PROPRIETARY
    driverName         = AMD proprietary driver
    driverInfo         = (AMD proprietary shader compiler)
    conformanceVersion = 1.3.3.1
    deviceUUID         = 00000000-2800-0000-0000-000000000000
    driverUUID         = 414d442d-4c49-4e55-582d-445256000000
GPU1:
    apiVersion         = 1.3.278
    driverVersion      = 24.1.5
    vendorID           = 0x1002
    deviceID           = 0x73ff
    deviceType         = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
    deviceName         = AMD Radeon RX 6600 XT (RADV NAVI23)
    driverID           = DRIVER_ID_MESA_RADV
    driverName         = radv
    driverInfo         = Mesa 24.1.5
    conformanceVersion = 1.3.0.0
    deviceUUID         = 00000000-2800-0000-0000-000000000000
    driverUUID         = 414d442d-4d45-5341-2d44-525600000000

From rocminfo (if it helps):

=====================
HSA System Attributes
=====================
Runtime Version:         1.1
Runtime Ext Version:     1.4
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========
HSA Agents
==========
*******
Agent 1
*******
  Name:                    AMD Ryzen 5 3600 6-Core Processor
  Uuid:                    CPU-XX
  Marketing Name:          AMD Ryzen 5 3600 6-Core Processor
  Vendor Name:             CPU
  Feature:                 None specified
  Profile:                 FULL_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        0(0x0)
  Queue Min Size:          0(0x0)
  Queue Max Size:          0(0x0)
  Queue Type:              MULTI
  Node:                    0
  Device Type:             CPU
  Cache Info:
    L1:                      32768(0x8000) KB
  Chip ID:                 0(0x0)
  ASIC Revision:           0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   3600
  BDFID:                   0
  Internal Node ID:        0
  Compute Unit:            12
  SIMDs per CU:            0
  Shader Engines:          0
  Shader Arrs. per Eng.:   0
  WatchPts on Addr. Ranges:1
  Features:                None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    32763224(0x1f3ed58) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 2
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    32763224(0x1f3ed58) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 3
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    32763224(0x1f3ed58) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
  ISA Info:
*******
Agent 2
*******
  Name:                    gfx1032
  Uuid:                    GPU-XX
  Marketing Name:          AMD Radeon RX 6600 XT
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    1
  Device Type:             GPU
  Cache Info:
    L1:                      16(0x10) KB
    L2:                      2048(0x800) KB
    L3:                      32768(0x8000) KB
  Chip ID:                 29695(0x73ff)
  ASIC Revision:           0(0x0)
  Cacheline Size:          128(0x80)
  Max Clock Freq. (MHz):   2900
  BDFID:                   10240
  Internal Node ID:        1
  Compute Unit:            32
  SIMDs per CU:            2
  Shader Engines:          2
  Shader Arrs. per Eng.:   2
  WatchPts on Addr. Ranges:4
  Coherent Host Access:    FALSE
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          32(0x20)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    1024(0x400)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 118
  SDMA engine uCode::      76
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    8372224(0x7fc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    8372224(0x7fc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 3
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Recommended Granule:0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx1032
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*** Done ***

Additional context

Full terminal output when starting Tabby (v0.15.0-r3, compiled myself):

✓ $ cargo run --features vulkan --release serve --model DeepseekCoder-1.3B --chat-model Qwen2-1.5B-Instruct --device vulkan --parallelism 2
warning: function `tracing_context` is never used
  --> ee/tabby-webserver/src/hub.rs:15:4
   |
15 | fn tracing_context() -> tarpc::context::Context {
   |    ^^^^^^^^^^^^^^^
   |
   = note: `#[warn(dead_code)]` on by default

warning: `tabby-webserver` (lib) generated 1 warning
warning: function `chat_completions_utoipa` is never used
  --> crates/tabby/src/routes/chat.rs:29:14
   |
29 | pub async fn chat_completions_utoipa(_request: Json<serde_json::Value>) -> Statu...
   |              ^^^^^^^^^^^^^^^^^^^^^^^
   |
   = note: `#[warn(dead_code)]` on by default

warning: `tabby` (bin "tabby") generated 1 warning
    Finished release [optimized] target(s) in 0.31s
     Running `target/release/tabby serve --model DeepseekCoder-1.3B --chat-model Qwen2-1.5B-Instruct --device vulkan --parallelism 2`
2024-08-08T11:27:47.846355Z DEBUG tabby_common::config: crates/tabby-common/src/config.rs:35: Config file /home/USER/.tabby/config.toml not found, apply default configuration
2024-08-08T11:27:48.439867Z DEBUG tabby::serve: crates/tabby/src/serve.rs:411: Starting server, this might take a few minutes...
2024-08-08T11:27:48.494017Z DEBUG llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:124: Waiting for llama-server <embedding> to start...
2024-08-08T11:27:48.634321Z DEBUG llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:132: llama-server <embedding> started successfully
2024-08-08T11:27:48.649300Z DEBUG tabby_common::config: crates/tabby-common/src/config.rs:35: Config file /home/USER/.tabby/config.toml not found, apply default configuration
2024-08-08T11:27:48.649713Z DEBUG tabby::services::tantivy: crates/tabby/src/services/tantivy.rs:33: Index is ready, enabling search...
2024-08-08T11:27:48.948439Z DEBUG llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:124: Waiting for llama-server <chat> to start...
2024-08-08T11:27:49.463618Z DEBUG llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:132: llama-server <chat> started successfully
2024-08-08T11:27:49.659853Z DEBUG llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:124: Waiting for llama-server <completion> to start...
2024-08-08T11:27:50.422822Z DEBUG llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:132: llama-server <completion> started successfully

████████╗ █████╗ ██████╗ ██████╗ ██╗   ██╗
╚══██╔══╝██╔══██╗██╔══██╗██╔══██╗╚██╗ ██╔╝
   ██║   ███████║██████╔╝██████╔╝ ╚████╔╝
   ██║   ██╔══██║██╔══██╗██╔══██╗  ╚██╔╝
   ██║   ██║  ██║██████╔╝██████╔╝   ██║
   ╚═╝   ╚═╝  ╚═╝╚═════╝ ╚═════╝    ╚═╝

📄 Version 0.15.0-rc.3
🚀 Listening at 0.0.0.0:8080
VladislavNekto commented 1 month ago

https://github.com/TabbyML/tabby/issues/2810

VladislavNekto commented 1 month ago

If I try to use v0.14.0 (same options) I instead get this:

I got that too, but only with the prebuilt Tabby; when I compile 0.14.0 myself, it works.

By the way, ROCm will work on the RX 6600.

Tabby only supports the use of a single GPU. To utilize multiple GPUs, you can launch multiple Tabby instances and set CUDA_VISIBLE_DEVICES (for CUDA) or HIP_VISIBLE_DEVICES (for ROCm) accordingly. If a similar GPU is supported by ROCm, you can use the HSA_OVERRIDE_GFX_VERSION variable to target it: for example, set it to 10.3.0 for RDNA2 and 11.0.0 for RDNA3.
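Sketched as shell (a hypothetical two-GPU setup; the port numbers and device indices are illustrative assumptions, not tested on this hardware):

```shell
# One Tabby instance per GPU; HIP_VISIBLE_DEVICES picks the device, and
# HSA_OVERRIDE_GFX_VERSION=10.3.0 makes ROCm treat an RDNA2 card (e.g. gfx1032)
# as the officially supported gfx1030.
HIP_VISIBLE_DEVICES=0 HSA_OVERRIDE_GFX_VERSION=10.3.0 \
  tabby serve --model DeepseekCoder-1.3B --device rocm --port 8080 &
HIP_VISIBLE_DEVICES=1 HSA_OVERRIDE_GFX_VERSION=10.3.0 \
  tabby serve --model DeepseekCoder-1.3B --device rocm --port 8081 &
```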

BurnyLlama commented 1 month ago

Right! I never tried to build v0.14.0 myself. I will see if I have time to try that later. If so, I will come back with the results.

BurnyLlama commented 1 month ago

I realized I didn't compile for ROCm before, only for Vulkan. When I try to compile with --features=rocm:

error: failed to run custom build command for `llama-cpp-server v0.14.0 (/tmp/tabby-src/crates/llama-cpp-server)`

Caused by:
  process didn't exit successfully: `/tmp/tabby-src/target/release/build/llama-cpp-server-1b3738a0281592c2/build-script-build` (exit status: 101)
  --- stdout
  CMAKE_TOOLCHAIN_FILE_x86_64-unknown-linux-gnu = None
  CMAKE_TOOLCHAIN_FILE_x86_64_unknown_linux_gnu = None
  HOST_CMAKE_TOOLCHAIN_FILE = None
  CMAKE_TOOLCHAIN_FILE = None
  CMAKE_GENERATOR_x86_64-unknown-linux-gnu = None
  CMAKE_GENERATOR_x86_64_unknown_linux_gnu = None
  HOST_CMAKE_GENERATOR = None
  CMAKE_GENERATOR = None
  CMAKE_PREFIX_PATH_x86_64-unknown-linux-gnu = None
  CMAKE_PREFIX_PATH_x86_64_unknown_linux_gnu = None
  HOST_CMAKE_PREFIX_PATH = None
  CMAKE_PREFIX_PATH = None
  CMAKE_x86_64-unknown-linux-gnu = None
  CMAKE_x86_64_unknown_linux_gnu = None
  HOST_CMAKE = None
  CMAKE = None
  running: cd "/tmp/tabby-src/target/release/build/llama-cpp-server-40e8fadd415341a4/out/build" && CMAKE_PREFIX_PATH="" "cmake" "/tmp/tabby-src/crates/llama-cpp-server/./llama.cpp" "-DLLAMA_NATIVE=OFF" "-DBUILD_SHARED_LIBS=OFF" "-DINS_ENB=ON" "-DLLAMA_HIPBLAS=ON" "-DCMAKE_C_COMPILER=/opt/rocm/llvm/bin/clang" "-DCMAKE_CXX_COMPILER=/opt/rocm/llvm/bin/clang++" "-DAMDGPU_TARGETS=gfx803;gfx900;gfx906:xnack-;gfx908:xnack-;gfx90a:xnack+;gfx90a:xnack-;gfx940;gfx941;gfx942;gfx1010;gfx1012;gfx1030;gfx1031;gfx1100;gfx1101;gfx1102;gfx1103" "-DCMAKE_INSTALL_PREFIX=/tmp/tabby-src/target/release/build/llama-cpp-server-40e8fadd415341a4/out" "-DCMAKE_C_FLAGS= -ffunction-sections -fdata-sections -fPIC -m64" "-DCMAKE_CXX_FLAGS= -ffunction-sections -fdata-sections -fPIC -m64" "-DCMAKE_ASM_FLAGS= -ffunction-sections -fdata-sections -fPIC -m64" "-DCMAKE_ASM_COMPILER=/usr/bin/cc" "-DCMAKE_BUILD_TYPE=Release"
  -- The C compiler identification is unknown
  -- The CXX compiler identification is unknown
  -- Configuring incomplete, errors occurred!

  --- stderr
  CMake Error at CMakeLists.txt:2 (project):
    The CMAKE_C_COMPILER:

      /opt/rocm/llvm/bin/clang

    is not a full path to an existing compiler tool.

    Tell CMake where to find the compiler by setting either the environment
    variable "CC" or the CMake cache entry CMAKE_C_COMPILER to the full path to
    the compiler, or to the compiler name if it is in the PATH.

  CMake Error at CMakeLists.txt:2 (project):
    The CMAKE_CXX_COMPILER:

      /opt/rocm/llvm/bin/clang++

    is not a full path to an existing compiler tool.

    Tell CMake where to find the compiler by setting either the environment
    variable "CXX" or the CMake cache entry CMAKE_CXX_COMPILER to the full path
    to the compiler, or to the compiler name if it is in the PATH.

  thread 'main' panicked at /home/USER/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cmake-0.1.50/src/lib.rs:1098:5:

  command did not execute successfully, got: exit status: 1

  build script failed, must exit now
  note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

I use Gentoo and I should (as far as I know) have ROCm installed, but maybe I am missing something... I'll have to look into this more.

(This happens both on v0.14.0 and v0.15.0-r3...)

BurnyLlama commented 1 month ago

A bit of a hacky solution that at least gets Tabby to compile:

sudo mkdir /opt/rocm
sudo ln -sv /usr/lib/llvm/18 /opt/rocm/llvm

However, aside from chat responses seeming slower to generate, there's no difference. It still seems to run on the CPU when using:

HSA_OVERRIDE_GFX_VERSION=10.3.0 cargo run --features rocm --release serve --model DeepseekCoder-1.3B --chat-model Qwen2-1.5B-Instruct --device rocm

I also tried adding HCC_AMDGPU_TARGET=gfx1032 to specify using my GPU and not my CPU. I don't know if that is how I am supposed to use that environment variable, but it did not work.
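One way to see what ISA the ROCm runtime actually reports before picking an override (a sketch; the grep simply pulls the gfx identifiers out of the rocminfo dump shown above):

```shell
# List the gfx ISAs rocminfo reports; gfx1032 here means the runtime sees the
# RX 6600 XT, and HSA_OVERRIDE_GFX_VERSION=10.3.0 then maps it onto gfx1030
# kernels at runtime. HCC_AMDGPU_TARGET, by contrast, is a compile-time
# target list, so setting it when launching a prebuilt binary has no effect.
rocminfo | grep -o 'gfx[0-9a-f]*' | sort -u
```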

richard-jfc commented 1 month ago

The fix for ROCm is merged; the Vulkan fix is probably similar (I haven't tested it): https://github.com/TabbyML/tabby/issues/2810#issuecomment-2283356626