TabbyML / tabby

Self-hosted AI coding assistant
https://tabbyml.com

llama-server of vulkan backend crashes #3313

Open Vlod-github opened 3 weeks ago

Vlod-github commented 3 weeks ago

After downloading the models, running .\tabby.exe serve --model StarCoder-1B --chat-model Qwen2-1.5B-Instruct crashes llama-server.exe.

Tabby versions: tabby v0.18.0 and tabby v0.19.0-rc.1, with both the tabby_x86_64-windows-msvc and tabby_x86_64-windows-msvc-vulkan builds. It turns out this doesn't work even on the CPU build.

Environment: Ryzen 5 3500U with Vega 8, Windows 10

Furthermore, I loaded these GGUF models into GPT4All and they work, so the issue is with the backend. Here is the output that Tabby produces repeatedly:

.\tabby.exe serve --model StarCoder-1B --chat-model Qwen2-1.5B-Instruct
⠦     2.911 s   Starting...2024-10-24T08:12:44.082941Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:98: llama-server <embedding> exited with status code -1073741819, args: `Command { std: "C:\\Portable\\tabby_x86_64-windows-msvc\\windows-msvc-18\\llama-server.exe" "-m" "C:\\Users\\Professional\\.tabby\\models\\TabbyML\\Nomic-Embed-Text\\ggml\\model.gguf" "--cont-batching" "--port" "30888" "-np" "1" "--log-disable" "--ctx-size" "4096" "-ngl" "9999" "--embedding" "--ubatch-size" "4096", kill_on_drop: true }`
2024-10-24T08:12:44.085157Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: warning: not compiled with GPU offload support, --gpu-layers option will be ignored
2024-10-24T08:12:44.086307Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: warning: see main README.md for information on enabling GPU BLAS support
2024-10-24T08:12:44.087278Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: loaded meta data with 23 key-value pairs and 112 tensors from C:\Users\Professional\.tabby\models\TabbyML\Nomic-Embed-Text\ggml\model.gguf (version GGUF V3 (latest))
2024-10-24T08:12:44.088395Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
2024-10-24T08:12:44.089398Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv   0:                       general.architecture str              = nomic-bert
2024-10-24T08:12:44.091185Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv   1:                               general.name str              = nomic-embed-text-v1.5
2024-10-24T08:12:44.093033Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv   2:                     nomic-bert.block_count u32              = 12
2024-10-24T08:12:44.094020Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv   3:                  nomic-bert.context_length u32              = 2048
2024-10-24T08:12:44.095077Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv   4:                nomic-bert.embedding_length u32              = 768
2024-10-24T08:12:44.096397Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv   5:             nomic-bert.feed_forward_length u32              = 3072
2024-10-24T08:12:44.097685Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv   6:            nomic-bert.attention.head_count u32              = 12
2024-10-24T08:12:44.098732Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv   7:    nomic-bert.attention.layer_norm_epsilon f32              = 0.000000
2024-10-24T08:12:44.099684Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv   8:                          general.file_type u32              = 7
2024-10-24T08:12:44.102463Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv   9:                nomic-bert.attention.causal bool             = false
2024-10-24T08:12:44.103693Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv  10:                    nomic-bert.pooling_type u32              = 1
2024-10-24T08:12:44.105092Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv  11:                  nomic-bert.rope.freq_base f32              = 1000.000000
2024-10-24T08:12:44.106800Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv  12:            tokenizer.ggml.token_type_count u32              = 2
2024-10-24T08:12:44.107875Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv  13:                tokenizer.ggml.bos_token_id u32              = 101
2024-10-24T08:12:44.108943Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv  14:                tokenizer.ggml.eos_token_id u32              = 102
2024-10-24T08:12:44.109962Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = bert
2024-10-24T08:12:44.111906Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,30522]   = ["[PAD]", "[unused0]", "[unused1]", "...
2024-10-24T08:12:44.114396Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv  17:                      tokenizer.ggml.scores arr[f32,30522]   = [-1000.000000, -1000.000000, -1000.00...
2024-10-24T08:12:44.115595Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,30522]   = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
2024-10-24T08:12:44.116595Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 100
2024-10-24T08:12:44.117684Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv  20:          tokenizer.ggml.seperator_token_id u32              = 102
2024-10-24T08:12:44.119286Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 0
2024-10-24T08:12:44.121779Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv  22:               general.quantization_version u32              = 2
2024-10-24T08:12:44.122975Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - type  f32:   51 tensors
2024-10-24T08:12:44.124747Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - type q8_0:   61 tensors
2024-10-24T08:12:44.126336Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_vocab: special tokens cache size = 5
2024-10-24T08:12:44.127402Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_vocab: token to piece cache size = 0.2032 MB
2024-10-24T08:12:44.130332Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: format           = GGUF V3 (latest)
⠧     2.992 s   Starting...2024-10-24T08:12:44.133861Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: arch             = nomic-bert
2024-10-24T08:12:44.135653Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: vocab type       = WPM
2024-10-24T08:12:44.137969Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_vocab          = 30522
2024-10-24T08:12:44.139536Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_merges         = 0
2024-10-24T08:12:44.140560Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: vocab_only       = 0
2024-10-24T08:12:44.142767Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_ctx_train      = 2048
2024-10-24T08:12:44.143693Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_embd           = 768
2024-10-24T08:12:44.144609Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_layer          = 12
2024-10-24T08:12:44.146168Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_head           = 12
2024-10-24T08:12:44.148556Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_head_kv        = 12
2024-10-24T08:12:44.149948Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_rot            = 64
2024-10-24T08:12:44.150964Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_swa            = 0
2024-10-24T08:12:44.152193Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_embd_head_k    = 64
2024-10-24T08:12:44.153282Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_embd_head_v    = 64
2024-10-24T08:12:44.154359Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_gqa            = 1
2024-10-24T08:12:44.155342Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_embd_k_gqa     = 768
2024-10-24T08:12:44.158536Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_embd_v_gqa     = 768
2024-10-24T08:12:44.159785Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: f_norm_eps       = 1.0e-12
2024-10-24T08:12:44.161030Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
2024-10-24T08:12:44.162013Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
2024-10-24T08:12:44.163205Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
2024-10-24T08:12:44.164591Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: f_logit_scale    = 0.0e+00
2024-10-24T08:12:44.165552Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_ff             = 3072
2024-10-24T08:12:44.168462Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_expert         = 0
2024-10-24T08:12:44.170404Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_expert_used    = 0
2024-10-24T08:12:44.171371Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: causal attn      = 0
2024-10-24T08:12:44.172462Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: pooling type     = 1
2024-10-24T08:12:44.173364Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: rope type        = 2
2024-10-24T08:12:44.174266Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: rope scaling     = linear
2024-10-24T08:12:44.178215Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: freq_base_train  = 1000.0
2024-10-24T08:12:44.180089Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: freq_scale_train = 1
2024-10-24T08:12:44.181262Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: n_ctx_orig_yarn  = 2048
2024-10-24T08:12:44.182262Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: rope_finetuned   = unknown
2024-10-24T08:12:44.183277Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: ssm_d_conv       = 0
2024-10-24T08:12:44.184394Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: ssm_d_inner      = 0
2024-10-24T08:12:44.185514Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: ssm_d_state      = 0
2024-10-24T08:12:44.188994Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: ssm_dt_rank      = 0
2024-10-24T08:12:44.190066Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: model type       = 137M
2024-10-24T08:12:44.191004Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: model ftype      = Q8_0
2024-10-24T08:12:44.191849Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: model params     = 136.73 M
2024-10-24T08:12:44.192877Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: model size       = 138.65 MiB (8.51 BPW)
2024-10-24T08:12:44.194017Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: general.name     = nomic-embed-text-v1.5
2024-10-24T08:12:44.195117Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: BOS token        = 101 '[CLS]'
2024-10-24T08:12:44.196205Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: EOS token        = 102 '[SEP]'
2024-10-24T08:12:44.199236Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: UNK token        = 100 '[UNK]'
2024-10-24T08:12:44.200363Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: SEP token        = 102 '[SEP]'
2024-10-24T08:12:44.201409Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: PAD token        = 0 '[PAD]'
2024-10-24T08:12:44.204156Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: CLS token        = 101 '[CLS]'
2024-10-24T08:12:44.205115Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: MASK token       = 103 '[MASK]'
2024-10-24T08:12:44.208855Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: LF token         = 0 '[PAD]'
2024-10-24T08:12:44.209985Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_print_meta: max token length = 21
2024-10-24T08:12:44.210865Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_tensors: ggml ctx size =    0.05 MiB
2024-10-24T08:12:44.211719Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llm_load_tensors:        CPU buffer size =   138.65 MiB
2024-10-24T08:12:44.212560Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: .......................................................
2024-10-24T08:12:44.213400Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_new_context_with_model: n_ctx      = 4096
2024-10-24T08:12:44.214336Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_new_context_with_model: n_batch    = 2048
⠇     3.074 s   Starting...2024-10-24T08:12:44
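
For reference, exit status -1073741819 is 0xC0000005, i.e. STATUS_ACCESS_VIOLATION: llama-server.exe is segfaulting while loading the embedding model. A minimal way to reproduce the crash without Tabby in the loop (a sketch reusing the exact arguments from the supervisor log above, minus --log-disable so llama-server prints its own diagnostics) is to launch the bundled binary directly:

:: Run the bundled llama-server.exe with the same arguments Tabby's
:: supervisor used, then print the exit code.
cd C:\Portable\tabby_x86_64-windows-msvc\windows-msvc-18
llama-server.exe -m "C:\Users\Professional\.tabby\models\TabbyML\Nomic-Embed-Text\ggml\model.gguf" ^
  --cont-batching --port 30888 -np 1 --ctx-size 4096 ^
  -ngl 9999 --embedding --ubatch-size 4096
echo exit code: %ERRORLEVEL%
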
gmatht commented 1 week ago

tabby_x86_64-windows-msvc-vulkan/tabby serve never finishes loading for me either. The first problem it reports is also "not compiled with GPU offload support". It seems like the Vulkan build should support offloading to the GPU.

Information about your version: tabby 0.19.0

Information about your GPU: Intel Arc A770M, 16GB VRAM

'nvidia-smi' is not recognized as an internal or external command,
operable program or batch file.

Intel Driver & Support Assistant reports:

    Driver Details
    Up to date
    Provider: Intel Corporation
    Version: 32.0.101.6078
    Date: 2024-09-13

    Device Details
    Adapter Compatibility: Intel Corporation
    Video Processor: Intel® Arc™ A770M Graphics Family
    Resolution: 3840 x 2160
    Bits Per Pixel: 32
    Number of Colors: 4294967296
    Refresh Rate - Current: 60 Hz
    Refresh Rate - Maximum: 240 Hz
    Refresh Rate - Minimum: 23 Hz
    Adapter DAC Type: Internal
    Availability: Running at full power
    Status: This device is working properly.
    Location: PCI bus 3, device 0, function 0
    Device Id: PCI\VEN_8086&DEV_5690&SUBSYS_30268086&REV_08\6&1A6B8599&0&00080008

Additional context

C:\bin\offpath\tabby_x86_64-windows-msvc-vulkan>tabby serve --model StarCoder-1B --device vulkan
⠦     4.527 s   Starting...2024-11-06T09:01:10.861861Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:98: llama-server <embedding> exited with status code -1073741819, args: `Command { std: "C:\\bin\\offpath\\tabby_x86_64-windows-msvc-vulkan\\llama-server.exe" "-m" "C:\\Users\\s_pam\\.tabby\\models\\TabbyML\\Nomic-Embed-Text\\ggml\\model-00001-of-00001.gguf" "--cont-batching" "--port" "30888" "-np" "1" "--log-disable" "--ctx-size" "4096" "-ngl" "9999" "--embedding" "--ubatch-size" "4096", kill_on_drop: true }`
2024-11-06T09:01:10.862447Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: warning: not compiled with GPU offload support, --gpu-layers option will be ignored
2024-11-06T09:01:10.862792Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: warning: see main README.md for information on enabling GPU BLAS support
2024-11-06T09:01:10.863095Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: loaded meta data with 23 key-value pairs and 112 tensors from C:\Users\s_pam\.tabby\models\TabbyML\Nomic-Embed-Text\ggml\model-00001-of-00001.gguf (version GGUF V3 (latest))
2024-11-06T09:01:10.863227Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
2024-11-06T09:01:10.863359Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:110: <embedding>: llama_model_loader: - kv   0:                       general.architecture str              = nomic-bert
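
The warning above suggests the bundled llama-server.exe was built without the Vulkan backend at all: a Vulkan-enabled llama.cpp build normally announces the detected devices at startup (something like "ggml_vulkan: Found 1 Vulkan devices"), while this one prints "not compiled with GPU offload support". A quick way to check by hand (a sketch, reusing the model path from the log above):

:: Point the bundled server at the already-downloaded model and watch the
:: first lines of output: a Vulkan build should list the GPU, a CPU-only
:: build prints the "not compiled with GPU offload support" warning.
cd C:\bin\offpath\tabby_x86_64-windows-msvc-vulkan
llama-server.exe -m "C:\Users\s_pam\.tabby\models\TabbyML\Nomic-Embed-Text\ggml\model-00001-of-00001.gguf" --port 30888
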
gmatht commented 1 week ago

A workaround is to install an upstream llama-server.exe, e.g. llama-b4034-bin-win-vulkan-x64 from the llama.cpp releases.
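
In case it helps, a sketch of the steps (the release URL and asset name are assumptions based on the build number above; check the llama.cpp releases page for the current one). Per the Command paths in the logs, Tabby launches whatever llama-server.exe sits in its own directory, so the upstream binary can be dropped in its place:

:: Back up Tabby's bundled binary and replace it with the upstream Vulkan
:: build; the zip also contains the DLLs the server needs, so extract it
:: all into Tabby's directory. (URL is an assumption; verify the asset at
:: https://github.com/ggerganov/llama.cpp/releases)
cd C:\bin\offpath\tabby_x86_64-windows-msvc-vulkan
move llama-server.exe llama-server.exe.bak
curl -LO https://github.com/ggerganov/llama.cpp/releases/download/b4034/llama-b4034-bin-win-vulkan-x64.zip
tar -xf llama-b4034-bin-win-vulkan-x64.zip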

zwpaper commented 1 week ago

Hi @gmatht and @Vlod-github, thank you for reporting the issue.

Did you install the Vulkan runtime before using Tabby? Could you please try running the following command and post the result here:

vulkaninfo.exe
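
(If the full vulkaninfo dump is too long to post, and the Vulkan SDK tools are installed, the summary view is usually enough to confirm the runtime sees the GPU:)

vulkaninfo --summary
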
zwpaper commented 1 week ago

I have verified this on Linux and found an issue with the Vulkan build. We will investigate further and fix it later.

https://gist.github.com/zwpaper/08e80712e1f3f82a41a1a0ee41735b2f