TabbyML / tabby

Self-hosted AI coding assistant
https://tabby.tabbyml.com/

Can't get tabby 0.13.1 or 0.14.0 to work following the quick-start guide #2719

Open yourchanges opened 1 month ago

yourchanges commented 1 month ago

Describe the bug: Tabby 0.13.1 and 0.14.0 cannot be brought up by following the quick-start guide; startup only spawns the embedding model process

/opt/tabby/bin/llama-server -m /data/models/TabbyML/Nomic-Embed-Text/ggml/model.gguf --cont-batching --port 30888 -np 1 --log-disable --ctx-size 4096 -ngl 9999 --embedding --ubatch-size 4096

and hangs forever.
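To check whether the embedding server itself is what stalls, one debugging sketch (assumptions on my part: the container is named tabbyserver4 as in the repro below, and curl is available inside the image) is to launch the same llama-server command by hand and poll llama.cpp's built-in /health endpoint, which only reports ok once the model has finished loading:

# run the embedding server manually inside the running container
docker exec -it tabbyserver4 /opt/tabby/bin/llama-server \
    -m /data/models/TabbyML/Nomic-Embed-Text/ggml/model.gguf \
    --port 30888 --embedding --ctx-size 4096 -ngl 9999

# from a second shell: llama.cpp's server answers on /health once loading finishes
docker exec tabbyserver4 curl -s http://localhost:30888/health

If /health never answers, the hang is inside llama-server's CUDA initialization rather than in tabby itself.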

docker run -it --name tabbyserver4 --restart=unless-stopped --gpus '"device=0"' -p 8082:8080    -v /data/tabby:/data tabbyml/tabby serve --model StarCoder-1B --chat-model Qwen2-1.5B-Instruct --device cuda
Writing to new file.
🎯 Downloaded https://huggingface.co/TabbyML/models/resolve/main/starcoderbase-1B.Q8_0.gguf to /data/models/TabbyML/StarCoder-1B/ggml/model.gguf.tmp
   00:03:02 ▕████████████████████▏ 1.23 GiB/1.23 GiB  6.88 MiB/s  ETA 0s.
✅ Checksum OK.
Writing to new file.
🎯 Downloaded https://huggingface.co/Qwen/Qwen2-1.5B-Instruct-GGUF/resolve/main/qwen2-1_5b-instruct-q8_0.gguf to /data/models/TabbyML/Qwen2-1.5B-Instruct/ggml/model.gguf.tmp
   00:03:37 ▕████████████████████▏ 1.53 GiB/1.53 GiB  7.22 MiB/s  ETA 0s.
✅ Checksum OK.
⠋  2173.060 s   Starting...
2024-07-24T07:25:27.218916Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:99: llama-server <embedding> exited with status code -1
2024-07-24T07:25:27.218935Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: loaded meta data with 23 key-value pairs and 112 tensors from /data/models/TabbyML/Nomic-Embed-Text/ggml/model.gguf (version GGUF V3 (latest))
2024-07-24T07:25:27.218940Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
2024-07-24T07:25:27.218943Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv   0:                       general.architecture str              = nomic-bert
2024-07-24T07:25:27.218946Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv   1:                               general.name str              = nomic-embed-text-v1.5
2024-07-24T07:25:27.218950Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv   2:                     nomic-bert.block_count u32              = 12
2024-07-24T07:25:27.218953Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv   3:                  nomic-bert.context_length u32              = 2048
2024-07-24T07:25:27.218960Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv   4:                nomic-bert.embedding_length u32              = 768
2024-07-24T07:25:27.218962Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv   5:             nomic-bert.feed_forward_length u32              = 3072
2024-07-24T07:25:27.218964Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv   6:            nomic-bert.attention.head_count u32              = 12
2024-07-24T07:25:27.218965Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv   7:    nomic-bert.attention.layer_norm_epsilon f32              = 0.000000
2024-07-24T07:25:27.218968Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv   8:                          general.file_type u32              = 7
2024-07-24T07:25:27.218971Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv   9:                nomic-bert.attention.causal bool             = false
2024-07-24T07:25:27.218974Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv  10:                    nomic-bert.pooling_type u32              = 1
2024-07-24T07:25:27.218982Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv  11:                  nomic-bert.rope.freq_base f32              = 1000.000000
2024-07-24T07:25:27.218983Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv  12:            tokenizer.ggml.token_type_count u32              = 2
2024-07-24T07:25:27.218985Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv  13:                tokenizer.ggml.bos_token_id u32              = 101
2024-07-24T07:25:27.218986Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv  14:                tokenizer.ggml.eos_token_id u32              = 102
2024-07-24T07:25:27.218988Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = bert
2024-07-24T07:25:27.218991Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,30522]   = ["[PAD]", "[unused0]", "[unused1]", "...
2024-07-24T07:25:27.218992Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv  17:                      tokenizer.ggml.scores arr[f32,30522]   = [-1000.000000, -1000.000000, -1000.00...
2024-07-24T07:25:27.218994Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,30522]   = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
2024-07-24T07:25:27.218996Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 100
2024-07-24T07:25:27.218999Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv  20:          tokenizer.ggml.seperator_token_id u32              = 102
2024-07-24T07:25:27.219005Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 0
2024-07-24T07:25:27.219009Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv  22:               general.quantization_version u32              = 2
2024-07-24T07:25:27.219012Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - type  f32:   51 tensors
2024-07-24T07:25:27.219016Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - type q8_0:   61 tensors
2024-07-24T07:25:27.219021Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_vocab: special tokens cache size = 5
2024-07-24T07:25:27.219026Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_vocab: token to piece cache size = 0.2032 MB
2024-07-24T07:25:27.219031Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: format           = GGUF V3 (latest)
2024-07-24T07:25:27.219036Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: arch             = nomic-bert
2024-07-24T07:25:27.219041Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: vocab type       = WPM
2024-07-24T07:25:27.219047Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_vocab          = 30522
2024-07-24T07:25:27.219051Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_merges         = 0
2024-07-24T07:25:27.219056Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: vocab_only       = 0
2024-07-24T07:25:27.219064Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_ctx_train      = 2048
2024-07-24T07:25:27.219071Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_embd           = 768
2024-07-24T07:25:27.219078Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_layer          = 12
2024-07-24T07:25:27.219084Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_head           = 12
2024-07-24T07:25:27.219091Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_head_kv        = 12
2024-07-24T07:25:27.219099Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_rot            = 64
2024-07-24T07:25:27.219105Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_swa            = 0
2024-07-24T07:25:27.219111Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_embd_head_k    = 64
2024-07-24T07:25:27.219118Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_embd_head_v    = 64
2024-07-24T07:25:27.219125Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_gqa            = 1
2024-07-24T07:25:27.219133Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_embd_k_gqa     = 768
2024-07-24T07:25:27.219139Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_embd_v_gqa     = 768
2024-07-24T07:25:27.219143Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: f_norm_eps       = 1.0e-12
2024-07-24T07:25:27.219149Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
2024-07-24T07:25:27.219157Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
2024-07-24T07:25:27.219176Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
2024-07-24T07:25:27.219186Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: f_logit_scale    = 0.0e+00
2024-07-24T07:25:27.219193Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_ff             = 3072
2024-07-24T07:25:27.219210Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_expert         = 0
2024-07-24T07:25:27.219218Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_expert_used    = 0
2024-07-24T07:25:27.219224Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: causal attn      = 0
2024-07-24T07:25:27.219251Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: pooling type     = 1
2024-07-24T07:25:27.219254Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: rope type        = 2
2024-07-24T07:25:27.219257Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: rope scaling     = linear
2024-07-24T07:25:27.219260Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: freq_base_train  = 1000.0
2024-07-24T07:25:27.219265Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: freq_scale_train = 1
2024-07-24T07:25:27.219270Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_ctx_orig_yarn  = 2048
2024-07-24T07:25:27.219275Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: rope_finetuned   = unknown
2024-07-24T07:25:27.219280Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: ssm_d_conv       = 0
2024-07-24T07:25:27.219286Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: ssm_d_inner      = 0
2024-07-24T07:25:27.219291Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: ssm_d_state      = 0
2024-07-24T07:25:27.219298Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: ssm_dt_rank      = 0
2024-07-24T07:25:27.219304Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: model type       = 137M
2024-07-24T07:25:27.219309Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: model ftype      = Q8_0
2024-07-24T07:25:27.219315Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: model params     = 136.73 M
2024-07-24T07:25:27.219329Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: model size       = 138.65 MiB (8.51 BPW)
2024-07-24T07:25:27.219334Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: general.name     = nomic-embed-text-v1.5
2024-07-24T07:25:27.219343Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: BOS token        = 101 '[CLS]'
2024-07-24T07:25:27.219347Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: EOS token        = 102 '[SEP]'
2024-07-24T07:25:27.219352Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: UNK token        = 100 '[UNK]'
2024-07-24T07:25:27.219362Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: SEP token        = 102 '[SEP]'
2024-07-24T07:25:27.219365Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: PAD token        = 0 '[PAD]'
2024-07-24T07:25:27.219371Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: CLS token        = 101 '[CLS]'
2024-07-24T07:25:27.219374Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: MASK token       = 103 '[MASK]'
2024-07-24T07:25:27.219376Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: LF token         = 0 '[PAD]'
2024-07-24T07:25:27.219378Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: max token length = 21
2024-07-24T07:25:27.219381Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
2024-07-24T07:25:27.219387Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
2024-07-24T07:25:27.219390Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: ggml_cuda_init: found 1 CUDA devices:
2024-07-24T07:25:27.219392Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>:   Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
⠸  2174.102 s   Starting...^C
2024-07-24T07:25:28.289106Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:99: llama-server <embedding> exited with status code -1
(the same model-loading log repeats verbatim as the supervisor restarts llama-server)

Information about your version: 0.14.0 and 0.13.1

Information about your GPU:

Wed Jul 24 15:30:02 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3080        On  | 00000000:01:00.0 Off |                  N/A |
| 50%   43C    P8              19W / 320W |     29MiB / 20480MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1995      G   /usr/libexec/Xorg                            12MiB |
|    0   N/A  N/A      3008      G   gnome-shell                                   4MiB |
|    0   N/A  N/A      3923      G   /usr/libexec/gnome-initial-setup              3MiB |
+---------------------------------------------------------------------------------------+
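As a sanity check that the NVIDIA container runtime is passing the GPU through at all (this sketch is an addition, not from the report; the CUDA base image tag is one example), running nvidia-smi inside a plain CUDA container should print the same table as above:

# if this fails, the problem is in the Docker/NVIDIA setup rather than in tabby
docker run --rm --gpus '"device=0"' nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi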
kba-tmn3 commented 1 month ago

I have the same issue. How can I troubleshoot it?

kitswas commented 4 weeks ago

Same here. Running with

docker run -it --gpus all   -p 8080:8080 -v $HOME/.tabby:/data   tabbyml/tabby serve --model StarCoder-1B --chat-model Qwen2-1.5B-Instruct --device cuda

GPU info:

Thu Aug 15 10:11:51 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1650        Off | 00000000:01:00.0 Off |                  N/A |
| N/A   44C    P0               6W /  50W |      3MiB /  4096MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2642      G   /usr/bin/gnome-shell                          1MiB |
+---------------------------------------------------------------------------------------+
MaxenceBouvier commented 3 weeks ago

Same issue as well. Going back to tabby v0.12.0 seems to work for me (when serving CodeGemma-7B, without the webserver).
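For anyone wanting to try the same rollback, a sketch of the command (assuming a versioned tabbyml/tabby:0.12.0 image tag is published on Docker Hub; adjust the model and paths to your setup):

docker run -it --gpus all -p 8080:8080 -v $HOME/.tabby:/data \
    tabbyml/tabby:0.12.0 serve --model CodeGemma-7B --device cuda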

wsxiaoys commented 3 weeks ago

Thank you for reporting the issues. The changes in https://github.com/TabbyML/tabby/pull/2925/files will be included in the 0.16 release and will provide more detailed information in the logs to assist with debugging.
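Until 0.16 lands, one way to squeeze more detail out of current builds (hedged: tabby is a Rust application, so the standard RUST_LOG environment variable should raise log verbosity if tabby's log filter reads it) is:

docker run -it --gpus all -e RUST_LOG=debug -p 8080:8080 -v $HOME/.tabby:/data \
    tabbyml/tabby serve --model StarCoder-1B --chat-model Qwen2-1.5B-Instruct --device cuda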