LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Error running koboldcpp-linux-x64 of release 1.58 #705

Closed: jiangzhengshen closed this issue 5 months ago

jiangzhengshen commented 6 months ago

Running koboldcpp-1.58 fails with the following error:

[xxx@centos local]$ ./koboldcpp-linux-x64-1.58 silicon-maid-7b.Q5_K_M.gguf --usecublas
***
Welcome to KoboldCpp - Version 1.58
Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required.
Initializing dynamic library: koboldcpp_cublas.so
Traceback (most recent call last):
  File "PyInstaller/loader/pyimod03_ctypes.py", line 53, in __init__
  File "ctypes/__init__.py", line 373, in __init__
OSError: /tmp/_MEIUwJ6Ds/koboldcpp_cublas.so: undefined symbol: getcpu

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "koboldcpp.py", line 2673, in <module>
  File "koboldcpp.py", line 2431, in main
  File "koboldcpp.py", line 232, in init_library
  File "PyInstaller/loader/pyimod03_ctypes.py", line 55, in __init__
pyimod03_ctypes.PyInstallerImportError: Failed to load dynlib/dll '/tmp/_MEIUwJ6Ds/koboldcpp_cublas.so'. Most likely this dynlib/dll was not found when the application was frozen.
[2060] Failed to execute script 'koboldcpp' due to unhandled exception!

koboldcpp-1.57.1 works fine:

[xxx@centos local]$ ./koboldcpp-linux-x64-1.57.1 silicon-maid-7b.Q5_K_M.gguf --usecublas
***
Welcome to KoboldCpp - Version 1.57.1
Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required.
Initializing dynamic library: koboldcpp_cublas.so
==========
Namespace(bantokens=None, benchmark=None, blasbatchsize=512, blasthreads=103, config=None, contextsize=2048, debugmode=0, forceversion=0, foreground=False, gpulayers=0, highpriority=False, hordeconfig=None, host='', launch=False, lora=None, model=None, model_param='silicon-maid-7b.Q5_K_M.gguf', multiuser=0, noavx2=False, noblas=False, nommap=False, noshift=False, onready='', port=5001, port_param=5001, preloadstory='', quiet=False, remotetunnel=False, ropeconfig=[0.0, 10000.0], skiplauncher=False, smartcontext=False, ssl=None, tensor_split=None, threads=103, useclblast=None, usecublas=[], usemlock=False, usevulkan=None)
==========
Loading model: /cfs/xxx/silicon-maid-7b.Q5_K_M.gguf 
[Threads: 103, BlasThreads: 103, SmartContext: False, ContextShift: True]

The reported GGUF Arch is: llama

---
Identified as GGUF model: (ver 6)
Attempting to Load...
---
Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead!
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA A10, compute capability 8.6, VMM: yes
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from /cfs/xxx/silicon-maid-7b.Q5_K_M.gguf (version GGUF V3 (latest))
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = unknown, may not work (guessed)
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 4.78 GiB (5.67 BPW) 
llm_load_print_meta: general.name     = sanjiwatsuki_silicon-maid-7b
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.11 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors:        CPU buffer size =  4892.99 MiB
...................................................................................................
Automatic RoPE Scaling: Using (scale:1.000, base:10000.0).
llama_new_context_with_model: n_ctx      = 2128
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =   266.00 MiB
llama_new_context_with_model: KV self size  =  266.00 MiB, K (f16):  133.00 MiB, V (f16):  133.00 MiB
llama_new_context_with_model:  CUDA_Host input buffer size   =    12.17 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =   172.70 MiB
llama_new_context_with_model: graph splits (measure): 1
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold API on port 5001 at http://localhost:5001/api/
Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/
LostRuins commented 6 months ago

@henk717 I think this is caused by the culibos changes.

henk717 commented 6 months ago

Seems unlikely, because if it were culibos, why would the Docker image work? This is the same getcpu issue you had in your build before. Since this is CentOS, it's more likely that the binary was compiled on a newer distribution than CentOS itself, so it isn't compatible. We can't provide older binaries because of GitHub's CI limitations, but I expect koboldcpp.sh will work fine.
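The same load failure can be reproduced outside Python with a few lines of C, since ctypes ultimately loads the shared library through dlopen(). The following is only a sketch; the library path is a placeholder and the program is not part of koboldcpp:

```c
/* Sketch: reproduce the load failure outside Python.  The .so path below is
 * a placeholder; point it at the extracted koboldcpp_cublas.so. */
#include <stdio.h>
#include <dlfcn.h>

int main(void) {
    /* RTLD_NOW resolves every symbol at load time (ctypes also binds
     * immediately), so a getcpu reference that the host glibc cannot satisfy
     * fails right here with "undefined symbol: getcpu". */
    void *handle = dlopen("./koboldcpp_cublas.so", RTLD_NOW);
    if (!handle) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }
    puts("library loaded fine");
    dlclose(handle);
    return 0;
}
```

Built with, e.g., `cc repro.c -ldl`, this prints the same "undefined symbol: getcpu" message as the PyInstaller traceback above when run on a host whose glibc predates the getcpu() wrapper (added in glibc 2.29, so missing on CentOS 7 and Ubuntu 16.04/18.04).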

LostRuins commented 6 months ago

You are right, it's due to the getcpu call, which is not really needed at the moment.
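One way to drop that hard dependency (a sketch only, not the actual change that went into koboldcpp) is to issue the raw system call instead of the glibc getcpu() wrapper: the wrapper symbol only exists from glibc 2.29, while the underlying syscall has been in the kernel since 2.6.19, so the built .so no longer imports a getcpu symbol that older distributions cannot provide.

```c
/* Hedged sketch, not the actual koboldcpp patch: query the current CPU via
 * the raw getcpu system call rather than the glibc getcpu() wrapper, so the
 * binary carries no undefined "getcpu" symbol on older glibc. */
#define _GNU_SOURCE
#include <unistd.h>        /* syscall() */
#include <sys/syscall.h>   /* SYS_getcpu */

static int current_cpu(void) {
    unsigned int cpu = 0, node = 0;
    if (syscall(SYS_getcpu, &cpu, &node, NULL) == 0)
        return (int)cpu;
    return -1;  /* unknown: callers can simply skip any CPU-affinity logic */
}
```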

henk717 commented 6 months ago

Can confirm this is reproducible on Ubuntu 18.04, but not solvable by compiling it on Ubuntu 18.04.

LostRuins commented 5 months ago

Hi, this should now be fixed. Please check the latest version!

NaiveYan commented 5 months ago

Can confirm 1.59 works on Ubuntu 16.04.