LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

[BUG] (v1.55.1 / Kalomaze Smooth Build) - GPU ID mixed up. #634

Open SabinStargem opened 7 months ago

SabinStargem commented 7 months ago

I just got an RTX 4090, so I was eager to try it out. However, KoboldCPP seems confused about which GPU is which. In the launcher, selecting ID 2 labels the card as the 4090, but on booting the card is identified as my 3060 at ID 0. If I use ID 1 instead, the 4090 is mentioned. I ran a trial generation with both ID 1 and ID 2 to see how long each took; ID 1 is definitely the 4090, as it worked much faster.

[screenshot]


Welcome to KoboldCpp - Version 1.54 For command line arguments, please refer to --help


Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required. Initializing dynamic library: koboldcpp_cublas.dll

Namespace(model=None, model_param='C:/KoboldCPP/Models/bagel-hermes-2x34b.Q6_K.gguf', port=5001, port_param=5001, host='', launch=True, lora=None, config=None, threads=31, blasthreads=31, highpriority=False, contextsize=32768, blasbatchsize=512, ropeconfig=[0.0, 10000.0], smartcontext=False, noshift=False, bantokens=None, forceversion=0, nommap=False, usemlock=True, noavx2=False, debugmode=0, skiplauncher=False, hordeconfig=None, noblas=False, useclblast=None, usecublas=['normal', '1', 'mmq'], gpulayers=22, tensor_split=None, onready='', multiuser=1, remotetunnel=False, foreground=False, preloadstory=None, quiet=False, ssl=None)

Loading model: C:\KoboldCPP\Models\bagel-hermes-2x34b.Q6_K.gguf [Threads: 31, BlasThreads: 31, SmartContext: False, ContextShift: True]

The reported GGUF Arch is: llama


Identified as LLAMA model: (ver 6) Attempting to Load...

Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead!
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 |
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
llama_model_loader: loaded meta data with 26 key-value pairs and 783 tensors from C:\KoboldCPP\Models\bagel-hermes-2x34b.Q6_K.gguf
llm_load_vocab: mismatch in special tokens definition ( 498/64000 vs 267/64000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 64000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 200000
llm_load_print_meta: n_embd = 7168
llm_load_print_meta: n_head = 56
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 60
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 7
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 20480
llm_load_print_meta: n_expert = 2
llm_load_print_meta: n_expert_used = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 5000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 200000
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 30B
llm_load_print_meta: model ftype = unknown, may not work (guessed)
llm_load_print_meta: model params = 60.81 B
llm_load_print_meta: model size = 46.46 GiB (6.56 BPW)
llm_load_print_meta: general.name = weyaxi_bagel-hermes-2x34b
llm_load_print_meta: BOS token = 1 '<|startoftext|>'
llm_load_print_meta: EOS token = 2 '<|endoftext|>'
llm_load_print_meta: UNK token = 0 ''
llm_load_print_meta: PAD token = 1 '<|startoftext|>'
llm_load_print_meta: LF token = 315 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.30 MiB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: system memory used = 30396.84 MiB
llm_load_tensors: VRAM used = 17182.43 MiB
llm_load_tensors: offloading 22 repeating layers to GPU
llm_load_tensors: offloaded 22/61 layers to GPU
....................................................................................................
Automatic RoPE Scaling: Using model internal value.
llama_new_context_with_model: n_ctx = 32848
llama_new_context_with_model: freq_base = 5000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: VRAM kv self = 2822.88 MB
llama_new_context_with_model: KV self size = 7698.75 MiB, K (f16): 3849.38 MiB, V (f16): 3849.38 MiB
llama_build_graph: non-view tensors processed: 2104/2104
llama_new_context_with_model: compute buffer total size = 3742.11 MiB
llama_new_context_with_model: VRAM scratch buffer: 3738.92 MiB
llama_new_context_with_model: total VRAM used: 23744.22 MiB (model: 17182.43 MiB, context: 6561.79 MiB)
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold API on port 5001 at http://localhost:5001/api/
Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/

Please connect to custom endpoint at http://localhost:5001

LostRuins commented 7 months ago

Hmm... that's unusual. The numbers you see in the launcher are retrieved via nvidia-smi. Which cards, and in what order, do you see when you run the command nvidia-smi?
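(For anyone reproducing this: the list shown in the launcher can be approximated with a small script. This is a minimal sketch, not koboldcpp's actual launcher code; it only assumes nvidia-smi is on the PATH. Note that nvidia-smi numbers devices by PCI bus ID, which need not match the order the CUDA runtime reports at load time.)

# Minimal sketch: list GPUs the way nvidia-smi reports them.
# Not koboldcpp's actual code; only assumes nvidia-smi is on the PATH.
import subprocess

def list_gpus_via_nvidia_smi():
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,name", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    # Each line looks like "0, NVIDIA GeForce RTX 3060"
    return [line.split(", ", 1) for line in out.strip().splitlines()]

if __name__ == "__main__":
    for index, name in list_gpus_via_nvidia_smi():
        print(f"nvidia-smi index {index}: {name}")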

SabinStargem commented 7 months ago

Here is what I got from the terminal:

PowerShell 7.3.11
PS C:\KoboldCPP> nvidia-smi
Tue Jan 23 23:25:08 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 546.65                 Driver Version: 546.65       CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060      WDDM  | 00000000:03:00.0 Off |                  N/A |
|  0%   32C    P8              14W / 170W |      0MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090      WDDM  | 00000000:0A:00.0  On |                  Off |
|  0%   42C    P2              69W / 450W |  23918MiB / 24564MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=========================================================================================|
|   1   N/A  N/A      1800    C+G   ...8wekyb3d8bbwe\WindowsTerminal.exe          N/A   |
|   1   N/A  N/A      2880    C+G   ...8wekyb3d8bbwe\WindowsTerminal.exe          N/A   |
|   1   N/A  N/A      4472    C+G   ...5n1h2txyewy\ShellExperienceHost.exe        N/A   |
|   1   N/A  N/A      7128    C+G   ...2txyewy\StartMenuExperienceHost.exe        N/A   |
|   1   N/A  N/A      9840    C+G   ...crosoft\Edge\Application\msedge.exe        N/A   |
|   1   N/A  N/A     11852    C+G   ...m Files\Mozilla Firefox\firefox.exe        N/A   |
|   1   N/A  N/A     12328    C+G   C:\Windows\explorer.exe                       N/A   |
|   1   N/A  N/A     12592    C+G   ...m Files\Mozilla Firefox\firefox.exe        N/A   |
|   1   N/A  N/A     13536    C+G   ...nt.CBS_cw5n1h2txyewy\SearchHost.exe        N/A   |
|   1   N/A  N/A     14664    C+G   ...GeForce Experience\NVIDIA Share.exe        N/A   |
|   1   N/A  N/A     15668    C+G   ...GeForce Experience\NVIDIA Share.exe        N/A   |
|   1   N/A  N/A     16512    C+G   ...US\ArmouryDevice\asus_framework.exe        N/A   |
|   1   N/A  N/A     18928    C     ...ldCPP\koboldcpp_smooth_sampling.exe        N/A   |
|   1   N/A  N/A     19272    C+G   ...CBS_cw5n1h2txyewy\TextInputHost.exe        N/A   |
|   1   N/A  N/A     19500    C+G   ...\cef\cef.win7x64\steamwebhelper.exe        N/A   |
+---------------------------------------------------------------------------------------+

LostRuins commented 7 months ago

Okay, I am not too sure either. Is anyone else seeing the same situation? Basically I allocate the card by setting CUDA_VISIBLE_DEVICES to a specific GPU, but somehow the order on your system doesn't match the default order.
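(A possible explanation, offered as a hedged note rather than a confirmed diagnosis: the CUDA runtime's default device ordering, CUDA_DEVICE_ORDER=FASTEST_FIRST, tends to put the fastest card first, while nvidia-smi always lists devices by PCI bus ID. The two numberings can therefore disagree when the faster card sits on a later bus slot, as with the 4090 here. Setting CUDA_DEVICE_ORDER=PCI_BUS_ID before CUDA initializes makes the runtime follow the nvidia-smi order. The sketch below uses PyTorch only as a convenient way to query the CUDA runtime; koboldcpp itself does not depend on it.)

# Sketch: enumerate CUDA devices in PCI-bus order so the numbering matches nvidia-smi.
# Assumption: a CUDA-enabled PyTorch install is available for querying the runtime.
import os

# Must be set before the CUDA runtime is initialized (i.e. before any import
# that touches CUDA), otherwise it has no effect.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"

import torch

for i in range(torch.cuda.device_count()):
    print(f"CUDA device {i}: {torch.cuda.get_device_name(i)}")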

Another thing you can try is setting it to "ALL" and allocating via tensor_split, e.g. 1.0 0.0 or 0.0 1.0, and checking whether the result is different.
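(A sketch of that suggestion, with flag names mirrored from the Namespace dumps earlier in this thread (usecublas, tensor_split, gpulayers). The executable name and model path are placeholders for your own setup; double-check the exact flags against --help for your build.)

# Sketch: launch on all visible GPUs, but pin the whole model to one card via tensor_split.
import subprocess

model = r"C:\KoboldCPP\Models\bagel-hermes-2x34b.Q6_K.gguf"  # example path from this thread

cmd = [
    "koboldcpp.exe",                 # placeholder: whichever koboldcpp build you launch
    "--model", model,
    "--usecublas", "normal", "mmq",  # no explicit GPU id -> use all visible devices
    "--tensor_split", "1.0", "0.0",  # everything on the first device; swap to 0.0 1.0 to flip
    "--gpulayers", "22",
]
subprocess.run(cmd, check=True)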

askmyteapot commented 7 months ago

[screenshot] It has always been that way. I think it's because the list in the GUI starts from 1, whereas in nvidia-smi it starts from 0.

SabinStargem commented 7 months ago

With multi-GPU, the 4090 is listed first.


Welcome to KoboldCpp - Version 1.54 For command line arguments, please refer to --help


Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required. Initializing dynamic library: koboldcpp_cublas.dll

Namespace(model=None, model_param='C:/KoboldCPP/Models/cognitivecomputations_laserxtral-Q6_K.gguf', port=5001, port_param=5001, host='', launch=True, lora=None, config=None, threads=31, blasthreads=31, highpriority=False, contextsize=32768, blasbatchsize=512, ropeconfig=[0.0, 10000.0], smartcontext=False, noshift=False, bantokens=None, forceversion=0, nommap=False, usemlock=True, noavx2=False, debugmode=0, skiplauncher=False, hordeconfig=None, noblas=False, useclblast=None, usecublas=['normal', 'mmq'], gpulayers=18, tensor_split=[7.0, 3.0], onready='', multiuser=1, remotetunnel=False, foreground=False, preloadstory=None, quiet=False, ssl=None)

Loading model: C:\KoboldCPP\Models\cognitivecomputations_laserxtral-Q6_K.gguf [Threads: 31, BlasThreads: 31, SmartContext: False, ContextShift: True]

The reported GGUF Arch is: llama


Identified as LLAMA model: (ver 6) Attempting to Load...

Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead!
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 |
ggml_init_cublas: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
llama_model_loader: loaded meta data with 25 key-value pairs and 611 tensors from C:\KoboldCPP\Models\cognitivecomputations_laserxtral-Q6_K.gguf
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 4
llm_load_print_meta: n_expert_used = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = unknown, may not work
llm_load_print_meta: model params = 24.15 B
llm_load_print_meta: model size = 18.45 GiB (6.56 BPW)
llm_load_print_meta: general.name = models
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 2 ''
llm_load_print_meta: UNK token = 0 ''
llm_load_print_meta: PAD token = 1 ''
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.23 MiB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: system memory used = 8383.08 MiB
llm_load_tensors: VRAM used = 10514.25 MiB
llm_load_tensors: offloading 18 repeating layers to GPU
llm_load_tensors: offloaded 18/33 layers to GPU
....................................................................................................
Automatic RoPE Scaling: Using (scale:1.000, base:10000.0).
llama_new_context_with_model: n_ctx = 32848
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: VRAM kv self = 2309.62 MB
llama_new_context_with_model: KV self size = 4106.00 MiB, K (f16): 2053.00 MiB, V (f16): 2053.00 MiB
llama_build_graph: non-view tensors processed: 1124/1124
llama_new_context_with_model: compute buffer total size = 2172.36 MiB
llama_new_context_with_model: VRAM scratch buffer: 2169.18 MiB
llama_new_context_with_model: total VRAM used: 14993.05 MiB (model: 10514.25 MiB, context: 4478.80 MiB)
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold API on port 5001 at http://localhost:5001/api/
Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/

Please connect to custom endpoint at http://localhost:5001

LostRuins commented 7 months ago

Yeap, so for some reason the order the GPUs get listed in is inconsistent. But in either case it's a small issue: just change the index you want to use and it should be okay.
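(Whichever index ends up being used, you can confirm which physical card actually received the offloaded layers by checking per-GPU memory usage while the model is loaded. A minimal sketch, assuming only that nvidia-smi is on the PATH:)

# Sketch: report per-GPU memory usage so you can see which card holds the model.
import subprocess

out = subprocess.run(
    ["nvidia-smi", "--query-gpu=index,name,memory.used", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.strip().splitlines():
    index, name, mem_used = [field.strip() for field in line.split(",")]
    print(f"GPU {index} ({name}): {mem_used} in use")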