SabinStargem opened 7 months ago
Hmm... that's unusual. The numbers you see in the launcher are retrieved from `nvidia-smi`. Which cards, and in what order, do you see when you run the command `nvidia-smi`?
Here is what I got from the terminal:
```
PowerShell 7.3.11
PS C:\KoboldCPP> nvidia-smi
Tue Jan 23 23:25:08 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 546.65                 Driver Version: 546.65       CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060      WDDM  | 00000000:03:00.0 Off |                  N/A |
|  0%   32C    P8              14W / 170W |      0MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090      WDDM  | 00000000:0A:00.0  On |                  Off |
|  0%   42C    P2              69W / 450W | 23918MiB / 24564MiB  |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    1   N/A  N/A      1800    C+G   ...8wekyb3d8bbwe\WindowsTerminal.exe        N/A    |
|    1   N/A  N/A      2880    C+G   ...8wekyb3d8bbwe\WindowsTerminal.exe        N/A    |
|    1   N/A  N/A      4472    C+G   ...5n1h2txyewy\ShellExperienceHost.exe      N/A    |
|    1   N/A  N/A      7128    C+G   ...2txyewy\StartMenuExperienceHost.exe      N/A    |
|    1   N/A  N/A      9840    C+G   ...crosoft\Edge\Application\msedge.exe      N/A    |
|    1   N/A  N/A     11852    C+G   ...m Files\Mozilla Firefox\firefox.exe      N/A    |
|    1   N/A  N/A     12328    C+G   C:\Windows\explorer.exe                     N/A    |
|    1   N/A  N/A     12592    C+G   ...m Files\Mozilla Firefox\firefox.exe      N/A    |
|    1   N/A  N/A     13536    C+G   ...nt.CBS_cw5n1h2txyewy\SearchHost.exe      N/A    |
|    1   N/A  N/A     14664    C+G   ...GeForce Experience\NVIDIA Share.exe      N/A    |
|    1   N/A  N/A     15668    C+G   ...GeForce Experience\NVIDIA Share.exe      N/A    |
|    1   N/A  N/A     16512    C+G   ...US\ArmouryDevice\asus_framework.exe      N/A    |
|    1   N/A  N/A     18928    C     ...ldCPP\koboldcpp_smooth_sampling.exe      N/A    |
|    1   N/A  N/A     19272    C+G   ...CBS_cw5n1h2txyewy\TextInputHost.exe      N/A    |
|    1   N/A  N/A     19500    C+G   ...\cef\cef.win7x64\steamwebhelper.exe      N/A    |
+---------------------------------------------------------------------------------------+
```
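To compare orderings without eyeballing the banner, `nvidia-smi` can also emit a machine-readable index/name table (`--query-gpu` and `--format=csv,noheader` are standard `nvidia-smi` flags; the helper functions here are just an illustration, not part of KoboldCpp):

```python
import subprocess

def parse_gpu_table(csv_text):
    """Parse 'index, name' CSV lines into a list of (index, name) tuples."""
    return [tuple(part.strip() for part in line.split(",", 1))
            for line in csv_text.strip().splitlines() if line.strip()]

def gpu_table():
    """Query nvidia-smi's own (PCI-bus-ordered) index/name listing."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=index,name", "--format=csv,noheader"],
        text=True,
    )
    return parse_gpu_table(out)
```

Comparing this table against the device order printed at KoboldCpp startup shows immediately whether the two numberings disagree.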
Okay, I'm not too sure either. Is anyone else seeing the same situation? Basically I am allocating the card with CUDA_VISIBLE_DEVICES set to a specific GPU, but somehow the order you see there doesn't match the default order. Another thing you can try is setting it to "ALL" and allocating via tensor_split, e.g. 1.0 0.0 or 0.0 1.0, and checking if that's different.
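As a sketch of the two approaches above (the values are illustrative, and the environment variable must be set before anything in the process initializes CUDA):

```python
import os

# Option 1: hide all but one physical card. This must be set before any
# CUDA-using library is loaded, or it has no effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # nvidia-smi index of the target card

# Option 2: leave the launcher on "ALL" and steer the split instead.
# Passed to koboldcpp as e.g. --tensor_split 1.0 0.0, this puts the whole
# model on the first device CUDA enumerates (0.0 1.0 puts it on the second).
tensor_split = [1.0, 0.0]
```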
It has always been that way. I think it's because the list in the GUI starts from 1, whereas in nvidia-smi it starts from 0.
With multi-GPU, the 4090 is listed first.
```
Welcome to KoboldCpp - Version 1.54
For command line arguments, please refer to --help
Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required.
Initializing dynamic library: koboldcpp_cublas.dll
Namespace(model=None, model_param='C:/KoboldCPP/Models/cognitivecomputations_laserxtral-Q6_K.gguf', port=5001, port_param=5001, host='', launch=True, lora=None, config=None, threads=31, blasthreads=31, highpriority=False, contextsize=32768, blasbatchsize=512, ropeconfig=[0.0, 10000.0], smartcontext=False, noshift=False, bantokens=None, forceversion=0, nommap=False, usemlock=True, noavx2=False, debugmode=0, skiplauncher=False, hordeconfig=None, noblas=False, useclblast=None, usecublas=['normal', 'mmq'], gpulayers=18, tensor_split=[7.0, 3.0], onready='', multiuser=1, remotetunnel=False, foreground=False, preloadstory=None, quiet=False, ssl=None)
Loading model: C:\KoboldCPP\Models\cognitivecomputations_laserxtral-Q6_K.gguf
[Threads: 31, BlasThreads: 31, SmartContext: False, ContextShift: True]

The reported GGUF Arch is: llama

Identified as LLAMA model: (ver 6)
Attempting to Load...

Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead!
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 |
ggml_init_cublas: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
llama_model_loader: loaded meta data with 25 key-value pairs and 611 tensors from C:\KoboldCPP\Models\cognitivecomputations_laserxtral-Q6_K.gguf
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 4
llm_load_print_meta: n_expert_used = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = unknown, may not work
llm_load_print_meta: model params = 24.15 B
llm_load_print_meta: model size = 18.45 GiB (6.56 BPW)
llm_load_print_meta: general.name = models
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 1 '<s>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.23 MiB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: system memory used = 8383.08 MiB
llm_load_tensors: VRAM used = 10514.25 MiB
llm_load_tensors: offloading 18 repeating layers to GPU
llm_load_tensors: offloaded 18/33 layers to GPU
....................................................................................................
Automatic RoPE Scaling: Using (scale:1.000, base:10000.0).
llama_new_context_with_model: n_ctx = 32848
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: VRAM kv self = 2309.62 MB
llama_new_context_with_model: KV self size = 4106.00 MiB, K (f16): 2053.00 MiB, V (f16): 2053.00 MiB
llama_build_graph: non-view tensors processed: 1124/1124
llama_new_context_with_model: compute buffer total size = 2172.36 MiB
llama_new_context_with_model: VRAM scratch buffer: 2169.18 MiB
llama_new_context_with_model: total VRAM used: 14993.05 MiB (model: 10514.25 MiB, context: 4478.80 MiB)
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold API on port 5001 at http://localhost:5001/api/
Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/
Please connect to custom endpoint at http://localhost:5001
```
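For reference, the `tensor_split=[7.0, 3.0]` in that log divides the offloaded layers between the visible devices in proportion to the weights. A rough sketch of the arithmetic (the function is illustrative, not KoboldCpp's actual code):

```python
def split_layers(n_offload, weights):
    """Roughly how a proportional tensor split assigns layers across GPUs."""
    total = sum(weights)
    counts = [int(n_offload * w / total) for w in weights]
    counts[0] += n_offload - sum(counts)  # rounding remainder goes to GPU 0
    return counts

print(split_layers(18, [7.0, 3.0]))  # → [13, 5]
```

So with 18 offloaded layers and a 7:3 split, roughly 13 layers land on the first enumerated device and 5 on the second.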
Yep, so for some reason the order the GPUs get listed in is inconsistent. Either way it's a minor issue: just change the index you want to use and it should be okay.
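One plausible explanation, not confirmed in this thread: CUDA enumerates devices fastest-first by default, while nvidia-smi lists them in PCI-bus order, which would make the 4090 CUDA device 0 even though nvidia-smi shows it at index 1. Forcing PCI-bus order makes the two numberings agree:

```python
import os

# Both variables must be set before CUDA initializes in the process.
# With PCI_BUS_ID ordering, CUDA device indices match nvidia-smi's,
# so "GPU 1" refers to the same physical card in both tools.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
```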
I just got an RTX 4090, so I was eager to try it out. However, KoboldCPP is confused. In the launcher, selecting ID 2 says it is the 4090. However, on booting, the card is identified as my 3060 at ID 0. If I use ID 1, the 4090 is mentioned. I ran a trial generation for both ID 1 and ID 2 to see how long each took. ID 1 is definitely the 4090, as it worked much faster.