SciSharp / LLamaSharp

A C#/.NET library to run LLM (🦙LLaMA/LLaVA) on your local device efficiently.
https://scisharp.github.io/LLamaSharp
MIT License

WebAPI project isn't using GPU, even though CUDA backend gets loaded by LlamaSharp.dll #278

Open PaulaScholz opened 11 months ago

PaulaScholz commented 11 months ago

The WebAPI project does not appear to use the GPU, even though I can see the CUDA 12 libllama.dll being loaded in TryLoad. In both the StatefulChatService and the StatelessChatService, I have set the model parameters to:

    var @params = new Common.ModelParams(configuration["ModelPath"])
    {
        ContextSize = 2048,
        BatchSize = 2048,
        GpuLayerCount = 48,
    };

Task Manager reports that the app is not using the GPU. Here is the model-load output in the console; as you can see, the GPU layer count isn't being applied:

    llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from c:\PaulaLlamaModels\llama-2-13b-chat.Q5_K_M.gguf (version GGUF V2)
    llama_model_loader: - tensor 0: token_embd.weight q5_K [ 5120, 32000, 1, 1 ]
    [ ... per-tensor listing for tensors 1-362 omitted for brevity ... ]
    llama_model_loader: - kv   0:                       general.architecture str
    llama_model_loader: - kv   1:                               general.name str
    llama_model_loader: - kv   2:                       llama.context_length u32
    llama_model_loader: - kv   3:                     llama.embedding_length u32
    llama_model_loader: - kv   4:                          llama.block_count u32
    llama_model_loader: - kv   5:                  llama.feed_forward_length u32
    llama_model_loader: - kv   6:                 llama.rope.dimension_count u32
    llama_model_loader: - kv   7:                 llama.attention.head_count u32
    llama_model_loader: - kv   8:              llama.attention.head_count_kv u32
    llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32
    llama_model_loader: - kv  10:                          general.file_type u32
    llama_model_loader: - kv  11:                       tokenizer.ggml.model str
    llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr
    llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr
    llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr
    llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32
    llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32
    llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32
    llama_model_loader: - kv  18:               general.quantization_version u32
    llama_model_loader: - type  f32:   81 tensors
    llama_model_loader: - type q5_K:  241 tensors
    llama_model_loader: - type q6_K:   41 tensors
    llm_load_vocab: special tokens definition check successful ( 259/32000 ).
    llm_load_print_meta: format           = GGUF V2
    llm_load_print_meta: arch             = llama
    llm_load_print_meta: vocab type       = SPM
    llm_load_print_meta: n_vocab          = 32000
    llm_load_print_meta: n_merges         = 0
    llm_load_print_meta: n_ctx_train      = 4096
    llm_load_print_meta: n_embd           = 5120
    llm_load_print_meta: n_head           = 40
    llm_load_print_meta: n_head_kv        = 40
    llm_load_print_meta: n_layer          = 40
    llm_load_print_meta: n_rot            = 128
    llm_load_print_meta: n_gqa            = 1
    llm_load_print_meta: f_norm_eps       = 0.0e+00
    llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
    llm_load_print_meta: f_clamp_kqv      = 0.0e+00
    llm_load_print_meta: f_max_alibi_bias = 0.0e+00
    llm_load_print_meta: n_ff             = 13824
    llm_load_print_meta: rope scaling     = linear
    llm_load_print_meta: freq_base_train  = 10000.0
    llm_load_print_meta: freq_scale_train = 1
    llm_load_print_meta: n_yarn_orig_ctx  = 4096
    llm_load_print_meta: rope_finetuned   = unknown
    llm_load_print_meta: model type       = 13B
    llm_load_print_meta: model ftype      = mostly Q5_K - Medium
    llm_load_print_meta: model params     = 13.02 B
    llm_load_print_meta: model size       = 8.60 GiB (5.67 BPW)
    llm_load_print_meta: general.name     = LLaMA v2
    llm_load_print_meta: BOS token        = 1 ''
    llm_load_print_meta: EOS token        = 2 ''
    llm_load_print_meta: UNK token        = 0 ''
    llm_load_print_meta: LF token         = 13 '<0x0A>'
    llm_load_tensors: ggml ctx size = 0.13 MB
    llm_load_tensors: mem required = 8801.76 MB
    llama_new_context_with_model: n_ctx      = 2048
    llama_new_context_with_model: freq_base  = 10000.0
    llama_new_context_with_model: freq_scale = 1
    llama_new_context_with_model: kv self size = 1600.00 MB
    llama_build_graph: non-view tensors processed: 924/924
    llama_new_context_with_model: compute buffer total size = 782.64 MB

jinhuck commented 11 months ago

Same problem here. LLama.Web is not currently utilizing CUDA.

AsakusaRinne commented 11 months ago

@PaulaScholz For LLama.WebAPI you could enable CUDA usage by adding `GpuLayerCount = xxx` here (or at the corresponding location of this file for the stateful chat); see the sketch below. This project was added in the early days and is more like a demo. As discussed in #269, we'll move the current WebAPI project to the examples and add formal web API support.
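
For illustration, such a change might look roughly like the following in the WebAPI chat service that builds the model parameters (a minimal sketch; the exact file, the `_model` field, and the `LLamaWeights.LoadFromFile` call are assumptions, and the layer count is only an example):

    // Hypothetical location: constructor of the chat service in LLama.WebAPI
    var @params = new Common.ModelParams(configuration["ModelPath"])
    {
        ContextSize = 2048,
        GpuLayerCount = 20, // > 0 offloads that many transformer layers to the GPU; 0 keeps everything on the CPU
    };
    _model = LLamaWeights.LoadFromFile(@params); // _model: hypothetical field holding the loaded weights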

@jinhuck For LLama.Web, however, this is unexpected behaviour because GpuLayerCount is assigned. Could you please pull the latest master branch and run it again with NativeLibraryConfig.WithLogs() added at the very beginning of your code, to see which library was loaded? If the log shows that a CPU version was loaded, you could try specifying the CUDA library explicitly with NativeLibraryConfig.WithLibrary({CUDA_LIBRARY_PATH}).
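
In code, that would look roughly like this (a minimal sketch assuming the `NativeLibraryConfig.Instance` fluent API in `LLama.Native`; the DLL path is only an example and must match where your CUDA build actually lives):

    using LLama.Native;

    // Must run before any other LLamaSharp call, e.g. at the very top of Program.cs.
    NativeLibraryConfig.Instance
        .WithLogs()  // log which native library gets selected and loaded
        .WithLibrary(@".\runtimes\win-x64\native\cuda12\libllama.dll"); // force a specific (CUDA 12) build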

jinhuck commented 11 months ago

Version 0.8.0 is utilizing CUDA and showing stable performance. 0.8.0 is a very solid version!

    $ dotnet run
    [LLamaSharp Native] [Info] Detected OS Platform: WINDOWS
    [LLamaSharp Native] [Info] Detected cuda major version 12.
    [LLamaSharp Native] [Info] C:\Projects\LLamaSharp-0.8.0\LLama.Web\bin\Debug\net7.0\runtimes/win-x64/native/cuda12/libllama.dll is selected and loaded successfully.
    ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
    ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
    ggml_init_cublas: found 2 CUDA devices:
      Device 0: Tesla P40, compute capability 6.1
      Device 1: Quadro M4000, compute capability 5.2
    info: Microsoft.Hosting.Lifetime[14]
          Now listening on: https://localhost:51595
    info: Microsoft.Hosting.Lifetime[14]
          Now listening on: http://localhost:51596

vvdb-architecture commented 9 months ago

Unfortunately, it doesn't work for me, even though the CUDA DLL (0.8.1) is loaded.

    [LLamaSharp Native] [Info] NativeLibraryConfig Description:
    - Path:
    - PreferCuda: True
    - PreferredAvxLevel: AVX2
    - AllowFallback: True
    - SkipCheck: False
    - Logging: True
    - SearchDirectories and Priorities: { ./ }
    [LLamaSharp Native] [Info] Detected OS Platform: WINDOWS
    [LLamaSharp Native] [Info] Detected cuda major version 12.
    [LLamaSharp Native] [Info] ./runtimes/win-x64/native/cuda12/libllama.dll is selected and loaded successfully.
    llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from D:\Source\km\Data\llama-2-7b-guanaco-qlora.Q8_0.gguf (version GGUF V2)

I have installed CUDA and validated that it is indeed correctly installed: a small demo program correctly reports 1 GPU (I have one graphics card).
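
For reference, such a check can be as small as a direct P/Invoke into the CUDA runtime (a hedged sketch, not the poster's actual demo; it assumes the CUDA 12 runtime DLL `cudart64_12.dll` is on the PATH):

    using System;
    using System.Runtime.InteropServices;

    internal static class CudaCheck
    {
        // cudaGetDeviceCount from the CUDA 12 runtime; adjust the DLL name for other CUDA versions.
        [DllImport("cudart64_12.dll", EntryPoint = "cudaGetDeviceCount")]
        private static extern int cudaGetDeviceCount(out int count);

        private static void Main()
        {
            int status = cudaGetDeviceCount(out int count); // 0 == cudaSuccess
            Console.WriteLine(status == 0
                ? $"CUDA runtime OK, {count} GPU(s) visible."
                : $"cudaGetDeviceCount failed with error code {status}.");
        }
    }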