Closed: agentsimon closed this issue 6 months ago
These are outputs from llama.cpp.
On a GPU machine, after following the instructions for installing llama-cpp-python
with GPU acceleration, you will see something like this in the output:
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 70.31 MiB
llm_load_tensors: CUDA0 buffer size = 1967.79 MiB
llm_load_tensors: CUDA1 buffer size = 1853.15 MiB
..................................................................................................
llama_new_context_with_model: n_ctx = 3900
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 1035.94 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 914.06 MiB
llama_new_context_with_model: KV self size = 1950.00 MiB, K (f16): 975.00 MiB, V (f16): 975.00 MiB
llama_new_context_with_model: graph splits (measure): 5
llama_new_context_with_model: CUDA0 compute buffer size = 550.74 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 550.74 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 31.24 MiB
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
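If the flags line reports BLAS = 0, the installed wheel was built without GPU/BLAS support and needs a rebuild. A minimal sketch of the reinstall, assuming the cuBLAS CMake flag from the llama-cpp-python docs of that era (newer releases use a different backend flag, so adjust to your version and hardware):

```shell
# Rebuild llama-cpp-python from source with the CUDA (cuBLAS) backend enabled.
# LLAMA_CUBLAS is the historical flag name; check your version's docs.
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 \
  pip install --force-reinstall --no-cache-dir llama-cpp-python
```

--no-cache-dir matters here: without it, pip may reuse a previously built CPU-only wheel instead of recompiling.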
I did a fresh install, dropped the n_gpu_layers=12, and it works. Thanks, great job!

llama_new_context_with_model: n_ctx = 3900
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 304.69 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 182.81 MiB
llama_new_context_with_model: KV self size = 487.50 MiB, K (f16): 243.75 MiB, V (f16): 243.75 MiB
llama_new_context_with_model: CUDA_Host input buffer size = 35.26 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 566.74 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 551.50 MiB
llama_new_context_with_model: graph splits (measure): 5
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
I'm running this on my laptop and have installed BLAS as per the documentation, but I'm not sure it's working. Whenever I run a prompt, it prints this after the model loads:

AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
Any information greatly appreciated.
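The flags line printed at load time reports compile-time features, so it answers this directly: BLAS = 1 means a BLAS backend was compiled in, BLAS = 0 means it was not. A small sketch that checks a captured flags line (here, a shortened sample of the line from the post above):

```shell
# llama.cpp prints this system-info line once at startup.
# BLAS = 0 in it means the build has no BLAS backend, regardless of what
# libraries are installed on the machine.
log='AVX = 1 | AVX2 = 1 | FMA = 1 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1'
if printf '%s\n' "$log" | grep -q 'BLAS = 1'; then
  echo 'BLAS enabled in this build'
else
  echo 'BLAS not enabled: reinstall llama-cpp-python with a BLAS/CUDA backend'
fi
```

In other words, installing a BLAS library system-wide is not enough; llama-cpp-python has to be rebuilt so the flag flips to BLAS = 1.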