amaiya / onprem

A tool for running on-premises large language models with non-public data
https://amaiya.github.io/onprem
Apache License 2.0

Curious #57

Closed: agentsimon closed this issue 6 months ago

agentsimon commented 6 months ago

I'm running this on my laptop and have installed BLAS as per the documentation, but I'm not sure it's working. Whenever I run a prompt, it prints this after the model loads:

AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |

Any information greatly appreciated.
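
For reference, here is roughly how I'm invoking it; a minimal sketch along the lines of the onprem quickstart (the prompt text is just a placeholder):

# Sketch of my setup, following the onprem quickstart.
from onprem import LLM

# LLM() downloads and loads a default model through llama-cpp-python;
# the capability flags (AVX, BLAS, ...) are printed during this load step.
llm = LLM()

# Running any prompt reproduces the banner above, including BLAS = 0.
print(llm.prompt("What is on-premises machine learning?"))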

amaiya commented 6 months ago

These are capability flags printed by llama.cpp (via llama-cpp-python) when it loads a model. In your output, BLAS = 0, which means your llama-cpp-python build was compiled without BLAS or GPU acceleration.

On a GPU machine, after following the instructions for installing llama-cpp-python with GPU acceleration, you will see something like this in the output (note BLAS = 1 on the last line):

llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =    70.31 MiB
llm_load_tensors:      CUDA0 buffer size =  1967.79 MiB
llm_load_tensors:      CUDA1 buffer size =  1853.15 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 3900
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  1035.94 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =   914.06 MiB
llama_new_context_with_model: KV self size  = 1950.00 MiB, K (f16):  975.00 MiB, V (f16):  975.00 MiB
llama_new_context_with_model: graph splits (measure): 5
llama_new_context_with_model:      CUDA0 compute buffer size =   550.74 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =   550.74 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    31.24 MiB
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
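
Since your banner shows BLAS = 0, reinstalling llama-cpp-python with CUDA enabled (per the onprem docs, something like CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir) should flip it to BLAS = 1. You can also sanity-check the installed build from Python; a sketch, and note that llama_supports_gpu_offload is assumed to be exposed by your llama-cpp-python version:

# Sketch: check whether the installed llama-cpp-python build can offload to GPU.
import llama_cpp

print("llama-cpp-python version:", llama_cpp.__version__)
# Assumed binding: recent llama-cpp-python versions expose this llama.cpp call;
# if it's missing, the installed build predates it.
print("GPU offload supported:", llama_cpp.llama_supports_gpu_offload())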
agentsimon commented 6 months ago

I did a fresh install, dropped n_gpu_layers to 12, and it works. Thanks, great job!

llama_new_context_with_model: n_ctx      = 3900
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =   304.69 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =   182.81 MiB
llama_new_context_with_model: KV self size  =  487.50 MiB, K (f16):  243.75 MiB, V (f16):  243.75 MiB
llama_new_context_with_model:  CUDA_Host input buffer size  =    35.26 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   566.74 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =   551.50 MiB
llama_new_context_with_model: graph splits (measure): 5
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
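
For anyone who finds this later: n_gpu_layers is what controls the partial offload in the log above; it is forwarded to llama-cpp-python. A minimal sketch, assuming default model settings and that 12 layers fit in your VRAM:

# Sketch: partial GPU offload through onprem.
# With n_gpu_layers=12, twelve layers go to CUDA0 and the rest stay on the
# CPU, which is why both CUDA0 and CUDA_Host buffers appear in the log.
from onprem import LLM

llm = LLM(n_gpu_layers=12)  # raise this value until VRAM runs out
print(llm.prompt("Hello"))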