gretadolcetti opened 7 months ago
Can someone please explain what the required `ngl` argument does in practice? From the documentation:

Now my question is: how does the number N affect the performance of the LLM? Is it just a question of GPU memory, and therefore speed? Does it also affect the output produced? Are the layers used always the same (i.e., every layer of the LLM)? Thanks

yes, just a matter of speed

It controls how many "computations" you'd like to offload to the GPU. You should see something like this in the log:

llm_load_tensors: offloading x repeating layers to GPU
llm_load_tensors: offloaded x/33 layers to GPU

Ideally, you'd want 33/33 layers offloaded to the GPU if resources permit.

Ref: https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md#additional-options
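As a rough illustration of the speed/memory trade-off discussed above (not from the thread, and not part of llama.cpp itself): since each offloaded layer has to fit in VRAM, one way to pick a value for `-ngl` is to estimate how many full layers fit in free GPU memory. The function name and all the byte counts below are hypothetical placeholders; real per-layer sizes depend on the model and quantization.

```python
# Sketch: estimate how many transformer layers fit in free VRAM.
# All sizes here are illustrative assumptions, not measured values.

def layers_to_offload(free_vram_bytes: int, total_layers: int,
                      per_layer_bytes: int) -> int:
    """Return a candidate -ngl value: as many whole layers as fit in
    free VRAM, capped at the model's total layer count."""
    if per_layer_bytes <= 0:
        raise ValueError("per_layer_bytes must be positive")
    fit = free_vram_bytes // per_layer_bytes
    return min(total_layers, fit)

# Example: a 33-layer model (as in the log lines quoted in this thread),
# assuming ~200 MiB per layer and 8 GiB of free VRAM -> all 33 layers fit.
print(layers_to_offload(8 * 1024**3, 33, 200 * 1024**2))  # 33

# With only 1 GiB free, only some layers fit; the rest stay on the CPU,
# which is why fewer offloaded layers means slower generation, not
# different output.
print(layers_to_offload(1 * 1024**3, 33, 200 * 1024**2))  # 5
```

The point of the sketch: `-ngl` changes where the layers run, so it affects speed (and VRAM use), not the text the model produces.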