Mozilla-Ocho / llamafile

Distribute and run LLMs with a single file.
https://llamafile.ai

ngl explained #335

Open gretadolcetti opened 7 months ago

gretadolcetti commented 7 months ago

Can someone please explain what the required ngl argument does in practice?

From the documentation

-ngl N, --n-gpu-layers N
             Number of layers to store in VRAM.

Now my question is: how does the value of N affect the LLM's performance? Is it just a question of GPU memory and therefore speed? Does it also affect the output produced? Are the offloaded layers always the same ones (i.e., every layer of the LLM)? Thanks

phineas-pta commented 7 months ago

Yes, it's just a matter of speed.
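
A quick way to check this yourself (a sketch; the model filename is illustrative, and the flags follow llama.cpp's main program, which llamafile wraps):

# CPU only: no layers offloaded
./llamafile -m model.gguf -ngl 0 -p "Hello" -n 64 --temp 0

# Offload everything: a value larger than the model's layer count offloads all layers
./llamafile -m model.gguf -ngl 999 -p "Hello" -n 64 --temp 0

At --temp 0 the generated text should be essentially identical in both runs; what changes is the tokens-per-second in the timing summary. (Small numerical differences between CPU and GPU kernels can occasionally flip a token.)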

xatier commented 6 months ago

It controls how many of the model's layers (the "computations") you'd like to offload to the GPU. You should see something like this in the log:

llm_load_tensors: offloading x repeating layers to GPU                                                                                                                   
llm_load_tensors: offloaded x/33 layers to GPU

Ideally, you'd have all 33/33 layers offloaded to the GPU if resources permit (the total layer count depends on the model).
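
If the whole model doesn't fit in VRAM, a partial offload still helps, since the remaining layers simply run on the CPU (a sketch; the value 20 and the filename are illustrative):

# Offload only 20 layers; the rest stay on the CPU
./llamafile -m model.gguf -ngl 20 -p "Hello" -n 64

The load log should then report something like offloaded 20/33 layers to GPU.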

Ref: https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md#additional-options