YuMJie closed this issue 8 months ago.
Thanks for your interest!
If you'd like to know the size of GPU-offloaded weight tensors, you can combine the following two approaches:
1. Build with LLAMA_OFFLOAD_DEBUG defined.
2. Check the gpu_idx and gpu_bucket tensor sizes in the log, which indicate the number of hidden neurons in the MLP and the number offloaded to the GPU, respectively. The example below shows that 1024 of 32768 neurons (the same as the matrix rows/columns) are offloaded.
llama_model_loader: - tensor 0: blk.0.gpu_idx i32 [ 32768, 1, 1, 1 ]
llama_model_loader: - tensor 1: blk.0.gpu_bucket i32 [ 1024, 1, 1, 1 ]
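As a rough illustration, here is a minimal Python sketch of the arithmetic implied by that log. The embedding dimension, FP16 weights, and the number of FFN matrices split per layer are assumptions for illustration, not values taken from the log above:

# Estimate the GPU-offloaded FFN weight size for one layer from the gpu_idx/gpu_bucket sizes.
# Assumptions (placeholders): n_embd = 4096, FP16 weights (2 bytes), and three FFN
# projections split along the neuron dimension; quantized weights would change the byte count.
n_neurons_total = 32768   # length of blk.0.gpu_idx: total FFN neurons in this layer
n_neurons_gpu   = 1024    # length of blk.0.gpu_bucket: neurons offloaded to the GPU
n_embd          = 4096    # hypothetical hidden size
bytes_per_elem  = 2       # FP16
n_ffn_matrices  = 3       # e.g. up/gate/down projections, if all are split this way

frac_offloaded   = n_neurons_gpu / n_neurons_total            # 0.03125, i.e. ~3.1%
bytes_per_matrix = n_neurons_gpu * n_embd * bytes_per_elem    # 8 MiB per matrix
layer_offloaded  = n_ffn_matrices * bytes_per_matrix          # ~24 MiB for this layer

print(f"{frac_offloaded:.2%} of FFN neurons offloaded")
print(f"~{layer_offloaded / 2**20:.1f} MiB of FFN weights on GPU for this layer")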
There is no direct way to display the size of intermediate computation results offloaded to the GPU. However, you can calculate the size of each one from the weight sizes and the offloading rule (e.g., for mul_mat, the result tensor is offloaded once one operand is on the GPU).
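For example, here is a minimal Python sketch of applying that rule by hand; the tensor shapes and the FP32 activation type are assumptions for illustration, not values reported by the runtime:

# Per the rule above: if one operand of a mul_mat is on the GPU, the result tensor is
# offloaded too, so its size counts toward GPU memory for intermediate results.
def mul_mat_result_bytes(n_rows_out, n_tokens, bytes_per_elem=4):
    # Assuming FP32 intermediate activations (4 bytes per element).
    return n_rows_out * n_tokens * bytes_per_elem

# Example: the GPU-resident slice of an FFN projection (1024 offloaded neurons)
# applied to a batch of 8 token activations yields a 1024 x 8 FP32 result.
print(mul_mat_result_bytes(1024, 8))  # 32768 bytes, i.e. ~32 KiB on the GPU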
I see you defined "sparse_pred_threshold" in the llama.cpp file. Can I load the model with just this sparse threshold even if the GPU has a lot of memory?
This threshold does not affect tensor offloading; it tunes the sparsity level of the model at inference time. To control how many tensors are offloaded, use the --vram-budget argument. Please have a look at the Inference examples.
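For instance, an invocation along these lines (the model path is a placeholder and the budget unit, GiB here, is an assumption; see the Inference examples for the exact flags):

./build/bin/main -m /PATH/TO/MODEL -n 128 -p "Once upon a time" --vram-budget 8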
Thanks for your detailed explanation!
Thank you for your great work! However, I have some questions, as in the title: how can I count the size of the model and of the intermediate tensors on the GPU and in main memory, respectively? As far as I know, you provide "VRAM used" and "mem required", but each of these may include both the model and the intermediate tensors. How can I get the sizes separately? An additional question: I see you defined "sparse_pred_threshold" in the llama.cpp file. Can I load the model with just this sparse threshold even if the GPU has a lot of memory?