SJTU-IPADS / PowerInfer

High-speed Large Language Model Serving on PCs with Consumer-grade GPUs

How to count the size of the model and intermediate tensors on the GPU and main memory respectively #147

Closed — YuMJie closed this issue 8 months ago

YuMJie commented 9 months ago

Thank you for your great work! However, I have some questions, as in the title: how can I count the size of the model and the intermediate tensors on the GPU and in main memory, respectively? As I understand it, you have provided "VRAM used" and "mem required", but these may include both the model and the intermediate tensors. How can I get the sizes separately?

An additional question: I see you defined "sparse_pred_threshold" in the llama.cpp file. Can I load the model with just this sparse threshold even if the GPU has a lot of memory?

hodlen commented 8 months ago

Thanks for your interest!

If you'd like to know the size of the GPU-offloaded weight tensors, you can combine the following two approaches:

  1. To view the size of the model weights offloaded to the GPU, you can inspect the tensor offload results printed at model loading time, once you have compiled with LLAMA_OFFLOAD_DEBUG defined.
  2. To view the size of the FFN model weights partially offloaded to the GPU, you can find the gpu_idx and gpu_bucket tensor sizes in the log, which indicate the number of hidden neurons in the MLP and the number offloaded to the GPU, respectively. The following example shows that 1024 of 32768 neurons (the same as the matrix rows/columns) are offloaded; see the sketch after this list for turning these counts into an approximate byte size.
    llama_model_loader: - tensor    0:                    blk.0.gpu_idx i32      [ 32768,     1,     1,     1 ]
    llama_model_loader: - tensor    1:                 blk.0.gpu_bucket i32      [  1024,     1,     1,     1 ]
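
As a rough illustration of how these counts translate into bytes, here is a minimal back-of-envelope sketch in Python. The hidden dimension, the FP16 weight type, and the assumption that all three FFN projections share the same neuron split are illustrative guesses, not values reported by PowerInfer:

    # Back-of-envelope estimate of the FFN weight bytes offloaded to the GPU
    # for one layer, based on the gpu_bucket count from the log above.
    # ASSUMPTIONS (not from the log): hidden_dim = 4096, FP16 weights, and
    # gate/up/down projections all split by the same offloaded neurons.
    hidden_dim        = 4096    # model embedding dimension (assumed)
    offloaded_neurons = 1024    # length of blk.0.gpu_bucket in the log
    bytes_per_weight  = 2       # FP16 (quantized models use fewer bytes)
    ffn_matrices      = 3       # gate, up and down projections (assumed)

    per_layer_bytes = offloaded_neurons * hidden_dim * bytes_per_weight * ffn_matrices
    print(f"~{per_layer_bytes / 2**20:.0f} MiB of FFN weights offloaded for this layer")
    # 1024 * 4096 * 2 * 3 bytes = 24 MiB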

There is no direct way to display the size of the intermediate computation results offloaded to the GPU, but you can calculate the size of each from the weight sizes and the offloading rule (for example, for mul_mat, the result tensor is offloaded once one of its operands is on the GPU).
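
As a concrete (and purely illustrative) example of applying that rule, the sketch below sizes the result of one mul_mat whose weight operand lives on the GPU; the shapes and the FP32 element type are assumptions, not values from this thread:

    # The mul_mat of a GPU-resident weight [rows x cols] with activations
    # [cols x tokens] produces a [rows x tokens] result that stays on the GPU
    # under the rule above. Shapes and element type are assumed for illustration.
    rows, cols = 32768, 4096     # weight matrix shape (assumed)
    tokens = 1                   # activations for a single token
    bytes_per_element = 4        # FP32 intermediate results (assumed)

    result_bytes = rows * tokens * bytes_per_element
    print(f"mul_mat result: {result_bytes / 2**10:.0f} KiB on the GPU")
    # 32768 * 1 * 4 bytes = 128 KiB per token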

hodlen commented 8 months ago

> I see you defined "sparse_pred_threshold" in the llama.cpp file. Can I load the model with just this sparse threshold even if the GPU has a lot of memory?

This threshold has no effect on tensor offloading; it tunes the sparsity level of the model at inference time. To control how many tensors are offloaded, you can use the --vram-budget argument. Please have a look at the Inference examples.
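
To get a feel for how a VRAM budget bounds offloading, here is a back-of-envelope sketch; every number in it (layer count, hidden dimension, FP16 weights, a budget expressed in GiB, and ignoring attention weights, the KV cache and activation buffers) is an illustrative assumption, not PowerInfer's actual placement logic:

    # Rough upper bound on how many FFN neurons per layer fit in a VRAM budget,
    # assuming the whole budget goes to FFN weights (it does not in practice:
    # attention weights, the KV cache and buffers also consume VRAM).
    vram_budget_bytes = 8 * 2**30          # e.g. a budget of 8 GiB (assumed unit)
    n_layers, hidden_dim = 32, 4096        # LLaMA-7B-like shape (assumed)
    ffn_matrices, bytes_per_weight = 3, 2  # gate/up/down, FP16 (assumed)

    bytes_per_neuron_per_layer = hidden_dim * ffn_matrices * bytes_per_weight
    neurons_per_layer = vram_budget_bytes // (n_layers * bytes_per_neuron_per_layer)
    print(f"at most ~{neurons_per_layer} FFN neurons per layer fit in the budget")
    # 8 GiB / (32 layers * 4096 * 3 * 2 B) ≈ 10922 neurons per layer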

YuMJie commented 8 months ago

Thank you for your detailed explanation!