SJTU-IPADS / PowerInfer

High-speed Large Language Model Serving on PCs with Consumer-grade GPUs

Need quite a long time to load the model #188

Open · meicale opened this issue 6 months ago

meicale commented 6 months ago


Question Details

As the title says: the model takes quite a long time to load.

Additional Context

I am using WSL2 (Ubuntu 22.04) on Windows 11 with a plug-in NVIDIA GPU, an RTX 4060 Ti.
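One common cause on this kind of setup (an assumption, not something the log confirms): under WSL2, files kept on the Windows drive (/mnt/c, mounted over the 9P protocol) can read an order of magnitude slower than files on the WSL ext4 filesystem. A minimal Python sketch to measure raw sequential-read throughput of the model file; the path is inferred from the adapter path in the log, so adjust it to your setup, and run it on a cold cache for a meaningful number:

import time

MODEL_PATH = "./ReluLLaMA-7B/llama-7b-relu.powerinfer.gguf"  # inferred from the log; adjust to your setup
CHUNK = 64 * 1024 * 1024  # 64 MiB reads, large enough to approximate sequential I/O

total = 0
start = time.perf_counter()
with open(MODEL_PATH, "rb") as f:
    while True:
        data = f.read(CHUNK)
        if not data:
            break
        total += len(data)
elapsed = time.perf_counter() - start
print(f"read {total / 2**20:.0f} MiB in {elapsed:.1f} s "
      f"({total / 2**20 / elapsed:.0f} MiB/s)")

If this reports only tens of MiB/s, copying the model into the Linux filesystem (e.g. somewhere under the home directory) before loading should shorten the load time considerably.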

....................................................................................................
llama_model_loader: loaded meta data with 3 key-value pairs and 64 tensors from ./ReluLLaMA-7B/llama-7b-relu.powerinfer.gguf.generated.gpuidx (version GGUF V3 (latest))
llama_model_loader: - tensor    0:                    blk.0.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor    1:                 blk.0.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor    2:                    blk.1.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor    3:                 blk.1.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor    4:                    blk.2.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor    5:                 blk.2.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor    6:                    blk.3.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor    7:                 blk.3.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor    8:                    blk.4.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor    9:                 blk.4.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   10:                    blk.5.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   11:                 blk.5.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   12:                    blk.6.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   13:                 blk.6.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   14:                    blk.7.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   15:                 blk.7.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   16:                    blk.8.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   17:                 blk.8.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   18:                    blk.9.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   19:                 blk.9.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   20:                   blk.10.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   21:                blk.10.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   22:                   blk.11.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   23:                blk.11.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   24:                   blk.12.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   25:                blk.12.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   26:                   blk.13.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   27:                blk.13.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   28:                   blk.14.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   29:                blk.14.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   30:                   blk.15.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   31:                blk.15.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   32:                   blk.16.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   33:                blk.16.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   34:                   blk.17.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   35:                blk.17.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   36:                   blk.18.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   37:                blk.18.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   38:                   blk.19.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   39:                blk.19.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   40:                   blk.20.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   41:                blk.20.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   42:                   blk.21.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   43:                blk.21.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   44:                   blk.22.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   45:                blk.22.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   46:                   blk.23.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   47:                blk.23.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   48:                   blk.24.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   49:                blk.24.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   50:                   blk.25.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   51:                blk.25.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   52:                   blk.26.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   53:                blk.26.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   54:                   blk.27.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   55:                blk.27.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   56:                   blk.28.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   57:                blk.28.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   58:                   blk.29.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   59:                blk.29.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   60:                   blk.30.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   61:                blk.30.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   62:                   blk.31.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   63:                blk.31.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: unknown type i32
llama_model_loader: - kv   0:                       general.architecture str
llama_model_loader: - kv   1:              generic.gpu_index.block_count u32
llama_model_loader: - kv   2:                        split.vram_capacity u64
llama_model_loader: - type  i32:   64 tensors
loaded gpu_idx, vram_required: 9186557952
load_gpu_idx_for_model: applying gpu_idx adapter from './ReluLLaMA-7B/llama-7b-relu.powerinfer.gguf.generated.gpuidx' - please wait ...
................................................................ done (168.43 ms)
offload_ffn_split: applying augmentation to model - please wait ...
................................ done (4191.19 ms)
llm_load_gpu_split: offloaded 8256.00 MiB of FFN weights to GPU
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 256.00 MB
llama_new_context_with_model: kv self size  =  256.00 MB
llama_build_graph: non-view tensors processed: 580/836
llama_build_graph: ****************************************************************
llama_build_graph: not all non-view tensors have been processed with a callback
llama_build_graph: this can indicate an inefficiency in the graph implementation
llama_build_graph: build with LLAMA_OFFLOAD_DEBUG for more info
llama_build_graph: ref: https://github.com/ggerganov/llama.cpp/pull/3837
llama_build_graph: ****************************************************************
llama_new_context_with_model: compute buffer total size = 87.07 MB
llama_new_context_with_model: VRAM scratch buffer: 85.50 MB
llama_new_context_with_model: total VRAM used: 22793.52 MB (model: 14196.02 MB, context: 341.50 MB)

system_info: n_threads = 8 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0

once upon a time, there was a kingdom of junk.
their king was named clutter.
he had many wives. the queen was called junk.
her son was named mess.
one day a young prince (me) walked into the castle and found out what he looked like. He immediately knew that he must become king of his own domain so he could get rid of all the clutter and make everything clean and orderly again. The prince set off on his journey and met many interesting characters along the way, from a wise old man who showed him how to keep order in his
kingdom by using the principle of “
llama_print_timings:        load time =  640021.06 ms
llama_print_timings:      sample time =      74.77 ms /   128 runs   (    0.58 ms per token,  1711.92 tokens per second)
llama_print_timings: prompt eval time =      70.04 ms /     5 tokens (   14.01 ms per token,    71.39 tokens per second)
llama_print_timings:        eval time =    7664.05 ms /   127 runs   (   60.35 ms per token,    16.57 tokens per second)
llama_print_timings:       total time =    8012.75 ms
Log end
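One back-of-envelope check on those timings: load time = 640021.06 ms is about 640 s, and the log reports roughly 14196 MB of model weights, so the effective load rate works out to about 14196 MB / 640 s ≈ 22 MB/s. Even a SATA SSD sustains around 500 MB/s on sequential reads (NVMe drives several times that), so a rate this low suggests the time is being spent in the storage path, consistent with the WSL2 cross-filesystem mount suspected above, rather than in PowerInfer's own setup; once the model is resident, generation runs at a reasonable 16.57 tokens per second.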

Thank you!