Question Details

As the title says: the model takes quite a long time to load.

Additional Context

I am using WSL2 / Ubuntu 22.04 on Windows 11 with a discrete NVIDIA GPU (an RTX 4060 Ti).
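One thing I cannot rule out is disk I/O: under WSL2, files that live on the Windows side (e.g. under /mnt/c/...) are accessed through the 9P protocol and read far slower than files on the native ext4 filesystem, which could explain a multi-minute load. A minimal sketch I used to time a raw sequential read of the GGUF (the path and chunk size are just examples from my setup):

```python
# Hypothetical sanity check (path is an example; adjust to your setup).
# If this reports very low throughput, the model file is probably on the
# Windows side of WSL2 and the load time is dominated by 9P file access.
import os
import time

MODEL_PATH = "./ReluLLaMA-7B/llama-7b-relu.powerinfer.gguf"  # example path
CHUNK = 64 * 1024 * 1024  # read in 64 MiB chunks

size = os.path.getsize(MODEL_PATH)
start = time.monotonic()
with open(MODEL_PATH, "rb", buffering=0) as f:
    while f.read(CHUNK):  # empty bytes at EOF ends the loop
        pass
elapsed = time.monotonic() - start
print(f"read {size / 2**20:.0f} MiB in {elapsed:.1f} s "
      f"({size / 2**20 / elapsed:.0f} MiB/s)")
```

Full log from the run: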
....................................................................................................
llama_model_loader: loaded meta data with 3 key-value pairs and 64 tensors from ./ReluLLaMA-7B/llama-7b-relu.powerinfer.gguf.generated.gpuidx (version GGUF V3 (latest))
llama_model_loader: - tensor 0: blk.0.gpu_idx i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 1: blk.0.gpu_bucket i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 2: blk.1.gpu_idx i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 3: blk.1.gpu_bucket i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 4: blk.2.gpu_idx i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 5: blk.2.gpu_bucket i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 6: blk.3.gpu_idx i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 7: blk.3.gpu_bucket i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 8: blk.4.gpu_idx i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 9: blk.4.gpu_bucket i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 10: blk.5.gpu_idx i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 11: blk.5.gpu_bucket i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 12: blk.6.gpu_idx i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 13: blk.6.gpu_bucket i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 14: blk.7.gpu_idx i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 15: blk.7.gpu_bucket i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 16: blk.8.gpu_idx i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 17: blk.8.gpu_bucket i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 18: blk.9.gpu_idx i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 19: blk.9.gpu_bucket i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 20: blk.10.gpu_idx i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 21: blk.10.gpu_bucket i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 22: blk.11.gpu_idx i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 23: blk.11.gpu_bucket i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 24: blk.12.gpu_idx i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 25: blk.12.gpu_bucket i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 26: blk.13.gpu_idx i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 27: blk.13.gpu_bucket i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 28: blk.14.gpu_idx i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 29: blk.14.gpu_bucket i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 30: blk.15.gpu_idx i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 31: blk.15.gpu_bucket i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 32: blk.16.gpu_idx i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 33: blk.16.gpu_bucket i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 34: blk.17.gpu_idx i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 35: blk.17.gpu_bucket i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 36: blk.18.gpu_idx i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 37: blk.18.gpu_bucket i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 38: blk.19.gpu_idx i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 39: blk.19.gpu_bucket i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 40: blk.20.gpu_idx i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 41: blk.20.gpu_bucket i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 42: blk.21.gpu_idx i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 43: blk.21.gpu_bucket i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 44: blk.22.gpu_idx i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 45: blk.22.gpu_bucket i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 46: blk.23.gpu_idx i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 47: blk.23.gpu_bucket i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 48: blk.24.gpu_idx i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 49: blk.24.gpu_bucket i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 50: blk.25.gpu_idx i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 51: blk.25.gpu_bucket i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 52: blk.26.gpu_idx i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 53: blk.26.gpu_bucket i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 54: blk.27.gpu_idx i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 55: blk.27.gpu_bucket i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 56: blk.28.gpu_idx i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 57: blk.28.gpu_bucket i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 58: blk.29.gpu_idx i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 59: blk.29.gpu_bucket i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 60: blk.30.gpu_idx i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 61: blk.30.gpu_bucket i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 62: blk.31.gpu_idx i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 63: blk.31.gpu_bucket i32 [ 11008, 1, 1, 1 ]
llama_model_loader: unknown type i32
llama_model_loader: - kv 0: general.architecture str
llama_model_loader: - kv 1: generic.gpu_index.block_count u32
llama_model_loader: - kv 2: split.vram_capacity u64
llama_model_loader: - type i32: 64 tensors
loaded gpu_idx, vram_required: 9186557952
load_gpu_idx_for_model: applying gpu_idx adapter from './ReluLLaMA-7B/llama-7b-relu.powerinfer.gguf.generated.gpuidx' - please wait ...
................................................................ done (168.43 ms)
offload_ffn_split: applying augmentation to model - please wait ...
................................ done (4191.19 ms)
llm_load_gpu_split: offloaded 8256.00 MiB of FFN weights to GPU
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 256.00 MB
llama_new_context_with_model: kv self size = 256.00 MB
llama_build_graph: non-view tensors processed: 580/836
llama_build_graph: ****************************************************************
llama_build_graph: not all non-view tensors have been processed with a callback
llama_build_graph: this can indicate an inefficiency in the graph implementation
llama_build_graph: build with LLAMA_OFFLOAD_DEBUG for more info
llama_build_graph: ref: https://github.com/ggerganov/llama.cpp/pull/3837
llama_build_graph: ****************************************************************
llama_new_context_with_model: compute buffer total size = 87.07 MB
llama_new_context_with_model: VRAM scratch buffer: 85.50 MB
llama_new_context_with_model: total VRAM used: 22793.52 MB (model: 14196.02 MB, context: 341.50 MB)
system_info: n_threads = 8 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0
once upon a time, there was a kingdom of junk.
their king was named clutter.
he had many wives. the queen was called junk.
her son was named mess.
one day a young prince (me) walked into the castle and found out what he looked like. He immediately knew that he must become king of his own domain so he could get rid of all the clutter and make everything clean and orderly again. The prince set off on his journey and met many interesting characters along the way, from a wise old man who showed him how to keep order in his
kingdom by using the principle of “
llama_print_timings: load time = 640021.06 ms
llama_print_timings: sample time = 74.77 ms / 128 runs ( 0.58 ms per token, 1711.92 tokens per second)
llama_print_timings: prompt eval time = 70.04 ms / 5 tokens ( 14.01 ms per token, 71.39 tokens per second)
llama_print_timings: eval time = 7664.05 ms / 127 runs ( 60.35 ms per token, 16.57 tokens per second)
llama_print_timings: total time = 8012.75 ms
Log end
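To put the timings in perspective, here is the arithmetic over the values reported above (just a quick sanity check in Python; the constants are copied verbatim from the log):

```python
# Quick arithmetic over the numbers in the log above.
MiB = 2**20

vram_required_bytes = 9_186_557_952           # "loaded gpu_idx, vram_required"
print(f"vram_required = {vram_required_bytes / MiB:,.2f} MiB")  # ~8,760.98 MiB

load_ms = 640_021.06                          # llama_print_timings: load time
print(f"load time     = {load_ms / 1000:.1f} s ({load_ms / 60_000:.1f} min)")

eval_ms, eval_runs = 7_664.05, 127            # llama_print_timings: eval time
print(f"eval speed    = {eval_runs / (eval_ms / 1000):.2f} tokens/s")  # ~16.57
```

So once loaded, generation looks healthy at ~16.6 tokens/s; it is only the ~10.7 minute load that seems off. I also notice the reported total VRAM used (22793.52 MB) is larger than the 16 GB an RTX 4060 Ti can physically have, so I am not sure how that number is accounted.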
Thank you!