abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io

use n_threads param when calling _embed_image_bytes func #1834

Open KenForever1 opened 1 week ago

KenForever1 commented 1 week ago

Hello, when I run the openbmb/MiniCPM-V-2_6-gguf model, I find that llama-cpp-python running as a server is slower than llama.cpp's minicpmv-cli example. The difference is that llama-cpp-python calls its _embed_image_bytes func with the n_threads_batch parameter, while the minicpmv-cli example passes n_threads (set to cpu_cores / 2) when it calls llava_image_embed_make_with_bytes. Using n_threads there makes image processing much more efficient and less time-consuming.

For example, on my 56-core CPU, image processing takes more than three times as long.
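
Something like the following is what I have in mind on the Python side. This is only a sketch: the helper name embed_image_bytes and the cpu_cores / 2 default are my own choices (the heuristic mirrors minicpmv-cli), while the binding calls come from the existing llama_cpp.llava_cpp module:

import ctypes
import os

from llama_cpp import llava_cpp

def embed_image_bytes(clip_ctx, image_bytes: bytes, n_threads: int | None = None):
    # Hypothetical helper: default to cpu_cores / 2 like minicpmv-cli,
    # instead of reusing the (often much smaller) n_threads_batch value.
    if n_threads is None:
        n_threads = max(1, (os.cpu_count() or 2) // 2)
    data = (ctypes.c_uint8 * len(image_bytes)).from_buffer_copy(image_bytes)
    return llava_cpp.llava_image_embed_make_with_bytes(
        clip_ctx, n_threads, data, len(image_bytes)
    )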

This parameter is passed down to the function where the time is actually spent:

bool clip_image_batch_encode(clip_ctx * ctx, const int n_threads, const clip_image_f32_batch * imgs, float * vec) {
    ......
    // n_threads controls how many CPU threads ggml uses to compute the
    // image-encode graph, so a small value makes this call much slower.
    ggml_backend_graph_compute(ctx->backend, gf);
    ......
}
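
To see the gap in practice, here is a rough timing sketch using the helper above (the model and image paths are placeholders; clip_model_load and llava_image_embed_free are the existing llama_cpp.llava_cpp bindings):

import os
import time

from llama_cpp import llava_cpp

# Placeholders: the MiniCPM-V mmproj (CLIP) model and a test image.
clip_ctx = llava_cpp.clip_model_load(b"/path/to/mmproj-model-f16.gguf", 0)
image_bytes = open("/path/to/test.jpg", "rb").read()

# Compare a single thread against the cpu_cores / 2 heuristic.
for n in (1, max(1, (os.cpu_count() or 2) // 2)):
    t0 = time.perf_counter()
    embed = embed_image_bytes(clip_ctx, image_bytes, n_threads=n)  # helper above
    print(f"n_threads={n}: {time.perf_counter() - t0:.2f}s")
    llava_cpp.llava_image_embed_free(embed)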

Best wishes.