Hello, when I run the openbmb/MiniCPM-V-2_6-gguf model, I find that llama-cpp-python running as a server is slower than llama.cpp's minicpmv-cli example.
The difference I found is that llama-cpp-python's _embed_image_bytes func passes n_threads_batch to llava_image_embed_make_with_bytes, while llama.cpp's minicpmv-cli example passes n_threads (whose value is cpu_cores / 2) to the same function. Using n_threads makes the image processing more efficient and less time-consuming.
For example, on my CPU (56 cores), image processing takes more than three times as long with the current behavior.
This parameter affects how long the following function takes:
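For reference, here is a rough Python sketch of the call in question (the helper embed_image and the variable clip_ctx are only for illustration; the actual call sites are llama-cpp-python's _embed_image_bytes and the minicpmv-cli example):

```python
import ctypes
import os

from llama_cpp import llava_cpp  # ships with llama-cpp-python


def embed_image(clip_ctx, image_bytes: bytes, n_threads: int):
    """Build a llava image embedding with an explicit thread count.

    n_threads is forwarded to llava_image_embed_make_with_bytes and controls
    how many CPU threads the CLIP image encode uses.
    """
    data = (ctypes.c_uint8 * len(image_bytes)).from_buffer(bytearray(image_bytes))
    return llava_cpp.llava_image_embed_make_with_bytes(
        clip_ctx, n_threads, data, len(image_bytes)
    )


# llama-cpp-python server today: the thread count comes from n_threads_batch,
# which is small on my setup, so the image encode runs on few threads:
# embed = embed_image(clip_ctx, image_bytes, llama.context_params.n_threads_batch)

# minicpmv-cli: the thread count is n_threads (about half the cores by default),
# so the same encode finishes more than 3x faster on my 56-core machine:
# embed = embed_image(clip_ctx, image_bytes, os.cpu_count() // 2)
```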
Best wishes.