mirekphd opened this issue 1 month ago
There is a large common dependency shared by all of these binaries: llama.cpp itself. It is possible to build llama.cpp as a shared library. This is done, for example, in the Windows builds, where most executables are below 1 MB.
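As a sketch, a shared-library build can be requested through the standard CMake `BUILD_SHARED_LIBS` option; the exact paths and the `llama-cli` check below are illustrative assumptions, not commands taken from this thread:

```shell
# Hypothetical build: ask CMake to build the core llama.cpp code as a
# shared library (libllama.so) instead of statically linking it into
# every executable.
cmake -B build -DBUILD_SHARED_LIBS=ON
cmake --build build --config Release

# Each binary should then link against the shared library dynamically,
# which can be verified with ldd:
ldd build/bin/llama-cli | grep llama
```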
@mirekphd Strange, mine are at around 22 MB on Ubuntu:

```
 20M  ./llama-benchmark-matmult
 20M  ./llama-export-lora
 20M  ./llama-gguf
 20M  ./llama-q8dot
 20M  ./llama-vdot
 21M  ./llama-convert-llama2c-to-ggml
 21M  ./llama-quantize-stats
 22M  ./llama-baby-llama
 22M  ./llama-batched
 22M  ./llama-batched-bench
 22M  ./llama-bench
 22M  ./llama-cli
 22M  ./llama-cvector-generator
 22M  ./llama-embedding
 22M  ./llama-eval-callback
 22M  ./llama-finetune
 22M  ./llama-gbnf-validator
 22M  ./llama-gguf-split
 22M  ./llama-gritlm
 22M  ./llama-imatrix
 22M  ./llama-infill
 22M  ./llama-llava-cli
 22M  ./llama-lookahead
 22M  ./llama-lookup
 22M  ./llama-lookup-create
 22M  ./llama-lookup-merge
 22M  ./llama-lookup-stats
 22M  ./llama-parallel
 22M  ./llama-passkey
 22M  ./llama-perplexity
 22M  ./llama-quantize
 22M  ./llama-retrieval
 22M  ./llama-save-load-state
 22M  ./llama-simple
 22M  ./llama-speculative
 22M  ./llama-tokenize
 22M  ./llama-train-text-from-scratch
 23M  ./llama-server
809M  total
```
CUDA builds are a lot bigger due to all the kernels.
@slaren Mine is a CUDA build too:

```
$ ldd llama-cli
        linux-vdso.so.1 (0x00007fff26938000)
        libcuda.so.1 => /lib/x86_64-linux-gnu/libcuda.so.1 (0x0000747931a00000)
        libcublas.so.12 => /usr/local/cuda/targets/x86_64-linux/lib/libcublas.so.12 (0x000074792b200000)
        libcudart.so.12 => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12 (0x000074792ae00000)
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x000074792aa00000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x0000747933719000)
        libgomp.so.1 => /lib/x86_64-linux-gnu/libgomp.so.1 (0x00007479336cf000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007479336af000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x000074792a600000)
        /lib64/ld-linux-x86-64.so.2 (0x0000747934dab000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x0000747934d87000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x0000747934d82000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x0000747934d7d000)
        libcublasLt.so.12 => /usr/local/cuda/targets/x86_64-linux/lib/libcublasLt.so.12 (0x0000747907c00000)
```
The docker images use `CUDA_DOCKER_ARCH=all`, so there is a copy of the kernels for each architecture. `make` uses the native architectures, and `cmake` builds only for 3 architectures.
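As a sketch, the number of kernel copies can be limited at configure time via the standard CMake `CMAKE_CUDA_ARCHITECTURES` variable; the single architecture chosen below (and the exact name of the CUDA enable flag, which has changed across llama.cpp revisions) is an assumption for illustration:

```shell
# Hypothetical: build the CUDA kernels only for one target architecture
# (here compute capability 8.6) instead of a copy per supported
# architecture, which shrinks the binaries considerably.
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86
cmake --build build --config Release
```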
What happened?
I've compiled the latest build of `llama-server` (version b3205) using the method recommended in the docs. The recent renaming of `server` to `llama-server` tempted me to use more than just `server` and `llama-cli`: all 38 of the available `llama-*` binaries that are built by default. They are very heavy though, adding up to nearly 14 GB. They also have nearly identical sizes, suggesting some common heavy dependency or component that could perhaps be deployed separately and reused by them all.

I also noticed that 100 MB (or 40%) has been added to each of these files over the last 10 days alone (judging from their common size due to the suspected shared component, and using the `server` binary size alone, here renamed to `llama.cpp.server`):

Name and Version
What operating system are you seeing the problem on?
Linux
Relevant log output