ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Bug: similar sizes suggest some heavy shared component in all 38 `llama-*` binaries (which now weigh 14 GB in total) #8080

Open mirekphd opened 1 month ago

mirekphd commented 1 month ago

What happened?

I've compiled the latest build of llama-server (version b3205) using the method recommended in the docs.

The recent renaming of server to llama-server has tempted me to use not just llama-server and llama-cli, but all 38 of the llama-* binaries that are built by default.

They are very heavy though: together they add up to nearly 14 GB. They also have nearly identical sizes, which suggests some heavy common dependency or component that could perhaps be deployed separately and reused by them all?

I also noticed that about 100 MB (roughly 40%) has been added to each of these files over the last 10 days alone. I'm judging this from the size of the server binary alone (renamed here to llama.cpp.server), on the assumption that all the binaries share the same heavy component and therefore grow at the same rate:

$ docker run --rm -it --name test latestml/ml-gpu-py311-cuda118-infer:20240613 bash -c "du -h /usr/bin/llama* | sort -h"
255M    /usr/bin/llama.cpp.server 
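
(For comparison, the same binary from the current b3205 build is 362 MB, as shown in the log output below, i.e. about 107 MB or roughly 42% larger.)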

Name and Version

# note: llama-server and other llama-* binaries were built from the `b3205` release code bundle

$ llama-cli --version
llama-cli: /opt/conda/lib/libcurl.so.4: no version information available (required by llama-cli)
version: 0 (unknown)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

$ du -sch /usr/bin/llama-* | sort -h
360M    /usr/bin/llama-benchmark-matmult
360M    /usr/bin/llama-export-lora
360M    /usr/bin/llama-gguf
360M    /usr/bin/llama-q8dot
360M    /usr/bin/llama-vdot
361M    /usr/bin/llama-convert-llama2c-to-ggml
361M    /usr/bin/llama-quantize-stats
362M    /usr/bin/llama-baby-llama
362M    /usr/bin/llama-batched
362M    /usr/bin/llama-batched-bench
362M    /usr/bin/llama-bench
362M    /usr/bin/llama-cli
362M    /usr/bin/llama-cvector-generator
362M    /usr/bin/llama-embedding
362M    /usr/bin/llama-eval-callback
362M    /usr/bin/llama-finetune
362M    /usr/bin/llama-gbnf-validator
362M    /usr/bin/llama-gguf-split
362M    /usr/bin/llama-gritlm
362M    /usr/bin/llama-imatrix
362M    /usr/bin/llama-infill
362M    /usr/bin/llama-llava-cli
362M    /usr/bin/llama-lookahead
362M    /usr/bin/llama-lookup
362M    /usr/bin/llama-lookup-create
362M    /usr/bin/llama-lookup-merge
362M    /usr/bin/llama-lookup-stats
362M    /usr/bin/llama-parallel
362M    /usr/bin/llama-passkey
362M    /usr/bin/llama-perplexity
362M    /usr/bin/llama-quantize
362M    /usr/bin/llama-retrieval
362M    /usr/bin/llama-save-load-state
362M    /usr/bin/llama-server
362M    /usr/bin/llama-simple
362M    /usr/bin/llama-speculative
362M    /usr/bin/llama-tokenize
362M    /usr/bin/llama-train-text-from-scratch
14G total
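
A quick way to test the shared-component hypothesis above would be to check how much of two of these binaries is actually identical. A minimal sketch, assuming GNU binutils and coreutils are available inside the image (binary paths taken from the listing above):

$ size -A -d /usr/bin/llama-cli /usr/bin/llama-server       # per-section breakdown (.text, .rodata, ...)
$ cmp -l /usr/bin/llama-q8dot /usr/bin/llama-vdot | wc -l   # count of differing bytes between two near-identically sized binaries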
slaren commented 1 month ago

There is a large common dependency shared by all of these binaries, which is llama.cpp itself. It is possible to build llama.cpp as a shared library. This is done for example in the Windows builds, and most executables are below 1 MB.
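
For reference, a minimal sketch of such a shared-library build, assuming a CMake setup where the standard BUILD_SHARED_LIBS option is honoured (the CUDA toggle is -DLLAMA_CUDA=ON around b3205 and -DGGML_CUDA=ON in later versions, so the exact flags may need adjusting):

# configure with shared libllama/libggml instead of static linking
$ cmake -B build -DBUILD_SHARED_LIBS=ON -DLLAMA_CUDA=ON
$ cmake --build build --config Release -j
# the executables should then be small and pull in the shared library at runtime
$ ldd build/bin/llama-cli | grep -E 'libllama|libggml'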

dspasyuk commented 1 month ago

@mirekphd Strange, mine are at around 22 MB on Ubuntu:

20M ./llama-benchmark-matmult
20M ./llama-export-lora
20M ./llama-gguf
20M ./llama-q8dot
20M ./llama-vdot
21M ./llama-convert-llama2c-to-ggml
21M ./llama-quantize-stats
22M ./llama-baby-llama
22M ./llama-batched
22M ./llama-batched-bench
22M ./llama-bench
22M ./llama-cli
22M ./llama-cvector-generator
22M ./llama-embedding
22M ./llama-eval-callback
22M ./llama-finetune
22M ./llama-gbnf-validator
22M ./llama-gguf-split
22M ./llama-gritlm
22M ./llama-imatrix
22M ./llama-infill
22M ./llama-llava-cli
22M ./llama-lookahead
22M ./llama-lookup
22M ./llama-lookup-create
22M ./llama-lookup-merge
22M ./llama-lookup-stats
22M ./llama-parallel
22M ./llama-passkey
22M ./llama-perplexity
22M ./llama-quantize
22M ./llama-retrieval
22M ./llama-save-load-state
22M ./llama-simple
22M ./llama-speculative
22M ./llama-tokenize
22M ./llama-train-text-from-scratch
23M ./llama-server
809M total

slaren commented 1 month ago

CUDA builds are a lot bigger due to all the kernels.
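
One way to see where the space goes is to list the GPU code embedded in one of the fat binaries; most of the size of a multi-architecture CUDA build is typically the per-architecture kernel images. A sketch assuming the CUDA toolkit's cuobjdump is installed:

$ cuobjdump --list-elf /usr/bin/llama-cli   # one cubin (ELF) entry per compiled compute capability
$ cuobjdump --list-ptx /usr/bin/llama-cli   # embedded PTX images, if any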

dspasyuk commented 1 month ago

@slaren Mine is a CUDA build too:

$ ldd llama-cli
    linux-vdso.so.1 (0x00007fff26938000)
    libcuda.so.1 => /lib/x86_64-linux-gnu/libcuda.so.1 (0x0000747931a00000)
    libcublas.so.12 => /usr/local/cuda/targets/x86_64-linux/lib/libcublas.so.12 (0x000074792b200000)
    libcudart.so.12 => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12 (0x000074792ae00000)
    libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x000074792aa00000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x0000747933719000)
    libgomp.so.1 => /lib/x86_64-linux-gnu/libgomp.so.1 (0x00007479336cf000)
    libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007479336af000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x000074792a600000)
    /lib64/ld-linux-x86-64.so.2 (0x0000747934dab000)
    libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x0000747934d87000)
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x0000747934d82000)
    librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x0000747934d7d000)
    libcublasLt.so.12 => /usr/local/cuda/targets/x86_64-linux/lib/libcublasLt.so.12 (0x0000747907c00000)

slaren commented 1 month ago

The docker images use CUDA_DOCKER_ARCH=all, so there is a copy of the kernels for each architecture. make uses the native architectures, and cmake builds only for 3 architectures.
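
If only one GPU generation needs to be supported, the binaries (and the image) can be shrunk by restricting the build to a single architecture. A sketch, assuming a compute capability 8.6 (Ampere) card; flag names differ between llama.cpp versions (LLAMA_CUDA vs GGML_CUDA) and between the make and cmake builds:

# Makefile build: compile kernels for a single architecture instead of 'all'
$ make LLAMA_CUDA=1 CUDA_DOCKER_ARCH=sm_86 -j

# CMake build: pass the architecture list explicitly
$ cmake -B build -DLLAMA_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86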