falconindy / pkgfile

an alpm .files metadata explorer
MIT License
96 stars 20 forks source link

`pkgfile -b` is slow #62

Open wpyoga opened 2 years ago

wpyoga commented 2 years ago

A single invocation of pkgfile -b takes around 230 ms on my laptop (Ryzen 5 4650U, 40GB DDR4, 512GB NVMe SSD).

After playing around with different binary names to search, with binary names at the start of the data files, and at the end of the data files, the invocation time doesn't differ by much. Therefore I'm guessing that most of the time is spent on loading the cache files.

I would like to propose a faster storage format, maybe using a hash index, to bring down the search time.

$ time pkgfile -b abiword
extra/abiword

real    0m0.237s
user    0m0.335s
sys 0m0.036s
$ time for i in {1..10}; do pkgfile -b abiword >/dev/null; done
extra/abiword
extra/abiword
extra/abiword
extra/abiword
extra/abiword
extra/abiword
extra/abiword
extra/abiword
extra/abiword
extra/abiword

real    0m2.280s
user    0m3.185s
sys 0m0.294s

This is my installation:

NAME="Arch Linux"
PRETTY_NAME="Arch Linux"
ID=arch
BUILD_ID=rolling
ANSI_COLOR="38;2;23;147;209"
HOME_URL="https://archlinux.org/"
DOCUMENTATION_URL="https://wiki.archlinux.org/"
SUPPORT_URL="https://bbs.archlinux.org/"
BUG_REPORT_URL="https://bugs.archlinux.org/"
LOGO=archlinux-logo

These are the pkgfile cache files:

$ ls -lh
total 361M
-rw-r--r-- 1 root root 258M Sep 13 19:45 community.files
-rw-r--r-- 1 root root 8.9M Sep 13 19:45 core.files
-rw-r--r-- 1 root root  95M Sep 13 19:45 extra.files
falconindy commented 3 months ago

I should do some proper profiling here, but what you're asking for only benefits a small fraction of potential queries. Yes, a hash table could benefit queries like 'pacman' or '/usr/bin/pacman', but you're back to a full scan for anything involving globbing or regex. It does, admittedly, also help with things like #27.

An embedded DB like SQLite might work here to provide a more flexible query language with indexing capabilities. That would be generally useful, but also presents a pretty substantial rewrite of the existing code. I guess that's sort of true of any storage format change...