ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Dynamic matrix tile quantization and fine-tuning with vector approximation #4176

Closed. chadbrewbaker closed this issue 7 months ago.

chadbrewbaker commented 11 months ago

Feature Description

This paper came out a few days ago: LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning.

Quantize each matrix tile to its best precision, use a few high-precision vectors to approximate the matrix for fine-tuning, and apply further data-aware optimizations on top.
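To make the decomposition concrete, here is a minimal numpy sketch of an LQ-LoRA-style split W ≈ Q + L: alternate between quantizing the residual and refitting the low-rank part from an SVD. The round-to-nearest quantizer, rank, and iteration count are placeholder assumptions, not the paper's actual mixed-precision tile quantizer.

```python
# Minimal sketch of an LQ-LoRA-style split W ~= Q + L, assuming a crude
# symmetric round-to-nearest quantizer as a stand-in for the paper's
# mixed-precision tile quantizer.
import numpy as np

def quantize_rtn(w, bits=4):
    """Per-matrix symmetric round-to-nearest quantization (placeholder)."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    if scale == 0:
        return w
    return np.round(w / scale) * scale

def lq_decompose(w, rank=8, iters=10, bits=4):
    """Alternate: Q = quant(W - L), then L = best rank-r fit of W - Q."""
    l = np.zeros_like(w)
    for _ in range(iters):
        q = quantize_rtn(w - l, bits)
        u, s, vt = np.linalg.svd(w - q, full_matrices=False)
        l = (u[:, :rank] * s[:rank]) @ vt[:rank]
    return q, l

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
q, l = lq_decompose(w)
print("relative residual:", np.linalg.norm(w - q - l) / np.linalg.norm(w))
```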

Motivation

Push the boundary of what can be hosted in the browser and on consumer hardware. Enable data-aware tuning, using a Linux sandbox as an oracle.

Possible Implementation

- [ ] Test rig around the original Python code.
- [ ] Data structure to store quantized tiles and vector approximations (see the sketch after this list). @clattner mentioned something about "bubbles"?
- [ ] QEMU tests - JSLinux sandbox in web browser examples.
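On the tile data structure, one hypothetical shape for it (all names and fields below are illustrative, not anything in llama.cpp): each tile carries its own bit width and scale, and the high-precision low-rank vectors are stored once for the whole matrix.

```python
# Hypothetical container for per-tile quantized weights plus the shared
# low-rank correction vectors; field names are illustrative only.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class QuantTile:
    row: int            # tile origin (row) in the full matrix
    col: int            # tile origin (col)
    bits: int           # per-tile precision chosen by the quantizer
    scale: float        # dequantization scale
    data: np.ndarray    # integer codes, shape (tile, tile)

@dataclass
class LQMatrix:
    shape: tuple[int, int]
    tile: int                          # tile edge length
    tiles: list[QuantTile] = field(default_factory=list)
    u: np.ndarray | None = None        # high-precision low-rank factors:
    v: np.ndarray | None = None        # W ~= dequant(tiles) + u @ v

    def dequantize(self) -> np.ndarray:
        w = np.zeros(self.shape, dtype=np.float32)
        for t in self.tiles:
            w[t.row:t.row + self.tile,
              t.col:t.col + self.tile] = t.data.astype(np.float32) * t.scale
        if self.u is not None:
            w += self.u @ self.v
        return w
```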

Nice to have:

- [ ] Implement the @geohot bounty to compare llama.cpp with TinyGrad and diff round-off errors, so we aren't flying blind.
- [ ] Data-aware tuning: Fisher information as in the paper, tuning float intrinsics on value-bounded data, linting values that should be constant.
- [ ] WeightWatcher (Empirical Spectral Density) plots to compare models. Ideally an entire Jupyter notebook of charts to compare two models at a glance.
- [ ] Fuzzing, like Berger's COZ, to elicit bottlenecks: perturb low bits of tiles and tile precision size, add delay in operations, increase buffer sizes, lower cgroup permissions.
- [ ] Tile-size autotune benchmark and general profile-guided optimization. Probably add a Mojo target in the Makefile for comparison.
- [ ] Scheduler auto-tuning for large multicore CPUs. Embedded Linux config files to lower jitter, as on IBM Blue Gene/L.
- [ ] Zstd dictionary to compact idle data structures - the compression ratio is also useful for performance linting.
- [ ] Quadtree of high-precision tiles? Two passes might be faster: do everything in low precision, then a second sparse pass with high precision (see the sketch after this list). Perhaps use UltraFastBERT (Exponentially Faster Language Modeling) to look at sparsity tradeoffs.
- [ ] Q* pass: given a "page" of corpus, ask probing questions about it and add that feedback to the training data.
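For the two-pass quadtree item above, a minimal numpy sketch of the idea: run the whole matmul in low precision, then add a sparse correction only for the tiles with the largest quantization-error energy. The error-energy heuristic and the keep_frac parameter are my assumptions, just to show the shape of the tradeoff.

```python
# Sketch of the two-pass idea: dense low-precision matmul first, then a
# sparse high-precision correction on the worst-quantized tiles only.
# The error-energy tile-selection heuristic is an assumption.
import numpy as np

def two_pass_matmul(w, x, tile=64, keep_frac=0.1, bits=4):
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    w_lo = np.round(w / scale) * scale          # low-precision weights
    y = w_lo @ x                                # pass 1: dense, cheap
    err = w - w_lo                              # quantization error
    m, n = w.shape
    tiles = [(i, j, np.square(err[i:i+tile, j:j+tile]).sum())
             for i in range(0, m, tile) for j in range(0, n, tile)]
    tiles.sort(key=lambda t: -t[2])             # worst tiles first
    for i, j, _ in tiles[: int(len(tiles) * keep_frac)]:
        y[i:i+tile] += err[i:i+tile, j:j+tile] @ x[j:j+tile]  # pass 2: sparse
    return y

rng = np.random.default_rng(1)
w, x = rng.standard_normal((256, 256)), rng.standard_normal((256, 8))
print(np.linalg.norm(two_pass_matmul(w, x) - w @ x) / np.linalg.norm(w @ x))
```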

chadbrewbaker commented 11 months ago

LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition - it seems you can compose small low-rank fine-tunes.
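The composition itself is just a weighted sum of low-rank deltas, W = W0 + sum_i alpha_i * (B_i @ A_i); a minimal numpy sketch follows. The fixed mixing weights are placeholders - LoraHub actually learns them with a gradient-free search on a few examples of the target task.

```python
# Minimal sketch of composing several LoRA adapters into one merged
# weight matrix. The alpha mixing weights are placeholders; LoraHub
# learns them with a gradient-free search, which is omitted here.
import numpy as np

def compose_loras(w0, adapters, alphas):
    """adapters: list of (A, B) pairs with A of shape (r, n), B of (m, r)."""
    delta = sum(alpha * (b_mat @ a_mat)
                for alpha, (a_mat, b_mat) in zip(alphas, adapters))
    return w0 + delta

rng = np.random.default_rng(2)
m = n = 64
r = 4
w0 = rng.standard_normal((m, n))
adapters = [(rng.standard_normal((r, n)), rng.standard_normal((m, r)))
            for _ in range(3)]
w = compose_loras(w0, adapters, alphas=[0.5, 0.3, 0.2])
print(w.shape)
```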

github-actions[bot] commented 7 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.