Cuda performance broadcast

cmp-nct commented 1 year ago

Biggest changes:

- Added 16-bit dequantization kernels for all quantizers, both K-quants and normal (tested a ton, but likely not every case).
- Added a not-super-clean hack that splits cuBLAS off into a 16-bit cuBLAS path, saving 50% VRAM and hundreds of GB of VRAM copy operations.
  - This halves the size of the temporary cuBLAS buffers, which previously were 32-bit (up to 4 GB of VRAM saved).
- Optimized the CUDA memory pool with an "access counter" that can be reset and purged. This efficiently removes any unused buffers during evaluation.
  - Saves 0.5-1.0 GB of VRAM when using awkward tensor shapes or batched prompt processing.
- Interleaving broadcast patch (jploski), which almost eliminates the token slowdown.
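
To illustrate the "access counter" idea for the memory pool, here is a minimal host-side sketch in plain C++. It is not the code from this PR: `std::malloc`/`std::free` stand in for the CUDA allocation calls, and every name (`PoolBuffer`, `BufferPool`, `reset_counters`, `purge_unused`) is hypothetical. The assumption is that a buffer which was never handed out between a counter reset and a purge is no longer needed and can be freed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical pool entry: one cached buffer plus bookkeeping.
struct PoolBuffer {
    void*       ptr;
    std::size_t size;
    bool        in_use;
    int         accesses;  // times handed out since the last counter reset
};

class BufferPool {
public:
    BufferPool() = default;
    BufferPool(const BufferPool&) = delete;
    BufferPool& operator=(const BufferPool&) = delete;
    ~BufferPool() {
        for (auto& b : buffers_) std::free(b.ptr);  // a device pool would free device memory here
    }

    // Hand out the smallest free buffer that fits, or allocate a new one.
    void* acquire(std::size_t size) {
        assert(size > 0);
        PoolBuffer* best = nullptr;
        for (auto& b : buffers_) {
            if (!b.in_use && b.size >= size && (best == nullptr || b.size < best->size)) {
                best = &b;
            }
        }
        if (best == nullptr) {
            buffers_.push_back(PoolBuffer{std::malloc(size), size, false, 0});
            best = &buffers_.back();
        }
        best->in_use = true;
        best->accesses += 1;
        return best->ptr;
    }

    // Return a buffer to the pool; the memory stays cached for reuse.
    void release(void* ptr) {
        for (auto& b : buffers_) {
            if (b.ptr == ptr) b.in_use = false;
        }
    }

    // Start a new observation window, e.g. before evaluating a batch.
    void reset_counters() {
        for (auto& b : buffers_) b.accesses = 0;
    }

    // Free every idle buffer that was not used since the last reset.
    // Returns the number of bytes given back.
    std::size_t purge_unused() {
        std::size_t freed = 0;
        std::vector<PoolBuffer> kept;
        for (auto& b : buffers_) {
            if (!b.in_use && b.accesses == 0) {
                std::free(b.ptr);
                freed += b.size;
            } else {
                kept.push_back(b);
            }
        }
        buffers_.swap(kept);
        return freed;
    }

    // Total bytes currently held by the pool, used or idle.
    std::size_t cached_bytes() const {
        std::size_t total = 0;
        for (const auto& b : buffers_) total += b.size;
        return total;
    }

private:
    std::vector<PoolBuffer> buffers_;
};
```

In this sketch a caller would run `reset_counters()` before an evaluation pass and `purge_unused()` after it, so oversized buffers left over from an earlier, differently shaped batch are released instead of staying cached.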

Medium changes:

- During model load (it should also work when writing a quantized model, untested), the mangled "Wizard" tensor is reshaped into the correct format, allowing efficient operations.
- Redid the VRAM calculation. The 16-bit kernels and the buffer optimization save a lot of VRAM and make the calculation reliable (I think).
- For non-batched processing there is now only 50 MB of VRAM overhead.
- Moved CUDA initialization into a background thread, which speeds up loading by 100% (3 seconds) on my system.
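
The background-initialization change can be pictured with a small standard-library sketch. This is an illustration only, not the PR's implementation: `init_device_stub` and `load_model_stub` are hypothetical stand-ins (simulated with short sleeps) for the slow device setup and for reading the model file, and `std::async` is one assumed way to overlap the two.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>

// Hypothetical stand-in for slow device setup (context and handle creation).
// Returns the number of devices found; this stub always reports 1.
inline int init_device_stub() {
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    return 1;
}

// Hypothetical stand-in for reading the model from disk.
// Returns the number of tensors loaded; this stub always reports 42.
inline int load_model_stub() {
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    return 42;
}

struct LoadResult {
    int devices;
    int tensors;
};

// Start device setup on another thread, load the model on this one,
// and only wait for the device right before it is first needed.
inline LoadResult load_with_background_init() {
    std::future<int> init = std::async(std::launch::async, init_device_stub);
    int tensors = load_model_stub();  // runs while the device is initializing
    int devices = init.get();         // blocks until initialization is done
    assert(devices >= 0);
    return LoadResult{devices, tensors};
}
```

The important detail is that `init.get()` is called before any work that touches the device, so the overlap only hides the startup latency and does not change the order in which results become visible.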

Small changes:

- Added a performance metric to ggml_tensor->meta and visualized it in the graph, showing whether cuBLAS was used and which type of cuBLAS.
- Changed the reserved VRAM parameter to signed, so you can now force VRAM swapping with a negative reserved amount (this allows 4.1 bit 7B Falcon on a 24 GB card).
- Added the latest K/V optimization by @jploski.
- Reworked most printed tables; they are now mostly aligned with each other and look better.
- ggml_tensors now always populate the host RAM data pointer.
- Added a reliable routine for detecting physical CPU cores on Windows.
- Changed the automatic thread count to a high-performance selection.
- n_batch can now be larger than 512 (a bit more than 1024).
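
As a rough illustration of why a signed reserved-VRAM value matters, the sketch below computes an offload budget from free VRAM minus a reserve. It is a simplified assumption about how such a budget could be used, not the calculation in this PR; `vram_budget_bytes` and `layers_that_fit` are hypothetical names, and per-layer sizes are treated as uniform.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Budget = free VRAM minus the reserve. Because the reserve is signed,
// a negative value yields a budget larger than the memory that is actually free.
inline std::int64_t vram_budget_bytes(std::int64_t free_vram_bytes, std::int64_t reserved_mb) {
    return free_vram_bytes - reserved_mb * 1024 * 1024;
}

// Number of equally sized layers that fit into the budget (hypothetical helper).
inline int layers_that_fit(std::int64_t free_vram_bytes, std::int64_t reserved_mb,
                           std::int64_t layer_bytes, int n_layers) {
    assert(layer_bytes > 0 && n_layers >= 0);
    const std::int64_t budget = vram_budget_bytes(free_vram_bytes, reserved_mb);
    if (budget <= 0) return 0;
    return static_cast<int>(std::min<std::int64_t>(n_layers, budget / layer_bytes));
}
```

With 1000 MB free and 100 MB layers, a reserve of 200 fits 8 layers, a reserve of 0 fits 10, and a reserve of -200 asks for 12, which overcommits the card and leaves the overflow to whatever swapping the driver provides. With an unsigned parameter the negative case could not be expressed at all.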

cmp-nct commented 1 year ago

At 1000 tokens on single GPU I have these speeds now:

* 40/second for 7B
* 17/second for 40B

At around 50 tokens:

* 55/second for 7B
* 24/second for 40B (4090 using 4K quantization and squeezing it into VRAM using negative reserved config)

That is already quite respectable

jploski commented 1 year ago

> At 1000 tokens on single GPU I have these speeds now:
>
> * 40/second for 7B
> * 17/second for 40B
>
> At around 50 tokens:
>
> * 55/second for 7B
> * 24/second for 40B (4090 using 4K quantization and squeezing it into VRAM using negative reserved config)
>
> That is already quite respectable

Around here I call it deeply impressive. :-)

cmp-nct / ggllm.cpp

Cuda performance broadcast #32