@philwee tagging the python bindings you shared which should make it much easier to add ggml support:
If someone wants to work on this I'd be happy to give pointers! All that's required is a new LM subclass akin to #395.
I may take a look at working on this integration on our end in ~1 month from now, if no one else has started a PR by then.
I can try to work on this, could you give some pointers?
Of course! I’d recommend looking at the PR I linked to get a sense of what the scope might be.
The process would look something like:
- make a new file in lm_eval/models called `ggml_model.py` or similar
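For concreteness, here is a rough sketch of what such a subclass might look like. This is only an illustration, not code from any PR: it assumes the harness's `LM` base-class interface of the time (`loglikelihood`, `loglikelihood_rolling`, `greedy_until`) and llama-cpp-python's `Llama` class, and the `GGMLLM` name and its details are hypothetical.

```python
# Hypothetical sketch of lm_eval/models/ggml_model.py -- not a tested
# implementation. Assumes the harness's LM base class and the
# llama-cpp-python bindings; names and signatures may differ by version.
from lm_eval.base import LM


class GGMLLM(LM):
    def __init__(self, model_path, **kwargs):
        super().__init__()
        from llama_cpp import Llama

        # logits_all=True keeps per-token logits so continuations can be scored
        self.model = Llama(model_path=model_path, logits_all=True, **kwargs)

    def loglikelihood(self, requests):
        # requests: (context, continuation) pairs; return a list of
        # (continuation log-probability, is_greedy) tuples
        raise NotImplementedError

    def loglikelihood_rolling(self, requests):
        # rolling log-likelihood over full documents (e.g. perplexity tasks)
        raise NotImplementedError

    def greedy_until(self, requests):
        # greedy generation until a stop sequence is produced
        raise NotImplementedError
```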
Lmk if this makes sense!
Carson Poole reports:
ggml is doing the compute in int4 rather than just the weight storage; that's how it can be so much faster than a typical CPU implementation, since CPUs are more compute-bound than GPUs for GEMMs. It's also egregiously slow for long input context lengths: a very unoptimized WebGPU implementation will obliterate ggml's speed on ~500-1000 tokens of input.
So it may be worth lowering the priority on this. Of course, implementing it would enable us to better evaluate these claims 🙃
> a very unoptimized WebGPU implementation will obliterate ggml's speed on ~500-1000 tokens of input
There exists BLAS support (OpenBLAS, cuBLAS, CLBlast), which outperforms the purely SIMD-tuned code at larger batch sizes (OpenBLAS → CPU; cuBLAS and CLBlast → GPU).
The BLAS acceleration can already make a difference at single-digit batch sizes.
edit: also, since only the logits are of interest, eval can be done with very large batch sizes (even better for BLAS).
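To make that concrete, here is a hedged sketch using llama-cpp-python's high-level API (parameter names may vary across versions, and the model path is a placeholder): with `logits_all=True`, a single `eval` pass over the prompt yields logits for every position, and a larger `n_batch` is where BLAS does the heavy lifting.

```python
# Illustrative only: evaluate a whole prompt in large batches and keep the
# logits for every position. Assumes llama-cpp-python's high-level API.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/7B/ggml-model-q4_0.bin",  # placeholder path
    n_batch=512,      # feed up to 512 prompt tokens per batch (BLAS-friendly)
    logits_all=True,  # retain logits at every position, not just the last
)

tokens = llm.tokenize(b"The quick brown fox jumps over the lazy dog")
llm.eval(tokens)  # one forward pass over the prompt; logits now available
```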
I asked about this in the ggml library, and the reply at https://github.com/ggerganov/ggml/issues/120#issuecomment-1528953671 contained links to several WIP Python bindings. It looks like this one is the best starting point for us.
Personally I think this one is better (no need to call that one a "starting point").
I saw that, but per the issue at https://github.com/abetlen/llama-cpp-python/issues/71 it appears to be 5x slower than the underlying implementation.
It might be because it does not build the llama.so/.dll properly, only in one configuration, so SIMD might be disabled. There is also the fact that there is no official BLAS-enabled build available anywhere (see https://github.com/abetlen/llama-cpp-python/issues/117).
But these problems are "easy" to fix after the fact, since you can build the llama.dll yourself with the build options you like and replace the one shipped with the bindings (recommended right now).
@Green-Sky I have almost no experience with C, but if you can do that and demonstrate acceptable speed that works for me.
@StellaAthena If you want to give me a representative test prompt, I can compare llama-cpp-python to native llama.cpp. I also have both a 16-core CPU w/ 128GB of RAM and a shiny new 3090 Ti w/ 24GB if you need some test cycles.
Here are my (short-run, comparative) perplexity scores to date with the models I have on hand.
@gjmulder I haven't had the bandwidth to test it yet, but this PR supports saving the actual predictions to disk: https://github.com/EleutherAI/lm-evaluation-harness/pull/492
You can run LAMBADA, HellaSwag, and ANLI with a limit of 20. If the results end up identical, I think it's safe to assume that generalizes. Maybe throw in a math problem too.
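For reference, a smoke test along those lines might look like this via the harness's Python entry point (a sketch: task names and the `simple_evaluate` signature differ between harness versions, and the model and its args are placeholders):

```python
# Hedged sketch of a quick comparison run: only 20 documents per task.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",                            # baseline implementation
    model_args="pretrained=huggyllama/llama-7b",  # placeholder model id
    tasks=["lambada_openai", "hellaswag", "anli_r1"],
    limit=20,  # 20 documents per task, enough for an identity check
)
print(results["results"])
```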
llama-cpp-python attempts to implement the OpenAI API, so I may look at simply pointing the harness at an instance of llama-cpp-python and running a few smoke tests.
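The idea, roughly (a sketch assuming the older `openai` 0.x client and a local server started via `python -m llama_cpp.server`; the URL, key, and model name are placeholders):

```python
# Sketch: point an OpenAI-style client at a local llama-cpp-python server.
# Assumes the openai 0.x client; URL, key, and model name are placeholders.
import openai

openai.api_base = "http://localhost:8000/v1"  # local llama-cpp-python server
openai.api_key = "not-needed"                 # the local server ignores auth

resp = openai.Completion.create(
    model="ggml-model",                # placeholder model name
    prompt="The capital of France is",
    max_tokens=1,
    logprobs=5,   # the harness needs token logprobs...
    echo=True,    # ...including over the prompt itself
)
print(resp["choices"][0]["logprobs"])
```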
Sounds great!
Started adding support for a llama-cpp-python server here: https://github.com/EleutherAI/lm-evaluation-harness/pull/617
Courtesy of @matthoffner, lm-eval now supports GGML Llama models via llama-cpp-python!
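If it follows the server-based approach from #617, usage should look roughly like this (the registered model name and its arguments are assumptions on my part; check the PR for the actual interface):

```python
# Assumed usage of the new backend from PR #617; the model name "ggml" and
# the base_url argument are assumptions, so consult the PR for specifics.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="ggml",                                 # assumed registry name
    model_args="base_url=http://localhost:8000",  # llama-cpp-python server
    tasks=["hellaswag"],
    limit=20,
)
```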
Could support for ggml be added to this soon? 4-bit quantized models are said to be pretty decent, but there is currently no reliable way to test that claim.
Thank you!