EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

Support for ggml #417

Closed philwee closed 10 months ago

philwee commented 1 year ago

Could support for ggml be added to this soon? 4-bit quantized models are said to be pretty decent, but there is currently no reliable way to test this claim. It would be nice if support for it could be added to the harness.

Thank you!

jon-tow commented 1 year ago

@philwee tagging the python bindings you shared which should make it much easier to add ggml support:

https://github.com/abetlen/llama-cpp-python

haileyschoelkopf commented 1 year ago

If someone wants to work on this I’d be happy to give pointers! All that’s required is a new LM subclass akin to #395 .

I may take a look at working on this integration on our end in ~1 month from now, if no one else has started a PR by then.

philwee commented 1 year ago

I can try to work on this, could you give some pointers?

haileyschoelkopf commented 1 year ago

Of course! I’d recommend looking at the PR I linked to get a sense of what the scope might be.

The process would look something like:

- make a new file in lm_eval/models called "ggml_model.py" or similar (a rough sketch follows below)
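A very rough sketch of what such a subclass could look like, using llama-cpp-python for scoring (the base class, method names, and request format below are assumptions based on the harness's existing LM interface, and the model path is a placeholder, so treat this as a starting point rather than a working integration):

```python
# Hypothetical lm_eval/models/ggml_model.py -- a sketch, not a finished integration.
from llama_cpp import Llama

from lm_eval.base import LM  # assumed base class; see the existing model files


class GGMLLM(LM):
    def __init__(self, model_path, n_ctx=2048):
        super().__init__()
        # llama-cpp-python loads the 4-bit ggml file directly;
        # logits_all=True keeps per-token logits so prompt tokens can be scored.
        self.model = Llama(model_path=model_path, n_ctx=n_ctx, logits_all=True)

    def loglikelihood(self, requests):
        results = []
        for context, continuation in requests:
            out = self.model(
                context + continuation,
                max_tokens=1,
                echo=True,    # also return logprobs for the prompt tokens
                logprobs=1,
            )
            logprobs = out["choices"][0]["logprobs"]["token_logprobs"]
            # TODO: keep only the continuation tokens and check for a greedy match.
            total = sum(lp for lp in logprobs if lp is not None)
            results.append((total, False))
        return results

    def loglikelihood_rolling(self, requests):
        raise NotImplementedError

    def greedy_until(self, requests):
        raise NotImplementedError
```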

Lmk if this makes sense!

StellaAthena commented 1 year ago

I asked about this in the ggml library and the response contained links to several WIP Python bindings. It looks like this one is the best starting point for us.

StellaAthena commented 1 year ago

Carson Poole reports:

ggml is doing the compute in int4 rather than just the weight storage; that's how it can be so much faster than a typical CPU implementation, because CPUs are more compute-bound than GPUs for GEMMs. It's also egregiously slow for long input context lengths. A very unoptimized WebGPU implementation will obliterate ggml's speed on something like 500-1000 tokens of input.

So it may be worth lowering the priority on this. Of course, implementing it would enable us to better evaluate these claims 🙃

Green-Sky commented 1 year ago

A very unoptimized WebGPU implementation will obliterate ggml's speed on something like 500-1000 tokens of input.

There is BLAS support (OpenBLAS, cuBLAS, CLBlast), which outperforms the SIMD-tuned code at larger batch sizes (OpenBLAS -> CPU, cuBLAS and CLBlast -> GPU).

The BLAS acceleration can already make a difference at single-digit batch sizes.

edit: also, since only the logits are of interest, eval can be done with very large batch sizes (even better for BLAS); see the sketch below.
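A quick sketch of that point, assuming llama-cpp-python's high-level Llama API (the n_ctx/n_batch parameters and method names reflect my understanding of that API, and the model path is a placeholder):

```python
from llama_cpp import Llama

# A larger n_batch lets llama.cpp hand bigger GEMMs to BLAS when it is compiled in,
# which is where the speedup over the plain SIMD path shows up.
llm = Llama(
    model_path="models/7B/ggml-model-q4_0.bin",  # placeholder path
    n_ctx=2048,
    n_batch=512,       # tokens evaluated per internal batch
    logits_all=True,   # keep logits for every position; only the logits matter here
)

tokens = llm.tokenize(b"A reasonably long evaluation prompt goes here ...")
llm.reset()
llm.eval(tokens)       # one forward pass over the whole prompt, batched internally
```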

Green-Sky commented 1 year ago

I asked about this in the ggml library and the response (https://github.com/ggerganov/ggml/issues/120#issuecomment-1528953671) contained links to several WIP Python bindings. It looks like this one is the best starting point for us.

Personally I think this one is better (no need to call that one a "starting point").

StellaAthena commented 1 year ago

I asked about this in the ggml library and the response in ggerganov/ggml#120 (comment) contained links to several WIP Python bindings. It looks like this one is the best starting point for us.

Personally I think this one is better (no need to call that one a "starting point").

I saw that, but per the issue at https://github.com/abetlen/llama-cpp-python/issues/71 it appears to be 5x slower than the underlying implementation.

Green-Sky commented 1 year ago

It might be because it does not build llama.so/.dll properly, only in one configuration, so SIMD might be disabled. There is also the fact that there is no official BLAS-enabled build available anywhere (see https://github.com/abetlen/llama-cpp-python/issues/117).

Green-Sky commented 1 year ago

But these problems are "easy" to fix after the fact, since you can build the llama.dll yourself with the build options you like and replace the one shipped with the bindings (which is what I'd recommend right now).

StellaAthena commented 1 year ago

@Green-Sky I have almost no experience with C, but if you can do that and demonstrate acceptable speed that works for me.

gjmulder commented 1 year ago

@StellaAthena If you want to give me a representative test prompt I can compare llama-cpp-python to native llama.cpp. I also have both a 16 core CPU w/128GB of RAM and a shiny new 3090Ti w/24GB if you need some test cycles.

Here are my (short-run, comparative) perplexity scores to date for the models I have on hand.

StellaAthena commented 1 year ago

@gjmulder I haven't had the bandwidth to test it yet, but this PR supports saving the actual predictions to disk: https://github.com/EleutherAI/lm-evaluation-harness/pull/492

You can run Lambada, HellaSwag, and ANLI with a limit of 20. If the results end up identical, I think it's safe to assume that generalizes. Maybe throw in a math problem too.
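For concreteness, a limited comparison run through the harness's Python API could look something like this (function name, model/task names, and arguments are assumptions based on the harness at the time and may differ by version):

```python
from lm_eval import evaluator

# Hypothetical comparison run: same tasks and limit for each backend being compared.
results = evaluator.simple_evaluate(
    model="hf-causal",                            # reference Hugging Face implementation
    model_args="pretrained=huggyllama/llama-7b",  # placeholder checkpoint
    tasks=["lambada_openai", "hellaswag", "anli_r1"],
    limit=20,                                     # only 20 documents per task, as suggested above
)
print(results["results"])
```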

gjmulder commented 1 year ago

llama-cpp-python attempts to implement the OpenAI API, so I may look at simply pointing the harness at an instance of llama-cpp-python and running a few smoke tests.
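As a rough illustration of that idea (the URL and model name are placeholders; this assumes the llama-cpp-python server is already running locally and exposes an OpenAI-compatible completions endpoint, as its documentation describes):

```python
import openai

# Point the OpenAI client at a locally running llama-cpp-python server
# (started separately, e.g. via its `python -m llama_cpp.server` entry point).
openai.api_base = "http://localhost:8000/v1"     # placeholder URL
openai.api_key = "not-needed-for-a-local-server"

resp = openai.Completion.create(
    model="local-ggml-model",        # placeholder; a local server typically ignores this
    prompt="The capital of France is",
    max_tokens=5,
    logprobs=1,
    echo=True,
)
print(resp["choices"][0]["text"])
```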

StellaAthena commented 1 year ago

Sounds great!

matthoffner commented 1 year ago

Started adding support for a llama-cpp-python server here: https://github.com/EleutherAI/lm-evaluation-harness/pull/617

haileyschoelkopf commented 10 months ago

Courtesy of @matthoffner, lm-eval now supports GGML Llama models via llama-cpp-python!
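For reference, a minimal usage sketch (the registered model name and its arguments follow my reading of the PR linked above and may differ across harness versions; the URL is a placeholder for a locally running llama-cpp-python server):

```python
from lm_eval import evaluator

# Sketch: evaluate a ggml model served by llama-cpp-python's OpenAI-compatible server.
# The model name ("gguf" here) and the base_url argument are assumptions; check the
# current docs/PR for the exact spelling in your version of the harness.
results = evaluator.simple_evaluate(
    model="gguf",
    model_args="base_url=http://localhost:8000",
    tasks=["hellaswag"],
    limit=20,
)
print(results["results"])
```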