ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Feature Request: GLM-4 9B Support #7778

arch-btw commented 1 month ago

Feature Description

It would be really cool to have support for these models that were released today. They have some very impressive benchmarks. I've also been trying out the model in Hugging Face Spaces myself and noticed it speaks many languages fluently and is knowledgeable on many topics. Thank you for your time.

Here are the download links:

Here is the English README: README_en.md

Motivation

The motivation for this feature is found in the model's technical highlights. Here are some of the benchmark results:

Needle-in-a-haystack test: (figure: eval_needle)

LongBench: (figure: longbench)

Possible Implementation

We might be able to reuse some of the code from https://github.com/ggerganov/llama.cpp/pull/6999.

There is also chatglm.cpp, but it doesn't support GLM-4.

foldl commented 1 month ago

You can try chatllm.cpp, which supports GLM-4.

jamfor352 commented 1 month ago

> You can try chatllm.cpp, which supports GLM-4.

Can confirm this works and is cool 😎

It would be good to get this functionality into llama.cpp too, if only for the GPU acceleration.

ELigoP commented 4 weeks ago

> You can try chatllm.cpp, which supports GLM-4.

Well, chatllm.cpp is CPU-only. Why not try the transformers version in fp16?
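For reference, here is a minimal sketch of that transformers route in fp16. The repo id THUDM/glm-4-9b-chat and the trust_remote_code requirement are assumptions based on the release, so check the model card before running:

```python
# Minimal fp16 sketch with transformers. Assumes the chat checkpoint is
# published as THUDM/glm-4-9b-chat and ships custom modeling code
# (hence trust_remote_code=True); verify against the model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "THUDM/glm-4-9b-chat"  # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # fp16, as suggested above
    device_map="auto",          # place the weights on the available GPU(s)
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Introduce yourself in French."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```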

GPU support for GLM-4 in llama.cpp would be great; quantized versions would then appear, which would be even more convenient to run.
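Once support lands, a quantized GGUF could be run with GPU offload. A hypothetical sketch via the llama-cpp-python bindings (the model file name is invented; no official GLM-4 GGUF exists yet):

```python
# Hypothetical sketch: running a quantized GLM-4 GGUF with GPU offload
# through the llama-cpp-python bindings. The file name is invented and
# presumes GLM-4 support has been merged and a conversion exists.
from llama_cpp import Llama

llm = Llama(
    model_path="glm-4-9b-chat.Q4_K_M.gguf",  # hypothetical quantized file
    n_gpu_layers=-1,  # offload every layer to the GPU
    n_ctx=8192,       # context size to allocate
)

result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What are GLM-4's strengths?"}],
    max_tokens=128,
)
print(result["choices"][0]["message"]["content"])
```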

GLM-4 looks comparable to or better than Llama 3, maybe even best-in-class for now.

matteoserva commented 2 weeks ago

We might have this feature soon: https://github.com/ggerganov/llama.cpp/pull/8031