ggerganov / llama.cpp

LLM inference in C/C++

[ENHANCEMENT] New MPT 30B + CUDA support. #1971

Closed · casper-hansen closed this 10 months ago

casper-hansen commented 1 year ago

MosaicML released its MPT 30B version today with 8k context, with Apache 2.0 license.


Why you should support MPT 30B

Let me present my argument for why MPT should be supported, including CUDA support. LLaMa and Falcon models are arguably great on paper and in evaluations, but what they really lack is commercial licensing (in the case of LLaMa) and an actively maintained tech stack (in the case of Falcon).

Tech stack:

  1. MosaicML has 8 employees actively contributing to their own open-source repo LLM-Foundry and another few researching improvements. They recently upgraded to PyTorch 2.0 and added H100 support just before this 30B version was released.
  2. A streaming library: train and fine-tune models while streaming your dataset from S3/GCP/Azure storage. This reduces cost at training time and lets you easily resume after hardware failures.
  3. They have developed tools like Composer that let you train and fine-tune models much faster (e.g. GPT-2 for roughly $145 with Composer vs. $255 with vanilla PyTorch).

Performance:

Evaluation: On generic benchmarks, the performance of LLaMa 33B, Falcon 40B, and MPT 30B is mostly the same. Although MPT 30B is the smallest model, the gap is negligible except on HumanEval, where MPT 30B (base) scores 25%, LLaMa 33B scores 20%, and Falcon scores 1.2% (it did not generate code) in MPT's tests.

Inference speed: MPT models run inference roughly 1.5-2.0x faster than LLaMa models because of FlashAttention and low-precision LayerNorm (see the sketch below).
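For illustration, here is a minimal PyTorch sketch of the low-precision LayerNorm idea; this is not MosaicML's exact LPLayerNorm implementation, just the core trick of keeping the normalization in the activation dtype (fp16/bf16) instead of upcasting to fp32:

```python
import torch
import torch.nn.functional as F

class LowPrecisionLayerNorm(torch.nn.LayerNorm):
    """Sketch of low-precision layernorm: cast the parameters down to the
    activation dtype and normalize there, avoiding an fp32 round-trip."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        dtype = x.dtype  # e.g. torch.float16 or torch.bfloat16
        weight = self.weight.to(dtype) if self.weight is not None else None
        bias = self.bias.to(dtype) if self.bias is not None else None
        return F.layer_norm(x, self.normalized_shape, weight, bias, self.eps)

# Usage: y = LowPrecisionLayerNorm(4096)(x), where x is an fp16/bf16 activation tensor.
```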

Memory usage: The MPT 30B model fits on 1x A100-80GB at 16 bits. Falcon 40B requires 85-100GB of VRAM at 16 bits, which means it conventionally needs 2x GPUs without quantization.
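A rough back-of-the-envelope check of the weight memory alone (the KV cache and activations add more on top):

```python
# 2 bytes per parameter at 16-bit precision; KV cache / activations not included.
def fp16_weight_gb(n_params: float) -> float:
    return n_params * 2 / 1e9

print(f"MPT-30B:    ~{fp16_weight_gb(30e9):.0f} GB")   # ~60 GB -> fits on 1x A100-80GB
print(f"Falcon-40B: ~{fp16_weight_gb(40e9):.0f} GB")   # ~80 GB -> tight on a single A100-80GB
```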

Cost:

In the compute used to train the full models, LLaMa is roughly 1.44x more expensive and Falcon roughly 1.27x more expensive than MPT. This is remarkable because it means the MPT models can match the performance of more expensive models at a much lower training cost.

MPT-30B FLOPs ≈ 6 × 30e9 [params] × 1.05e12 [tokens] = 1.89e23 FLOPs
LLaMa-30B FLOPs ≈ 6 × 32.5e9 [params] × 1.4e12 [tokens] = 2.73e23 FLOPs (1.44x more)
Falcon-40B FLOPs ≈ 6 × 40e9 [params] × 1e12 [tokens] = 2.40e23 FLOPs (1.27x more)
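These numbers follow the standard 6 × parameters × tokens rule of thumb for training compute; a quick sanity check:

```python
# Training-compute estimate: FLOPs ~= 6 * N_params * N_tokens
def train_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens

mpt = train_flops(30e9, 1.05e12)      # ~1.89e23 FLOPs
llama = train_flops(32.5e9, 1.4e12)   # ~2.73e23 FLOPs
falcon = train_flops(40e9, 1e12)      # ~2.40e23 FLOPs

print(f"LLaMa-30B / MPT-30B:  {llama / mpt:.2f}x")   # ~1.44x
print(f"Falcon-40B / MPT-30B: {falcon / mpt:.2f}x")  # ~1.27x
```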

Conclusion

If the community decides to support MPT models, including CUDA acceleration, we gain the following benefits:

  1. The ability to train and fine-tune LLMs at a lower cost than LLaMa models, and to enable commercial usage with llama.cpp/ggml for inference.
  2. Faster LLMs compared to LLaMa, and even faster once quantized with CUDA support enabled.
  3. A much larger default context size (8k vs 2k), plus the ability to extend the context further using ALiBi (see the sketch after this list).
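A minimal sketch of the ALiBi bias referenced in point 3, assuming the geometric slope schedule from the ALiBi paper for a power-of-two head count (this is not llama.cpp's or MPT's exact code):

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Per-head linear distance penalty added to attention scores. Because the
    bias is a simple function of distance rather than a learned positional
    embedding, the context window can be stretched beyond the training length."""
    # Slope schedule from the ALiBi paper (assumes n_heads is a power of two).
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    pos = torch.arange(seq_len)
    rel = pos[None, :] - pos[:, None]   # j - i: 0 on the diagonal, negative in the past
    rel = rel.tril()                    # future positions are handled by the causal mask
    return slopes[:, None, None] * rel  # shape: (n_heads, seq_len, seq_len)

# Usage (per head h): scores = q @ k.T / d**0.5 + alibi_bias(n_heads, T)[h]
```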

Links

https://www.mosaicml.com/blog/mpt-30b
https://huggingface.co/mosaicml/mpt-30b

TheBloke commented 1 year ago

I'll just add my usual 2c on this subject: I would love if llama.cpp supported all major model types, bringing its hundreds of wonderful features to as many models as possible.

PS. FYI, KoboldCpp release 1.32 has now added OpenCL acceleration for MPT, as well as GPT-2 (StarCoder), GPT-J and GPTNeoX.

I tested my MPT 30B Instruct and Chat GGML uploads with it earlier and it's working pretty well - 8 tokens/s on shorter responses.

(But I'd still love llama.cpp to support this and other model types, and eventually bring CUDA and Metal acceleration to them.)

casper-hansen commented 1 year ago

> I'll just add my usual 2c on this subject: I would love if llama.cpp supported all major model types, bringing its hundreds of wonderful features to as many models as possible.

Yes, this is it! Would love to see us start with MPT as it contains quite a few features that other models also use. Supporting MPT models also means supporting Replit models since Replit chose LLM Foundry.

> 8 tokens/s on shorter responses.

Not sure what your setup looks like, but it sounds like there is lots of room for improvement if we add CUDA acceleration in llama.cpp. I remember LLaMa 33B running at 29 tokens/s on your 4090 + i9-13900K rig. My bet is that MPT 30B could run faster than that if we give it full optimization.

bratao commented 1 year ago

It is possible to test it here: https://huggingface.co/spaces/mosaicml/mpt-30b-chat

The results are impressive!

JohannesGaessler commented 1 year ago

Too much work. Maybe once I get around to writing a binary that runs an exported ggml graph using CUDA (realistically in a few months at the earliest).

slaren commented 1 year ago

> a binary that runs an exported ggml graph using CUDA

I am working on something similar, but it will be at least a few weeks until it can be merged.

JohannesGaessler commented 1 year ago

If you do it that's fine with me too.

casper-hansen commented 1 year ago

> Too much work. Maybe once I get around to writing a binary that runs an exported ggml graph using CUDA (realistically in a few months at the earliest).

A binary that runs an exported graph would be amazing. But yes, it's a large task.

My biggest wish is for repos like llama.cpp/ggml to enable optimized inference for quantized models that can be used commercially. There is not a lot of tech that can do that right now.

CyborgArmy83 commented 1 year ago

It would be amazing if MPT and Falcon support could be built in!

casper-hansen commented 1 year ago

MosaicML (MPT creators) was just acquired by Databricks for $1.3B, so I expect more initiatives for LLMs. Even more of an argument to start supporting their Foundry models.

@slaren since you said it will be ready in a few weeks, I wanted to ask you the following: do you see a path to supporting most models by exporting a graph and running it with CUDA? It would be huge to have this kind of support natively for the most popular models.

sirajperson commented 1 year ago

@casperbh96 That's crazy. I hope they don't change the policies for the MPT series.

slaren commented 1 year ago

> Do you see a path to supporting most models by exporting a graph and running it with CUDA?

That's the goal in the long run. At first, some of the operations required by non-llama models may be missing a CUDA implementation, but eventually we should add everything that is needed to support the different models.

maddes8cht commented 11 months ago

Anything happening on this? Now that the new GGUF format is well established and stable, wasn't the idea that implementing new models would be easier?

casper-hansen commented 11 months ago

Looks like the authors do not have a plan to support MPT models.

maddes8cht commented 11 months ago

There just needs to be one developer capable and interested enough to start it...

ggerganov commented 11 months ago

We now kind of have a process for adding new models to llama.cpp (see Falcon, StarCoder and Baichuan). Looking for contributions to do something similar for MPT

maddes8cht commented 10 months ago

And there is a complete stack of all the original MPT models quantized to GGUF at maddes8cht on Hugging Face, with its own MPT collection. Fine-tuned MPT models will follow next.

Galunid commented 10 months ago

closed in #3417

AlexBlack2202 commented 9 months ago

> original MPT models quantized to GGUF at

Hi,

how can you convert from MPT to GGUF?

I have an issue when running convert-hf-to-gguf.py with the latest version of gguf and torch==2.1.1, transformers==4.35.2:

"Can not map tensor 'transformer.wpe.weight'"

Looking for some help.