ggerganov / llama.cpp

LLM inference in C/C++
MIT License

What happens if we use MoE + ternary? #5870

Closed qwas982 closed 5 months ago

qwas982 commented 7 months ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

Feature Description

What happens if we use MoE + ternary?

Motivation

Mixtral 8x7B is an MoE LLM, and BitNet b1.58 is a ternary {-1, 0, 1} 1-bit LLM. What kind of spark would combining them produce?

Possible Implementation

640 gigabytes of RAM to run 2.6 trillion parameters?

Put another way, about 256 gigabytes of RAM could hold a giant model of 1 trillion parameters.

Today an MoE model of 8x7B = 56 billion parameters already achieves roughly GPT-3.5-level performance,

so with this ternary-weight technique, would 16 GB of memory be enough?
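For a rough sense of these numbers, here is a back-of-the-envelope sketch (the ~1.58 bits per weight for ternary and the parameter counts are assumptions taken from the discussion above, not measurements from any implementation):

```cpp
// Back-of-the-envelope RAM needed to store the weights alone,
// ignoring KV cache, activations, and runtime overhead.
#include <cstdio>

int main() {
    const double fp16_bits    = 16.0;  // baseline: FP16 weights
    const double ternary_bits = 1.58;  // assumed cost of BitNet b1.58-style ternary weights
    const double param_counts[] = { 56e9, 1e12, 2.6e12 }; // 8x7B total, 1T, 2.6T

    for (double p : param_counts) {
        double fp16_gb    = p * fp16_bits    / 8.0 / 1e9;
        double ternary_gb = p * ternary_bits / 8.0 / 1e9;
        std::printf("%7.0fB params: FP16 ~%6.0f GB, ternary ~%5.0f GB\n",
                    p / 1e9, fp16_gb, ternary_gb);
    }
    return 0;
}
```

By this estimate, 56B parameters at ~1.58 bits each come to roughly 11 GB, which is where the 16 GB question comes from; 1T and 2.6T parameters land near 200 GB and 520 GB respectively.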

Incredible.

It seems it is no longer a luxury to deploy a GPT-3.5-class model locally on a regular computer or mobile phone.

It doesn't matter if the tokens come out slowly; at least it can be used locally. What matters is that you could freely use AutoGPT to complete projects.

Wait, GPT-3.5 runs on Python, which means there's a lot of room for improvement:

with a llama.cpp implementation, or a Rust + WASM implementation, wouldn't native deployment performance take off?

https://arxiv.org/abs/2402.17764

FSSRepo commented 6 months ago

It could work, but the model's intelligence will be that of a 7B-sized one, as the precision reduction is too much.

paperdev-code commented 6 months ago

A prime example of what Github issues are definitely not to be used for.

sorasoras commented 6 months ago

> It could work, but the model's intelligence will be that of a 7B-sized one, as the precision reduction is too much.

The BitNet b1.58 paper states that it performs about the same as FP16 once it reaches about 3B parameters. Precision doesn't matter as much as you might think; in a transformer it only seems to matter when you quantize down from an FP16 model.
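For readers wondering what the ternary scheme actually does to the weights, here is a minimal toy sketch in the spirit of the absmean quantization described in the b1.58 paper (illustrative code, not part of llama.cpp; the function name is made up):

```cpp
// Toy absmean-style ternary quantization: scale each weight by the mean
// absolute value of the tensor, then round and clip to {-1, 0, +1}.
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

std::vector<int8_t> quantize_ternary(const std::vector<float>& w, float& gamma_out) {
    double abs_sum = 0.0;
    for (float x : w) abs_sum += std::fabs(x);
    float gamma = (float)(abs_sum / w.size()) + 1e-8f;   // per-tensor scale

    std::vector<int8_t> q(w.size());
    for (size_t i = 0; i < w.size(); ++i) {
        float v = std::round(w[i] / gamma);              // nearest integer
        if (v >  1.0f) v =  1.0f;                        // clip to the ternary set
        if (v < -1.0f) v = -1.0f;
        q[i] = (int8_t)v;
    }
    gamma_out = gamma;  // dequantized weight ~ q[i] * gamma
    return q;
}

int main() {
    std::vector<float> w = {0.31f, -0.02f, -0.75f, 0.18f, 0.60f};
    float gamma;
    std::vector<int8_t> q = quantize_ternary(w, gamma);
    std::printf("gamma = %.3f, q =", gamma);
    for (int8_t v : q) std::printf(" %d", v);
    std::printf("\n");
}
```

Note that the paper's quality claims are about models trained with this constraint from the start, not about applying such rounding to an existing FP16 checkpoint.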

@qwas982 the point of MoE is just to run token generation much faster. It would simply be faster.
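To make the "MoE is just faster" point concrete, a rough sketch of the arithmetic (all sizes below are illustrative assumptions, not Mixtral's exact dimensions):

```cpp
// Why MoE speeds up token generation: memory must hold every expert,
// but each token only runs the router's top-k experts, so per-token
// compute scales with the "active" parameters, not the total.
#include <cstdio>

int main() {
    const double shared_params = 2e9;    // assumed attention/embedding params used by every token
    const double expert_params = 6.75e9; // assumed feed-forward params per expert
    const int    num_experts   = 8;
    const int    top_k         = 2;      // experts evaluated per token

    double total_params  = shared_params + num_experts * expert_params;
    double active_params = shared_params + top_k * expert_params;

    std::printf("total:  %.1fB params (what RAM must hold)\n", total_params / 1e9);
    std::printf("active: %.1fB params per token (what compute must touch)\n", active_params / 1e9);
    return 0;
}
```

So ternary weights mainly shrink the memory side, while MoE routing mainly cuts per-token compute; combining them would attack the two costs separately.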

qwas982 commented 6 months ago

> It could work, but the model's intelligence will be that of a 7B-sized one, as the precision reduction is too much.

> The BitNet b1.58 paper states that it performs about the same as FP16 once it reaches about 3B parameters. Precision doesn't matter as much as you might think; in a transformer it only seems to matter when you quantize down from an FP16 model.

> @qwas982 the point of MoE is just to run token generation much faster. It would simply be faster.

So that's how it is. I get it.

Thank you.

What about the ternary weights?

qwas982 commented 6 months ago

> It could work, but the model's intelligence will be that of a 7B-sized one, as the precision reduction is too much.

But the paper says there's no reduction in intelligence. https://arxiv.org/abs/2402.17764

johnwick123f commented 6 months ago

@qwas982 yeah, that's why BitNet is pretty impressive. BitNet does not convert an existing model like Mixtral or LLaMA into a 1.58-bit model, since that would cause huge quality losses no matter what.

With BitNet, you have to train the model from scratch, which will take months and many GPUs as well. Much less than normal training, but still a lot.

Also, GPT-3.5 most likely does not run in plain Python, and it's also not open source.

GPT-3.5 is a huge model and would be incredibly slow if it just ran on Python. However, if you use ChatGPT, it's usually around 20 tokens per second, which is incredibly fast for such a large model, so they are probably using C++.

github-actions[bot] commented 5 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.