FMInference / FlexLLMGen

Running large language models on a single GPU for throughput-oriented scenarios.
Apache License 2.0

Is LLaMa supported? #60

Open NightMachinery opened 1 year ago

NightMachinery commented 1 year ago

Facebook recently released some smaller LLMs that are trained on more data, and are competitive in many areas with bigger models. Are they supported?

https://research.facebook.com/publications/llama-open-and-efficient-foundation-language-models/

Titaniumtown commented 1 year ago

They're not even open source and you can't even access them. I don't expect support for a while, if ever. Edit: this has changed

Ying1123 commented 1 year ago

Not supported yet. But we think it is a model worth trying because it is smaller than OPT but performs better. @keroro824 is interested in this as well and wants to give it a try.

neuhaus commented 1 year ago

The first download links have been distributed by Facebook earlier today. I've already seen a torrent on twitter - oh my!

Titaniumtown commented 1 year ago

> The first download links have been distributed by Facebook earlier today. I've already seen a torrent on twitter - oh my!

Wish they simply uploaded it to huggingface :(

Titaniumtown commented 1 year ago

> The first download links have been distributed by Facebook earlier today. I've already seen a torrent on twitter - oh my!

LLaMA is really good, trying it right now. Hope support is added!

ekiwi111 commented 1 year ago

> The first download links have been distributed by Facebook earlier today. I've already seen a torrent on twitter - oh my!
>
> LLaMA is really good, trying it right now. Hope support is added!

Which model are you using - 7B, 13B, 33B or 65B?

jtang613 commented 1 year ago

The 7B will run on a single GPU, but the other models require multiple. The LLaMA sample code also really wants a lot of VRAM (16GB seems to be the bare minimum).

Titaniumtown commented 1 year ago

> The first download links have been distributed by Facebook earlier today. I've already seen a torrent on twitter - oh my!
>
> LLaMA is really good, trying it right now. Hope support is added!

Which model do you use - 7B, 13B, 33B or 65B?

How many GPUs?

Ph0rk0z commented 1 year ago

Will see what the 7B does today. https://t.co/QwqDmpdftO

OedoSoldier commented 1 year ago

Just tried 7B, and it ate 22G of my 4090, wow.

Looking forward to FlexGen implementation!

glenneroo commented 1 year ago

To reduce VRAM usage, you can change `max_batch_size` in the `ModelArgs` call in e.g. `llama/example.py`: `ModelArgs(max_seq_len=1024, max_batch_size=32, **params)`

These are my test results:

| max_batch_size | VRAM usage |
| --- | --- |
| 1 | 13.5GB |
| 2 | 14GB |
| 4 | 15GB |
| 8 | 17GB |
| 16 | 21.5GB |
| 32 | OOM on 3090 |

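For reference, a minimal sketch of that change, assuming `example.py` in the facebookresearch/llama repo builds `ModelArgs` roughly as in the snippet quoted above (the import path and the batch size chosen here are assumptions, not a verbatim patch):

```python
from llama.model import ModelArgs  # layout assumed from the facebookresearch/llama repo

# ...inside example.py, after params.json has been read into `params`:
model_args = ModelArgs(
    max_seq_len=1024,   # shorter context window -> smaller preallocated KV cache
    max_batch_size=1,   # per the table above, batch size 1 fits in roughly 13.5GB
    **params,           # remaining fields come from the checkpoint's params.json
)
```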
neuhaus commented 1 year ago

The Llama-INT8 fork works with the 13B model on a single GPU with approx. 18GB of VRAM.

jtang613 commented 1 year ago

I'm seeing good results from LLaMA 7B. It seems to produce results comparable to OPT-30b, but noticeably quicker. I've also tried the llama-int8 fork and while it does work, I'm having a hard time getting it (13B int8) to produce better results than the 7B (float).

Ph0rk0z commented 1 year ago

I am SOL because of bitsandbytes' not-so-great Pascal support. So FlexGen is my only hope for 30B and 13B so far.

IME, when trying to chat with these models, they sort of fall apart after a bunch of messages. Coherence goes from 100% down to nothing.

MarkSchmidty commented 1 year ago

> The 7B will run on a single GPU, but the other models require multiple. The LLaMA sample code also really wants a lot of VRAM (16GB seems to be the bare minimum).

  • 7B: 1 GPU
  • 13B: 2 GPU
  • 30B: 4 GPU
  • 65B: 8 GPU

13B is running on one 3090 with int8 here: https://github.com/oobabooga/text-generation-webui/issues/147

Using int8, VRAM usage is reduced to:

  • LLaMA-13B: 16249MiB
  • LLaMA-7B: 9225MiB

and this is before reducing batch size.
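For anyone who wants to try int8 outside the webui, here is a minimal sketch using Hugging Face Transformers with bitsandbytes, assuming a HF-format LLaMA checkpoint at a hypothetical local path and a transformers build that already supports LLaMA:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "path/to/llama-13b-hf"  # hypothetical path to a converted HF checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    device_map="auto",   # place layers on the available GPU(s)
    load_in_8bit=True,   # bitsandbytes int8 weights, in the same ballpark as the ~16GB figure above
)

inputs = tokenizer("The meaning of life is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```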

MarkSchmidty commented 1 year ago

LLaMA-int8 implementation: https://github.com/tloen/llama-int8

LLaMA CPU implementation: https://github.com/markasoftware/llama-cpu

LLaMA torrent download Magnet link: https://github.com/facebookresearch/llama/pull/73/files

jtang613 commented 1 year ago

The LLaMA CPU implementation seems broken. Has anyone else had any success with it? I've tried it on a Xeon with 128GB and an AMD 7900X with 64GB. It crashes with `RuntimeError: Invalid scalar type` each time.

Ph0rk0z commented 1 year ago

Convert to HF format.

practical-dreamer commented 1 year ago

Once LLaMA is converted to HF format, can it then be converted to numpy format just like OPT?
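This is not FlexGen's actual converter (which, as of this thread, only handles OPT), but an illustrative sketch of the idea: once the weights are in HF format, each tensor can be dumped to its own .npy file, which is roughly the kind of on-disk layout a FlexGen-style weight cache uses. The paths and the per-parameter naming scheme here are assumptions.

```python
import os
import numpy as np
import torch
from transformers import AutoModelForCausalLM

def dump_hf_weights_to_numpy(model_path: str, out_dir: str) -> None:
    """Write one .npy file per parameter tensor of a HF-format checkpoint."""
    model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16)
    os.makedirs(out_dir, exist_ok=True)
    for name, param in model.named_parameters():
        np.save(os.path.join(out_dir, f"{name}.npy"), param.detach().cpu().numpy())

# dump_hf_weights_to_numpy("path/to/llama-7b-hf", "./llama-7b-np")  # hypothetical paths
```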

Ph0rk0z commented 1 year ago

I think if it was that easy we would have support for a lot of models right now.

MarkSchmidty commented 1 year ago

This project appears to have 4-bit working for LLaMA: https://github.com/qwopqwop200/GPTQ-for-LLaMa

May be helpful here.

xiscoding commented 1 year ago

You need to log into Hugging Face for this, and the format isn't right for text-generation-webui, unfortunately... I didn't get it to actually run though, so there is probably a workaround.

lxe commented 1 year ago

The pre-quantized 4bit llama is working without flexgen but I think perf suffers a bunch. Wonder if flexgen with 8-bit mode is better/faster? Looks like it still doesn't support the llama model yet.

jtang613 commented 1 year ago

> Looks like it still doesn't support the llama model yet.

Given it seems to be an academic project, I wouldn't expect a lot of active development or maintenance going forward unless an enthusiastic group forks. They're probably busy on the paper/publish/grant part now.

MarkSchmidty commented 1 year ago

> The pre-quantized 4bit llama is working without flexgen but I think perf suffers a bunch. Wonder if flexgen with 8-bit mode is better/faster? Looks like it still doesn't support the llama model yet.

This depends on your hardware. Ada hardware (4xxx) gets higher inference speeds in 4bit than either 16bit or 8bit. The LLaMA GPTQ 4bit implementation in particular is about 20x faster on a 4090 than running in 16bit using the HuggingFace Transformers library and about 9x faster than running in 8bit using bitsandbytes.
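Those ratios will obviously vary with GPU and software stack. A rough, hedged sketch of how one could measure just the fp16-vs-int8 part of the comparison with plain Transformers (the GPTQ 4-bit path would need the GPTQ-for-LLaMa code itself); the model path is hypothetical:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "path/to/llama-7b-hf"  # hypothetical HF-format checkpoint

def tokens_per_second(load_in_8bit: bool, new_tokens: int = 128) -> float:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_PATH,
        device_map="auto",
        load_in_8bit=load_in_8bit,
        torch_dtype=None if load_in_8bit else torch.float16,
    )
    inputs = tokenizer("The quick brown fox", return_tensors="pt").to(model.device)
    torch.cuda.synchronize()
    start = time.time()
    # force exactly `new_tokens` tokens so the timing is comparable between runs
    model.generate(**inputs, max_new_tokens=new_tokens, min_new_tokens=new_tokens, do_sample=False)
    torch.cuda.synchronize()
    return new_tokens / (time.time() - start)

print("fp16:", tokens_per_second(load_in_8bit=False), "tok/s")
print("int8:", tokens_per_second(load_in_8bit=True), "tok/s")
```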

chenqy4933 commented 1 year ago

> This depends on your hardware. Ada hardware (4xxx) gets higher inference speeds in 4bit than either 16bit or 8bit. The LLaMA GPTQ 4bit implementation in particular is about 20x faster on a 4090 than running in 16bit using the HuggingFace Transformers library and about 9x faster than running in 8bit using bitsandbytes.

@MarkSchmidty The LLaMA GPTQ 4-bit implementation seems fairly naive performance-wise: it just bit-shifts and multiplies. However, the Tensor Cores in Turing and Ampere support INT4 in hardware. What about an INT4 matmul implemented with Tensor Cores - would it be faster?

yonatanilanbioz commented 1 year ago

Does anyone have any new information on using FlexGen with LLaMA-based models? Maybe it can work without too many changes to the code?

viktor-ferenczi commented 1 year ago

Llama 2 is out. It also has a very high-quality fine-tune, jondurbin/airoboros-l2-70b-gpt4-1.4.1, which would be great to use for high-throughput background processing.