NightMachinery opened this issue 1 year ago
They're not even open source and you can't even access it. I don't expect support for a while, if ever. Edit: this has changed
Not supported yet, but we think it is a model worth trying because it is smaller than OPT yet performs better. @keroro824 is interested in this as well and wants to give it a try.
The first download links were distributed by Facebook earlier today. I've already seen a torrent on Twitter - oh my!
Wish they had simply uploaded it to Hugging Face :(
LLaMA is really good, trying it right now. Hope support is added!
Which model are you using - 7B, 13B, 33B or 65B?
The 7B will run on a single GPU, but the other models require multiple GPUs. The LLaMA sample code also really wants a lot of VRAM (16GB seems to be the bare minimum).
How many GPUs?
Will see what the 7B does today. https://t.co/QwqDmpdftO
Just tried 7B, and it ate 22GB of my 4090, wow.
Looking forward to FlexGen implementation!
To reduce VRAM usage, you can change `max_batch_size` in the following call in e.g. `llama/example.py`:
`ModelArgs(max_seq_len=1024, max_batch_size=32, **params)`
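For concreteness, here is a minimal sketch of that change, assuming the layout of the reference facebookresearch/llama repo; the checkpoint path is a placeholder and the surrounding code in `example.py` may differ on your checkout:

```python
# Minimal sketch, not the exact upstream example.py: build ModelArgs with a
# smaller max_batch_size so less KV-cache memory is preallocated per layer.
import json
from pathlib import Path

from llama import ModelArgs  # exported by the reference llama package

ckpt_dir = Path("weights/7B")  # hypothetical path to the downloaded checkpoint
params = json.loads((ckpt_dir / "params.json").read_text())

model_args = ModelArgs(
    max_seq_len=1024,
    max_batch_size=1,  # down from 32; trades batch throughput for lower VRAM
    **params,
)
```

With the stock sample code, `max_batch_size` only needs to be at least the number of prompts you pass in a single `generate` call, so 1-2 is enough for interactive use.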
These are my test results:

| max_batch_size | VRAM usage |
|---|---|
| 1 | 13.5GB |
| 2 | 14GB |
| 4 | 15GB |
| 8 | 17GB |
| 16 | 21.5GB |
| 32 | OOM on 3090 |
The Llama-INT8 fork works with the 13B model on a single GPU with approx. 18GB of VRAM.
I'm seeing good results from LLaMA 7B. It seems to produce results comparable to OPT-30B, but noticeably quicker. I've also tried the llama-int8 fork, and while it does work, I'm having a hard time getting it (13B int8) to produce better results than the 7B (float).
I am SOL because of bitsandbytes' not-so-great Pascal support, so FlexGen is my only hope for 30B and 13B so far.
IME, when trying to chat with these models, they sort of fall apart after a bunch of messages. Coherence goes from 100% down to nothing.
> The 7B will run on a single GPU, but the other models require multiple GPUs.
- 7B: 1 GPU
- 13B: 2 GPU
- 30B: 4 GPU
- 65B: 8 GPU
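Those GPU counts line up with the number of model-parallel shards in each download; here is a small hypothetical helper (paths are placeholders) to check what a given checkpoint directory expects:

```python
# Sketch: the reference code runs one process per model-parallel shard, and the
# shards ship as consolidated.*.pth files inside each model's checkpoint folder.
from pathlib import Path

def required_gpus(ckpt_dir: str) -> int:
    """Count the consolidated.*.pth shards in a LLaMA checkpoint directory."""
    return len(list(Path(ckpt_dir).glob("consolidated.*.pth")))

# e.g. the stock 13B download ships consolidated.00.pth and consolidated.01.pth
print(required_gpus("weights/13B"))  # expected: 2
```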
13B is running on one 3090 with int8 here: https://github.com/oobabooga/text-generation-webui/issues/147
Using int8, VRAM usage is reduced to:

- LLaMA-13B: 16249MiB
- LLaMA-7B: 9225MiB

and this is before reducing the batch size.
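For reference, the int8 path outside those projects looks roughly like the sketch below. It assumes a LLaMA checkpoint already converted to HF format, a transformers build that includes LLaMA, and bitsandbytes/accelerate installed, so treat it as an illustration rather than the exact code behind the numbers above:

```python
# Rough sketch of 8-bit loading via transformers + bitsandbytes (assumes an
# HF-format LLaMA checkpoint at a hypothetical local path).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "path/to/llama-13b-hf"  # hypothetical converted checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    load_in_8bit=True,   # quantize linear layers to int8 at load time
    device_map="auto",   # let accelerate place layers on the available GPU(s)
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```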
LLaMA-int8 implementation: https://github.com/tloen/llama-int8
LLaMA CPU implementation: https://github.com/markasoftware/llama-cpu
LLaMA torrent download Magnet link: https://github.com/facebookresearch/llama/pull/73/files
The LLaMA CPU implementation seems broken. Has anyone else had any success with it? I've tried it on a Xeon with 128GB and an AMD 7900X with 64GB. It crashes with `RuntimeError: Invalid scalar type` each time.
Convert to HF format.
Once LLaMA is converted to HF format, can it then be converted to NumPy format just like OPT?
I think if it were that easy, we would have support for a lot of models right now.
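Dumping the weights themselves is the easy half. A hypothetical sketch of what "NumPy format" could look like once an HF-format checkpoint exists is below (one .npy file per tensor, which is roughly how FlexGen stores OPT weights on disk); FlexGen would still need model-specific config and policy code to actually consume these, which is presumably the hard part:

```python
# Hypothetical sketch: dump every parameter of an HF-format checkpoint to an
# individual .npy file named after the parameter. This mirrors the general
# per-tensor layout used for OPT, not an officially supported LLaMA path.
import os

import numpy as np
import torch
from transformers import AutoModelForCausalLM

def dump_to_numpy(hf_model_dir: str, out_dir: str) -> None:
    os.makedirs(out_dir, exist_ok=True)
    model = AutoModelForCausalLM.from_pretrained(hf_model_dir, torch_dtype=torch.float16)
    for name, param in model.named_parameters():
        # np.save appends ".npy", so each tensor lands in its own file.
        np.save(os.path.join(out_dir, name), param.detach().cpu().numpy())

# dump_to_numpy("path/to/llama-7b-hf", "path/to/llama-7b-np")  # hypothetical paths
```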
This project appears to have 4-bit working for LLaMA: https://github.com/qwopqwop200/GPTQ-for-LLaMa
May be helpful here.
You need to log into Hugging Face for this, and the format isn't right for text-generation-webui, unfortunately... I didn't get it to actually run, though, so there is probably a workaround.
The pre-quantized 4-bit LLaMA is working without FlexGen, but I think performance suffers a bunch. I wonder if FlexGen with 8-bit mode would be better/faster? Looks like it still doesn't support the LLaMA model yet.
Given that it seems to be an academic project, I wouldn't expect a lot of active development or maintenance going forward unless an enthusiastic group forks it. They're probably busy with the paper/publish/grant part now.
> The pre-quantized 4-bit LLaMA is working without FlexGen, but I think performance suffers a bunch.
This depends on your hardware. Ada hardware (4xxx) gets higher inference speeds in 4-bit than in either 16-bit or 8-bit. The LLaMA GPTQ 4-bit implementation in particular is about 20x faster on a 4090 than running in 16-bit with the Hugging Face Transformers library, and about 9x faster than running in 8-bit with bitsandbytes.
@MarkSchmidty The LLaMA GPTQ 4-bit implementation seems fairly naive performance-wise: it just bit-shifts and multiplies. However, the Tensor Cores in Turing and Ampere support INT4 in hardware. What about an INT4 matmul implemented with Tensor Cores - would it be faster?
Does anyone have any new information on using FlexGen with LLaMA-based models? Maybe it could work without too many changes to the code?
Llama 2 is out. It also has a very high-quality fine-tune, jondurbin/airoboros-l2-70b-gpt4-1.4.1, which would be great to use for high-throughput background processing.
Facebook recently released some smaller LLMs that are trained on more data and are competitive with bigger models in many areas. Are they supported?
https://research.facebook.com/publications/llama-open-and-efficient-foundation-language-models/