Closed mitchellgordon95 closed 9 months ago
Hi Mitchell!
Currently, this is expected, but we are aware of the issues, and we plan to solve the issues that can be resolved in future releases.
To summarize the issues:
I have not done precise benchmarks, but if I distributed a weight of 1.0 for all these issues in terms of which one slows the system down the most, this would be my guess: (1) 10%, (2) 20%, (3) 60%, (4) 10%.
In other words, the most effective fix would be a custom kernel for fp16 matmul, followed by an fp16 matmul done in a second stream, followed by a better CUDA kernel for outlier extraction, and then hardware issues (not solvable).
Thanks Tim! Looking forward to future releases. Feel free to close or leave open, whichever seems more appropriate.
Hi @mitchellgordon95 !
Thanks for your interest in the feature 💪
Just out of curiosity, and if you have time, could you try running your benchmark with model = AutoModelForCausalLM.from_pretrained("/mnt/models", device_map="auto", load_in_8bit=True, int8_threshold=0)? I think you may observe latency similar to the fp16 model, but I am not sure.
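For readers on newer transformers releases, the same experiment can be sketched with a BitsAndBytesConfig. This assumes the threshold is exposed there as llm_int8_threshold (the name used later in this thread) with a default of 6.0. The helper names int8_config_kwargs and load_int8_model are hypothetical, and "/mnt/models" is simply the path from the comment above. As far as I understand it, a threshold of 0 skips the outlier-extraction step, which is why it is a useful comparison point.

```python
def int8_config_kwargs(threshold: float = 6.0) -> dict:
    """Keyword arguments for an 8-bit quantization config.

    Assumption: the installed transformers version accepts
    `llm_int8_threshold` on BitsAndBytesConfig.
    """
    return {"load_in_8bit": True, "llm_int8_threshold": threshold}


def load_int8_model(model_path: str, threshold: float = 6.0):
    """Load a causal LM in 8-bit with the given outlier threshold.

    Imports are local so the helper above stays usable without a GPU
    or without transformers installed.
    """
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    config = BitsAndBytesConfig(**int8_config_kwargs(threshold))
    return AutoModelForCausalLM.from_pretrained(
        model_path,
        device_map="auto",
        quantization_config=config,
    )


if __name__ == "__main__":
    # Only prints the settings; call load_int8_model("/mnt/models", 0.0)
    # on a machine with a GPU and bitsandbytes installed.
    print(int8_config_kwargs(0.0))
```

Running the benchmark once with the default threshold and once with 0.0 would reproduce the comparison requested here.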
Hi Younes!
That did decrease the latency, but it is still around 6.1s which is still almost double the latency without int8.
That is very good to know! Thank you very much @mitchellgordon95 🙏
I would have expected it to be faster for GPT-J. But that is great feedback, and this will then be one of my cornerstone models for benchmarking. Thank you, Mitchell!
We analyzed the use case and found issues that we could partially resolve, speeding up smaller models by 2x. Please give the newest release, 0.32.0, another try. You should still see some slowness but it should be much improved already.
The slowness was not related to what we were thinking. It stems from the small amount of compute done during token-by-token inference compared to how much overhead there is. The main overhead came from the bias computation, which was fused in the PyTorch case but not in bitsandbytes. We fixed this issue in the most recent release.
Another source of slowness was retrieving a pointer from PyTorch storage that is needed for CUDA functions.
Further sources are as follows:
torch.zeros(...) instead of torch.empty().
Fixing these other sources of slowness will happen over the next weeks and should give another 2x acceleration for small models.
Good to know. You are doing a great job. So is it now faster or slower than fp16 in the GPT-J case?
I will try it myself in a few days. So far I could not get T5 working with this.
Thanks for the update, Tim!
I'm now seeing around 3.1s without quantization, 9.3s with load_in_8bit=True, and 5.7s with load_in_8bit=True, int8_threshold=0. So definitely better, but still room for improvement. (Compare with 12s / 6.1s previously.)
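To put the numbers in this comment side by side, here is a small arithmetic sketch. The latencies are the ones quoted above; the dictionary keys and the helper name slowdown_vs are my own labels, not anything from bitsandbytes.

```python
# Latencies in seconds, as reported in the comment above.
current_s = {"fp16": 3.1, "int8": 9.3, "int8_threshold_0": 5.7}
# Earlier int8 measurements quoted for comparison.
previous_s = {"int8": 12.0, "int8_threshold_0": 6.1}


def slowdown_vs(baseline: str, latencies: dict) -> dict:
    """Ratio of each latency to the baseline latency, rounded to 2 places."""
    base = latencies[baseline]
    return {name: round(value / base, 2) for name, value in latencies.items()}


# How much slower each int8 setting is than unquantized inference.
slowdown = slowdown_vs("fp16", current_s)

# How much each int8 setting improved between the two measurements.
improvement = {
    name: round(previous_s[name] / current_s[name], 2) for name in previous_s
}

if __name__ == "__main__":
    print("slowdown vs fp16:", slowdown)
    print("improvement over previous run:", improvement)
```

On these figures, int8 with the default threshold is still about three times slower than unquantized inference, and most of the remaining gap closes only when the threshold is set to 0.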
Thank you, Mitchell! The new performance data looks good and will help us to calibrate. We will keep you updated as we make progress. We are currently planning to support older GPUs and then improve performance. So likely, it will take some time for the next performance improvements to trickle in, but it is on our roadmap.
For me it takes around 250 seconds to generate 1000 words on an RTX 3090 when using 8-bit without int8_threshold=0. With int8_threshold=0, the generation time is 88 seconds. For a 500-word sequence, it takes 53 seconds without int8_threshold=0 and 22 seconds with it.
So in general int8_threshold=0 makes it 2-3 times faster. Memory usage is around 8-9GB.
It is awesome you made this. Chinese GLM even works on 4 bits.
https://github.com/THUDM/GLM-130B
It seems to be the best language model so far.
This problem seems to still exist?
Hi, we recently tested the CodeGen 2B model with DJL and DeepSpeed as the backend engine. With the latest version of bitsandbytes (0.40+) and CUDA 11.x on a 20GB A100 MIG instance, the DeepSpeed FLOPs profiler produced the following logs:
8bit:
INFO PyProcess [1,0]<stdout>:fwd flops per GPU: 263.13 M
INFO PyProcess [1,0]<stdout>:fwd flops of model = fwd flops per GPU * mp_size: 263.13 M
INFO PyProcess [1,0]<stdout>:fwd latency: 122.43 ms
INFO PyProcess [1,0]<stdout>:fwd FLOPS per GPU = fwd flops per GPU / fwd latency: 2.15 GFLOPS
16bit:
INFO PyProcess [1,0]<stdout>:fwd MACs per GPU: 2.65 GMACs
INFO PyProcess [1,0]<stdout>:fwd flops per GPU: 5.3 G
INFO PyProcess [1,0]<stdout>:fwd flops of model = fwd flops per GPU * mp_size: 5.3 G
INFO PyProcess [1,0]<stdout>:fwd latency: 75.68 ms
INFO PyProcess [1,0]<stdout>:fwd FLOPS per GPU = fwd flops per GPU / fwd latency: 69.98 GFLOPS
Do let us know if there is anything we can do to help or debug this further.
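The profiler lines above can be cross-checked with a few lines of arithmetic. This is only a sketch that recomputes "flops / latency" from the rounded values in the logs; the function name achieved_gflops is hypothetical. Note that the 8-bit run reports roughly 20x fewer forward FLOPs than the 16-bit run, which may mean the profiler is not counting the int8 matmuls. That is a guess, not something the logs confirm.

```python
def achieved_gflops(flops: float, latency_ms: float) -> float:
    """Throughput in GFLOPS given a FLOP count and a latency in milliseconds."""
    return flops / (latency_ms / 1000.0) / 1e9


# Values copied from the profiler output above (already rounded there).
int8_flops, int8_latency_ms = 263.13e6, 122.43
fp16_flops, fp16_latency_ms = 5.3e9, 75.68

int8_gflops = achieved_gflops(int8_flops, int8_latency_ms)
fp16_gflops = achieved_gflops(fp16_flops, fp16_latency_ms)

# Wall-clock comparison, which does not depend on how FLOPs are counted.
latency_ratio = int8_latency_ms / fp16_latency_ms
# Ratio of FLOPs the profiler attributed to each run.
counted_flops_ratio = fp16_flops / int8_flops

if __name__ == "__main__":
    print(f"int8: {int8_gflops:.2f} GFLOPS, fp16: {fp16_gflops:.2f} GFLOPS")
    print(f"int8 latency is {latency_ratio:.2f}x the fp16 latency")
    print(f"fp16 run counted {counted_flops_ratio:.1f}x more FLOPs")
```

Because the FLOP counts differ so much, the latency ratio is probably the more reliable figure to compare across the two runs.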
This problem seems to still exist?
Continued attention
This problem seems to still exist when I test Llama.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
I still observe this issue when loading heegyu/TinyLlama-augesc-context model from huggingface in 4-bit and 8-bit. Average inference time over 100 runs: (similar results irrespective of the llm_int8_threshold)
Code to reproduce:
import time

import pandas as pd
import torch
from datasets import load_dataset
from sklearn.metrics import accuracy_score
from transformers import AutoModelForSequenceClassification, AutoTokenizer, BitsAndBytesConfig

dataset = load_dataset("heegyu/augesc")

label_map = {
    "Question": 0,
    "Restatement or Paraphrasing": 1,
    "Reflection of feelings": 2,
    "Self-disclosure": 3,
    "Affirmation and Reassurance": 4,
    "Providing Suggestions": 5,
    "Information": 6,
    "Others": 7,
}

# Collect every utterance that carries a strategy label.
x = []
y_true = []
for sample in dataset['test']:
    for row in sample['dialog']:
        text = row['text']
        label = row['strategy']
        if label is not None:
            x.append(text)
            y_true.append(label_map[label])

x = x[0:1000]
y_true = y_true[0:1000]

model_id = "heegyu/TinyLlama-augesc-context"
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, quantization_config=bnb_config)

# Model size in MB: bytes held by parameters plus buffers.
param_size = 0
for param in model.parameters():
    param_size += param.nelement() * param.element_size()
buffer_size = 0
for buffer in model.buffers():
    buffer_size += buffer.nelement() * buffer.element_size()
model_size = (param_size + buffer_size) / 1024**2
print('Base Model size: {:.3f}MB'.format(model_size))

# Time the forward pass for each input and record the predicted label.
y_pred = []
times = []
for current_x, current_y in zip(x, y_true):
    inputs = tokenizer(current_x, return_tensors="pt").to("cuda")
    start_time = time.time()
    logits = model(**inputs).logits.softmax(-1)
    end_time = time.time()
    label = logits.argmax(-1).item()
    y_pred.append(label)
    times.append(end_time - start_time)

print(accuracy_score(y_true, y_pred))
print(pd.Series(times).describe().T)
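One caveat when timing GPU code like this: CUDA kernels are launched asynchronously, so a wall-clock reading taken right after the forward call can stop before the GPU has finished unless something forces a synchronization. Below is a minimal, framework-agnostic timing sketch with warmup and an optional sync hook. The helper name time_calls and its parameters are hypothetical, and I have not checked whether using it changes the numbers reported in this thread.

```python
import statistics
import time
from typing import Callable, Optional


def time_calls(
    fn: Callable[[], object],
    *,
    warmup: int = 3,
    iters: int = 20,
    sync: Optional[Callable[[], None]] = None,
) -> dict:
    """Time repeated calls to fn and return summary statistics in seconds.

    sync, if given, is called before starting and after finishing each
    measurement so that pending asynchronous work is included.
    """

    def run_once() -> float:
        if sync is not None:
            sync()  # drain work queued before the measurement starts
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for the work launched by fn to finish
        return time.perf_counter() - start

    # Warmup runs are executed but not recorded.
    for _ in range(warmup):
        run_once()

    samples = [run_once() for _ in range(iters)]
    return {
        "n": len(samples),
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "min": min(samples),
        "max": max(samples),
    }


if __name__ == "__main__":
    # CPU-only stand-in workload so the sketch runs anywhere.
    print(time_calls(lambda: sum(range(10_000)), warmup=2, iters=5))
```

With PyTorch on a GPU, one would pass sync=torch.cuda.synchronize and wrap the forward pass, for example fn=lambda: model(**inputs), to compare the fp16, 8-bit, and 4-bit variants on equal terms.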
Things seem to be working as intended! I went from using GPT-J-6B with
to
With nvidia-smi reporting a decrease in GPU memory consumption from ~15 GB to ~9GB. Very nice!
However, I don't think we can use this in production, because the latency of text generation increases from ~3.5s to ~12s to generate 45 output tokens. I'm using something like:
Is this increase in latency known / expected? Or is it specific to my system? For reference, my reproducing Dockerfile is:
with requirements.txt being