bitsandbytes-foundation / bitsandbytes

Accessible large language models via k-bit quantization for PyTorch.
https://huggingface.co/docs/bitsandbytes/main/en/index
MIT License

Memory Decreases! But Latency Increases.... #6

Closed: mitchellgordon95 closed this issue 9 months ago

mitchellgordon95 commented 2 years ago

Things seem to be working as intended! I went from using GPT-J-6B with

model = AutoModelForCausalLM.from_pretrained("/mnt/models",torch_dtype=torch.float16,low_cpu_mem_usage=True).to(torch.device("cuda",0))

to

model = AutoModelForCausalLM.from_pretrained("/mnt/models",device_map="auto",load_in_8bit=True)

nvidia-smi reports a decrease in GPU memory consumption from ~15 GB to ~9 GB. Very nice!

However, I don't think we can use this in production, because the latency of text generation increases from ~3.5s to ~12s to generate 45 output tokens. I'm using something like:

output_ids = self.model.generate(
    input_ids.cuda(),
    max_length=45,
    do_sample=True,
    top_p=request.get("top_p", 1.0),
    top_k=request.get("top_k", 50),
   ...
)
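
A minimal sketch of how such per-request latency can be measured (the timing helper here is hypothetical; the torch.cuda.synchronize() calls ensure asynchronous CUDA kernels are included in the measurement):

import time
import torch

def time_generate(model, input_ids, **gen_kwargs):
    # Wall-clock time for a single generate() call.
    torch.cuda.synchronize()
    start = time.time()
    output_ids = model.generate(input_ids.cuda(), **gen_kwargs)
    torch.cuda.synchronize()
    return output_ids, time.time() - start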

Is this increase in latency known / expected? Or is it specific to my system? For reference, my reproducing Dockerfile is:

FROM nvidia/cuda:11.3.0-devel-ubuntu20.04

ARG DEBIAN_FRONTEND=noninteractive

ENV APP_HOME /app
WORKDIR $APP_HOME

# NVIDIA rotated their GPG keys, so we have to remove the old ones to do apt-get update
RUN rm /etc/apt/sources.list.d/cuda.list
RUN rm /etc/apt/sources.list.d/nvidia-ml.list
RUN apt-get update && apt-get install -y build-essential wget vim git

# Note: we need curl for the liveness probe
RUN apt-get install --yes curl

# Install miniconda
ENV CONDA_DIR /opt/conda
RUN wget --quiet https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh && \
     /bin/bash ~/miniconda.sh -b -p /opt/conda
ENV PATH=$CONDA_DIR/bin:$PATH

# Install conda dependencies.
RUN conda install python=3.8
RUN conda install pytorch=1.12.1 cudatoolkit=11.3 -c pytorch

# Install pip deps
COPY requirements.txt ./
RUN pip install --no-cache-dir -r ./requirements.txt

# Copy local code to container image
COPY *.py ./

CMD ["python", "model.py"]

with requirements.txt being

kserve==0.9.0
git+https://github.com/huggingface/transformers.git@4a51075a96d2049f368b5f3dd6c0e9f08f599b62
accelerate==0.12.0
bitsandbytes==0.31.8
TimDettmers commented 2 years ago

Hi Mitchell!

Currently, this is expected, but we are aware of these issues and plan to address the ones that can be resolved in future releases.

To summarize the issues:

  1. For the release of a memory-efficient implementation, I needed to quickly roll a CUDA kernel for outlier extraction from matrices in a special format (COL4_4R2_8C and COL32_2R_4R4, aka colTuring and colAmpere). This CUDA kernel is currently not very efficient.
  2. The fp16 matrix multiplication used in conjunction with the Int8 matmul currently runs in the same CUDA stream. This makes processing sequential even though the multiplications are independent.
  3. The fp16 matrix multiplication kernel might not be fully optimized for the extreme matrix sizes used in the outlier multiplication. A custom kernel would be lightning fast, but would require some work.
  4. Overall, int8 matrix multiplication is not very fast for small models. This is because it is difficult to saturate the GPU cores with int8 elements, so for small models int8 is only about as fast as fp16, yet it adds quantization overhead that slows overall inference down. Raw speedups for a 6B model would be maybe 20-40%. I am not sure about inference, though, since the overhead is more complex and depends on many factors (sequence length, batch size, etc.).

I have not done precise benchmarks, but if I distributed a weight of 1.0 across these issues according to how much each slows the system down, my guess would be: (1) 10%, (2) 20%, (3) 60%, (4) 10%.

In other words, the most effective fix would be a custom kernel for the fp16 matmul, followed by running the fp16 matmul in a second stream, followed by a better CUDA kernel for outlier extraction, and then hardware limitations (not solvable).
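
To make the decomposition behind these numbers concrete, here is a rough pure-PyTorch sketch of the LLM.int8()-style mixed-precision matmul (illustrative only: the function name, the default threshold, and the fp32 emulation of int8 accumulation are assumptions, and the real bitsandbytes kernels operate on the special tile formats mentioned in point 1):

import torch

def mixed_int8_matmul_sketch(x_fp16, w_fp16, threshold=6.0):
    # Columns of x whose absolute magnitude exceeds the threshold are treated
    # as outliers and multiplied in fp16; the rest is absmax-quantized to int8,
    # multiplied, and dequantized.
    outlier_cols = x_fp16.abs().amax(dim=0) > threshold
    x_out, w_out = x_fp16[:, outlier_cols], w_fp16[outlier_cols, :]
    x_reg, w_reg = x_fp16[:, ~outlier_cols].float(), w_fp16[~outlier_cols, :].float()

    # Row-wise scales for activations, column-wise scales for weights.
    x_scale = (x_reg.abs().amax(dim=1, keepdim=True) / 127.0).clamp_min(1e-8)
    w_scale = (w_reg.abs().amax(dim=0, keepdim=True) / 127.0).clamp_min(1e-8)
    x_i8 = (x_reg / x_scale).round().clamp(-127, 127)
    w_i8 = (w_reg / w_scale).round().clamp(-127, 127)

    # int8 accumulation is emulated in fp32 here; the real kernels use int32.
    reg_part = (x_i8 @ w_i8) * (x_scale * w_scale)

    # The few outlier columns go through a plain fp16 matmul, then recombine.
    return reg_part.to(torch.float16) + x_out @ w_out

During token-by-token decoding x has only one row per step, which is part of why the overheads in points (2)-(4) show up so strongly in per-token latency.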

mitchellgordon95 commented 2 years ago

Thanks Tim! Looking forward to future releases. Feel free to close or leave open, whichever seems more appropriate.

younesbelkada commented 2 years ago

Hi @mitchellgordon95 ! Thanks for your interest in the feature 💪 Just out of curiosity, and if you have time, could you try running your benchmark with model = AutoModelForCausalLM.from_pretrained("/mnt/models", device_map="auto", load_in_8bit=True, int8_threshold=0)? I think you may observe latency similar to the fp16 model, but I am not sure.
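
For readers on newer transformers releases: the int8_threshold kwarg has since moved into BitsAndBytesConfig as llm_int8_threshold (the name used later in this thread), so the equivalent call would look roughly like the sketch below (assuming a recent transformers version):

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# A threshold of 0.0 turns off the mixed-precision outlier decomposition,
# which is what the int8_threshold=0 suggestion above does.
bnb_config = BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=0.0)
model = AutoModelForCausalLM.from_pretrained(
    "/mnt/models",
    device_map="auto",
    quantization_config=bnb_config,
)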

mitchellgordon95 commented 2 years ago

Hi Younes!

That did decrease the latency, but it is still around 6.1s, which is almost double the latency without int8.

younesbelkada commented 2 years ago

That is very good to know! Thank you very much @mitchellgordon95 🙏

TimDettmers commented 2 years ago

I would have expected it to be faster for GPT-J. But that is great feedback, and this will then be one of my cornerstone models for benchmarking. Thank you, Mitchell!

TimDettmers commented 2 years ago

We analyzed the use case and found issues that we could partially resolve, speeding up smaller models by 2x. Please give the newest release, 0.32.0, another try. You should still see some slowness but it should be much improved already.

The slowness was not related to what we had been thinking; it stems from how little compute is done during token-by-token inference compared to the amount of overhead. The main overhead came from the bias computation, which is fused in the PyTorch case but was not fused in bitsandbytes. We fixed this issue in the most recent release.
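
To illustrate what fused versus unfused bias means here (a sketch only, not the actual bitsandbytes code; the shapes are made up for a single decoded token):

import torch

x = torch.randn(1, 4096, dtype=torch.float16, device="cuda")   # one token's hidden state
w = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")
b = torch.randn(4096, dtype=torch.float16, device="cuda")

# Fused: the bias add happens inside the same kernel as the matmul.
y_fused = torch.addmm(b, x, w)

# Unfused: a separate elementwise kernel just for the bias add; with only one
# token of compute per step, that extra launch is a noticeable share of latency.
y_unfused = x @ w + b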

Another source of slowness was retrieving a pointer from PyTorch storage that is needed for CUDA functions.

Further sources are as follows:

Fixing these other sources of slowness will happen over the next few weeks and should give another 2x acceleration for small models.

Oxi84 commented 2 years ago

Good to know. You are doing a great job. So is it now faster or slower than fp16 in the GPT-J case?

I will try it myself in a few days. So far I could not get T5 working with this.

mitchellgordon95 commented 2 years ago

Thanks for the update, Tim!

I'm now seeing around 3.1s without quantization, 9.3s with load_in_8bit=True, and 5.7s with load_in_8bit=True,int8_threshold=0. So definitely better, but still room for improvement. (Compare with 12s / 6.1s previously.)

TimDettmers commented 2 years ago

Thank you, Mitchell! The new performance data looks good and will help us to calibrate. We will keep you updated as we make progress. We are currently planning to support older GPUs and then improve performance. So likely, it will take some time for the next performance improvements to trickle in, but it is on our roadmap.

Oxi84 commented 2 years ago

For me it takes around 250 seconds to generate 1000 words on an RTX 3090 when using 8-bit without int8_threshold=0. With int8_threshold=0, the generation time is 88 seconds. For a 500-word sequence, it takes 53 seconds without int8_threshold=0 and 22 seconds with it.

So in general int8_threshold=0 makes it 2-3 times faster. Memory usage is around 8-9 GB.

Oxi84 commented 2 years ago

It is awesome that you made this. Chinese GLM even works in 4-bit.

https://github.com/THUDM/GLM-130B

It seems to be the best language model so far.

Hukongtao commented 1 year ago

This problem seems to still exist?

kd303 commented 1 year ago

Hi, we recently tested the CodeGen 2B model with DJL and DeepSpeed as the backend engine. With the latest version of bitsandbytes (0.40+), CUDA 11.x, on a 20 GB A100 MIG instance, the DeepSpeed FLOPs profiler showed the following logs:

8bit:
INFO  PyProcess [1,0]<stdout>:fwd flops per GPU: 263.13 M
INFO  PyProcess [1,0]<stdout>:fwd flops of model = fwd flops per GPU * mp_size: 263.13 M
INFO  PyProcess [1,0]<stdout>:fwd latency: 122.43 ms
INFO  PyProcess [1,0]<stdout>:fwd FLOPS per GPU = fwd flops per GPU / fwd latency: 2.15 GFLOPS

16bit:

INFO  PyProcess [1,0]<stdout>:fwd MACs per GPU: 2.65 GMACs
INFO  PyProcess [1,0]<stdout>:fwd flops per GPU:  5.3 G
INFO  PyProcess [1,0]<stdout>:fwd flops of model = fwd flops per GPU * mp_size: 5.3 G
INFO  PyProcess [1,0]<stdout>:fwd latency: 75.68 ms
INFO  PyProcess [1,0]<stdout>:fwd FLOPS per GPU = fwd flops per GPU / fwd latency: 69.98 GFLOPS

Do let us know if there is anything we can do to help or debug this further.
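
If it helps with debugging, here is a minimal sketch of collecting the same kind of numbers with DeepSpeed's standalone flops profiler (the model and inputs are placeholders; method names follow the DeepSpeed flops profiler documentation, so please verify against your DeepSpeed version):

import torch
from deepspeed.profiling.flops_profiler import FlopsProfiler

def profile_one_forward(model, inputs):
    # Run a single forward pass under the profiler and report FLOPs/latency,
    # similar to the fwd flops / fwd latency lines quoted above.
    prof = FlopsProfiler(model)
    prof.start_profile()
    with torch.no_grad():
        model(**inputs)
    prof.stop_profile()
    flops = prof.get_total_flops()
    latency = prof.get_total_duration()
    prof.print_model_profile(profile_step=1)
    prof.end_profile()
    return flops, latency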

datalee commented 1 year ago

This problem seems to still exist?

datalee commented 1 year ago

Still following this.

shiqingzhangCSU commented 1 year ago

This problem seems to still exist when I test Llama.

github-actions[bot] commented 9 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

adityaemmanuel commented 6 months ago

I still observe this issue when loading the heegyu/TinyLlama-augesc-context model from Hugging Face in 4-bit and 8-bit. Average inference time over 100 runs (similar results irrespective of llm_int8_threshold):

  1. Base Model - 0.04s
  2. 4-bit Model - 0.09s
  3. 8-bit Model - 0.12s

Code to reproduce:

from sklearn.metrics import accuracy_score
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, BitsAndBytesConfig
import time
import pandas as pd
import torch

dataset = load_dataset("heegyu/augesc")

label_map = {
    "Question": 0,
    "Restatement or Paraphrasing": 1,
    "Reflection of feelings": 2,
    "Self-disclosure": 3,
    "Affirmation and Reassurance": 4,
    "Providing Suggestions": 5,
    "Information": 6,
    "Others": 7,
}

x = []
y_true = []
for sample in dataset['test']:
    for row in sample['dialog']:
        text = row['text']
        label = row['strategy']

        if label != None:
            x.append(text)
            y_true.append(label_map[label])

x = x[0:1000]
y_true = y_true[0:1000]
model_id = "heegyu/TinyLlama-augesc-context"

bnb_config = BitsAndBytesConfig(load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, quantization_config=bnb_config)

param_size = 0
for param in model.parameters():
    param_size += param.nelement() * param.element_size()
buffer_size = 0
for buffer in model.buffers():
    buffer_size += buffer.nelement() * buffer.element_size()

model_size = (param_size + buffer_size) / 1024**2
print('Base Model size: {:.3f}MB'.format(model_size))

y_pred = []
times = []
for current_x, current_y in zip(x, y_true):
    inputs = tokenizer(current_x, return_tensors="pt").to("cuda")
    start_time = time.time()
    logits = model(**inputs).logits.softmax(-1)
    end_time = time.time()
    label = logits.argmax(-1).item()
    y_pred.append(label)
    times.append(end_time - start_time)

print(accuracy_score(y_true, y_pred))
print(pd.Series(times).describe().T)