AutoGPTQ / AutoGPTQ

An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
MIT License

Is This Inference Speed Slow? #130

Closed xdevfaheem closed 1 year ago

xdevfaheem commented 1 year ago

@PanQiWei @TheBloke

So here is my script for inference:

import torch
from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM
from huggingface_hub import hf_hub_download
from transformers import GenerationConfig
import time

#model_path = hf_hub_download(repo_id="TheBloke/WizardLM-Uncensored-Falcon-7B-GPTQ", filename="gptq_model-4bit-64g.safetensors")

# Download the model from HF and store it locally, then reference its location here:
#quantized_model_dir = model_path

tokenizer = AutoTokenizer.from_pretrained(
    "TheBloke/WizardLM-Uncensored-Falcon-7B-GPTQ",
    use_fast=False
)
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/WizardLM-Uncensored-Falcon-7B-GPTQ",
    use_triton=False,
    use_safetensors=True,
    device="cuda:0",
    trust_remote_code=True,
    max_memory={i: "13 GIB" for i in range(torch.cuda.device_count())}
)

#pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer, device_map="auto")

prompt = "Write a story about alpaca"
prompt_template = f"### Instruction: {prompt}\n### Response:"

start = time.time()
tokens = tokenizer(prompt_template, return_tensors="pt").to(model.device)
gen_config = GenerationConfig(max_new_tokens=256, temperature=0.3, top_k=35, top_p=0.90, pad_token_id=tokenizer.eos_token_id)
output = model.generate(inputs=tokens.input_ids, generation_config=gen_config)
print(tokenizer.decode(output[0]))

delay = time.time()
total_time = (delay - start)
time_per_token = total_time / 256

# Calculate tokens per second
tokens_per_second = 256 / total_time

# Print the results (time.time() measures seconds, so all times are in seconds)
print("Total inference time: {:.2f} s".format(total_time))
print("Number of tokens generated: {}".format(256))
print("Time per token: {:.2f} s/token".format(time_per_token))
print("Tokens per second: {:.2f} token/s".format(tokens_per_second))

This is what the output is:

### Instruction: Write a story about alpaca
### Response:Once upon a time, in a small village nestled in the mountains, there lived a young girl named Maya. Maya was known for her love of animals, especially alpacas. She spent most of her days tending to the village's small herd of alpacas, helping to groom and feed them.
One day, Maya received a letter in the mail from a far-off land. The letter was from a group of scientists who were studying alpacas and had discovered something amazing. They had found a way to use alpaca wool to create a new type of fabric that was both warm and waterproof.
Maya was thrilled at the prospect of using her beloved alpacas to help the world. She immediately set out to learn more about the new fabric and how it could be used. She spent months studying and experimenting, and eventually, she came up with a plan to create a line of clothing made entirely from the new fabric.
With the help of her friends and family, Maya began to weave the fabric into clothing, scarves, and even blankets. The clothing was not only warm and waterproof, but it was also incredibly soft and comfortable. Maya's designs were a hit, and soon her clothing was being sold all over the world.
Maya's success

Total inference time: 210.78 s
Number of tokens generated: 256
Time per token: 0.82 s/token
Tokens per second: 1.21 token/s

I think 1 token per second is too low for GPTQ on GPU. Or is this normal? Is there anything I should adjust to increase the inference speed?

xdevfaheem commented 1 year ago

Can somebody help me with this? I think GGML with the BLAS backend is much faster than GPTQ.

TheBloke commented 1 year ago

The Falcon models are currently very slow with GPTQ. For example on an H100, which is the fastest GPU available, I get this:

Total inference time: 37.49 s
Number of tokens generated: 256
Time per token: 0.15 s/token
Tokens per second: 6.83 token/s

By comparison, a Llama 7B model would give 45 tokens/s on this system, or with a faster CPU I would get 100+ tokens/s.

If you want speed, don't use Falcon at the moment.

Can somebody help me with this? I think GGML with the BLAS backend is much faster than GPTQ.

There is no GGML support for Falcon yet.

GPTQ outperforms GGML by about 2x in situations where there is enough VRAM to load the model. GGML + CUDA acceleration is faster than GPTQ in situations where you don't have enough VRAM to load the model, and need to also use CPU RAM. That is not the case here.

xdevfaheem commented 1 year ago

Is there any other model that you created with AutoGPTQ? Any!

And did you use this script for inference? If not, can you share your inference script?

TheBloke commented 1 year ago

I used your script exactly

I have like 40+ GPTQ models on my Hugging Face page. All of them should work with AutoGPTQ.

A model doesn't need to be created with AutoGPTQ to work with AutoGPTQ. It is also compatible with models made with GPTQ-for-LLaMa.

Soon I will start making all models with AutoGPTQ. But you can use AutoGPTQ with all GPTQ models, don't worry about what made it. If you find a model that doesn't work, ping me about it.

TheBloke commented 1 year ago

Try one of these:

https://huggingface.co/TheBloke/WizardLM-7B-uncensored-GPTQ
https://huggingface.co/TheBloke/Wizard-Vicuna-7B-Uncensored-GPTQ
https://huggingface.co/TheBloke/guanaco-7B-GPTQ

xdevfaheem commented 1 year ago

Sure! Let me try one of your GPTQ models and test its inference time. And thank you very much for helping me throughout this day without hesitation. Really, thanks man!

TheBloke commented 1 year ago

You're welcome.

Please don't expect huge numbers on Colab, especially if it's free Colab with a T4 GPU. Those are pretty slow.

xdevfaheem commented 1 year ago

Ahh... Again

Code:

import torch
from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM
from huggingface_hub import hf_hub_download
from transformers import GenerationConfig

#model_path = hf_hub_download(repo_id="TheBloke/WizardLM-Uncensored-Falcon-7B-GPTQ", filename="gptq_model-4bit-64g.safetensors")

# Download the model from HF and store it locally, then reference its location here:
#quantized_model_dir = model_path

tokenizer = AutoTokenizer.from_pretrained(
    "TheBloke/Wizard-Vicuna-7B-Uncensored-GPTQ",
    use_fast=False
)
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Wizard-Vicuna-7B-Uncensored-GPTQ",
    use_triton=False,
    use_safetensors=True,
    device="cuda:0",
    trust_remote_code=True,
    max_memory={i: "13 GIB" for i in range(torch.cuda.device_count())}
)

#pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer, device_map="auto")

Got this error:

Downloading (…)okenizer_config.json: 100%
727/727 [00:00<00:00, 43.4kB/s]
Downloading tokenizer.model: 100%
500k/500k [00:00<00:00, 7.31MB/s]
Downloading (…)cial_tokens_map.json: 100%
435/435 [00:00<00:00, 17.2kB/s]
Downloading (…)lve/main/config.json: 100%
582/582 [00:00<00:00, 36.8kB/s]
Downloading (…)quantize_config.json: 100%
124/124 [00:00<00:00, 6.76kB/s]
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in <cell line: 19>:19                                                                            │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/auto_gptq/modeling/auto.py:82 in from_quantized          │
│                                                                                                  │
│    79 │   │   model_type = check_and_get_model_type(save_dir or model_name_or_path, trust_remo   │
│    80 │   │   quant_func = GPTQ_CAUSAL_LM_MODEL_MAP[model_type].from_quantized                   │
│    81 │   │   keywords = {key: kwargs[key] for key in signature(quant_func).parameters if key    │
│ ❱  82 │   │   return quant_func(                                                                 │
│    83 │   │   │   model_name_or_path=model_name_or_path,                                         │
│    84 │   │   │   save_dir=save_dir,                                                             │
│    85 │   │   │   device_map=device_map,                                                         │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/auto_gptq/modeling/_base.py:698 in from_quantized        │
│                                                                                                  │
│   695 │   │   │   │   │   break                                                                  │
│   696 │   │                                                                                      │
│   697 │   │   if resolved_archive_file is None: # Could not find a model file to use             │
│ ❱ 698 │   │   │   raise FileNotFoundError(f"Could not find model in {model_name_or_path}")       │
│   699 │   │                                                                                      │
│   700 │   │   model_save_name = resolved_archive_file                                            │
│   701                                                                                            │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
FileNotFoundError: Could not find model in TheBloke/Wizard-Vicuna-7B-Uncensored-GPTQ
TheBloke commented 1 year ago

Oh, one thing to mention. With all those models you need to pass model_basename to AutoGPTQForCausalLM.from_quantized()

Look at the name of the safetensors file and use all of that name except .safetensors.

For example, for WizardLM set

model_basename='WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order'
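
So the full call would look roughly like this (a sketch based on the loading code above):

model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/WizardLM-7B-uncensored-GPTQ",
    model_basename="WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order",
    use_safetensors=True,
    device="cuda:0"
)
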
xdevfaheem commented 1 year ago

Lemme Try Again...

xdevfaheem commented 1 year ago

Code:

prompt = "Write a story about alpaca"
prompt_template = f"### Instruction: {prompt}\n### Response:"

import time
start = time.time()
tokens = tokenizer(prompt_template, return_tensors="pt").to("cuda:0")
gen_config = GenerationConfig(max_new_tokens=256, temperature=0.3, top_k=35, top_p=0.90, pad_token_id=tokenizer.eos_token_id)
output = model.generate(inputs=tokens.input_ids, generation_config=gen_config)
print(tokenizer.decode(output[0]))

delay = time.time()
total_time = (delay - start)
time_per_token = total_time / 256

# Calculate tokens per second
tokens_per_second = 256 / total_time

# Print the results
print("Total inference time: {:.2f} ms".format(total_time))
print("Number of tokens generated: {}".format(256))
print("Time per token: {:.2f} ms/token".format(time_per_token))
print("Tokens per second: {:.2f} token/s".format(tokens_per_second))

Inference Time:

### Instruction: Write a story about alpaca
### Response: Once upon a time, there was a herd of alpacas living in the Andes Mountains. They lived in a beautiful valley surrounded by snow-capped peaks. The alpacas were happy and content, grazing on the lush green grass and enjoying the fresh mountain air.

One day, a group of tourists arrived in the valley. They were amazed by the alpacas and their beautiful fleece. The tourists were so impressed that they decided to start a business selling alpaca products.

The alpacas were initially hesitant about the newcomers, but they soon warmed up to them. The tourists taught the alpacas how to wear hats and sweaters, and the alpacas were thrilled to be a part of the fashion industry.

The alpacas became famous, and people from all over the world came to visit them. They were even featured in magazines and on TV shows. The alpacas were so happy to be a part of something bigger than themselves.

As time went on, the alpacas grew older and their fleece became finer. They
Total inference time: 196.43 s
Number of tokens generated: 256
Time per token: 0.77 s/token
Tokens per second: 1.30 token/s

Still getting the same inference time. No improvement.

Maybe I should get back to GGML with BLAS. :(

xdevfaheem commented 1 year ago

No luck :(

TheBloke commented 1 year ago

What system are you running this on? Google Colab?

Speed that slow would indicate it's not using the GPU at all. I think you still have AutoGPTQ installation problems.

Please show output of:

which nvcc
nvcc --version
pip freeze | grep gptq

Also, run your code again but set max_new_tokens=1024 and min_new_tokens=1024

and while it is running, please check GPU usage with this command:

nvidia-smi --query-gpu=timestamp,name,driver_version,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv -l 1

See if it is actually using the GPU at all.

PanQiWei commented 1 year ago

max_memory={i: "13 GIB" for i in range(torch.cuda.device_count())}

@TheFaheem There should be no space between 13 and GIB
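
i.e. something like:

max_memory={i: "13GIB" for i in range(torch.cuda.device_count())}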

xdevfaheem commented 1 year ago

Let me Try...

xdevfaheem commented 1 year ago

which nvcc
nvcc --version
pip freeze | grep gptq

I'm on free Colab with a T4.

The outputs:

!which nvcc

/usr/local/cuda/bin/nvcc

!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

pip freeze | grep gptq

auto-gptq @ https://github.com/PanQiWei/AutoGPTQ/releases/download/v0.2.1/auto_gptq-0.2.1+cu118-cp310-cp310-linux_x86_64.whl#sha256=cb763c29e3ffd2bd3b548fcd42c2d0c4b2314f4b62469560f9232ca96eba3d12

Also, when loading the model, I got these warnings:

WARNING:auto_gptq.nn_modules.qlinear_old:CUDA extension not installed.
WARNING:accelerate.utils.modeling:The safetensors archive passed at /root/.cache/huggingface/hub/models--TheBloke--WizardLM-7B-uncensored-GPTQ/snapshots/cc635a081c838a1e50cbd290dd08dd561ad7edf7/WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors does not contain metadata. Make sure to save your model with the `save_pretrained` method. Defaulting to 'pt' metadata.
WARNING:auto_gptq.nn_modules.fused_llama_mlp:skip module injection for FusedLlamaMLPForQuantizedModel not support integrate without triton yet.
xdevfaheem commented 1 year ago

Also, I can't run this command while another cell is running, because it's Colab.

nvidia-smi --query-gpu=timestamp,name,driver_version,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv -l 1

But it's utilising some GPU too... (screenshot: Screenshot from 2023-06-03 09-39-20)

xdevfaheem commented 1 year ago

And for 1024 tokens, I got this...

### Response:Once upon a time, in a small village in the Andes Mountains, there lived a young girl named Carmen. She was fascinated by the beautiful alpacas that roamed the hillsides and spent her days observing them. One day, while Carmen was out tending to her family's farm, she saw a group of travelers passing through the village. They were carrying with them a precious cargo - a baby alpaca.

Carmen was thrilled at the sight of the baby alpaca and begged the travelers to let her take care of it. They agreed, and Carmen became the proud owner of the little alpaca. She named him Nacho and spent her days taking care of him, learning everything she could about him. She learned how to feed him, how to groom him, and even how to ride him.

As Nacho grew older, he became a valuable asset to Carmen's family. They would often take him to the nearby markets to sell his wool, which was highly prized for its softness and quality. Carmen became known as the girl who looked after the beautiful alpaca, and people would often stop her in the streets to ask about him.

One day, a group of travelers came through the village and Carmen was given the opportunity to show off Nacho. They were impressed by his beauty and size, and offered to buy him from her. Carmen was torn between her love for Nacho and her desire to make a profit from him. But in the end, she decided to keep him and continue to care for him.

Years went by, and Carmen grew old. On her deathbed, she made a promise to Nacho that she would always be with him, watching over him. And so, she died, but her spirit lived on in Nacho. The alpaca became a symbol of hope and love, and people would often come to visit him and hear the story of Carmen.

And so, the story of Carmen and Nacho the alpaca lived on, a testament to the enduring bond between a girl and her beloved animal.

Total inference time: 349.68 s
Number of tokens generated: 1024
Time per token: 0.34 s/token
Tokens per second: 2.93 token/s
xdevfaheem commented 1 year ago

@PanQiWei @TheBloke

xdevfaheem commented 1 year ago

On the other hand, GGML with the BLAS backend is way faster than GPTQ.

Code:

from llama_cpp import Llama
from huggingface_hub import hf_hub_download

lcpp_llm = None
model_path = hf_hub_download(repo_id="TheBloke/Wizard-Vicuna-7B-Uncensored-GGML", filename="Wizard-Vicuna-7B-Uncensored.ggmlv3.q5_1.bin")
lcpp_llm = Llama(model_path=model_path, n_threads=2, n_batch=512, n_gpu_layers=32)
import time
prompt = "Write a story about alpaca"
prompt_template = f"### Instruction: {prompt}\n### Response:"
max_token = 256
start = time.time()

response = lcpp_llm(prompt=prompt_template, max_tokens=max_token, temperature=0.5)

delay = time.time()
total_time = (delay - start)
time_per_token = total_time / max_token

# Calculate tokens per second
tokens_per_second = max_token / total_time

# Print the results
print("Total inference time: {:.2f} ms".format(total_time))
print("Number of tokens generated: {}".format(max_token))
print("Time per token: {:.2f} ms/token".format(time_per_token))
print("Tokens per second: {:.2f} token/s".format(tokens_per_second))

Inference Time:

Total inference time: 81.08 s
Number of tokens generated: 256
Time per token: 0.32 s/token
Tokens per second: 3.16 token/s

I want GPTQ to be faster than this. :(

PanQiWei commented 1 year ago

WARNING:auto_gptq.nn_modules.qlinear_old:CUDA extension not installed.

It seems auto-gptq's CUDA extension is not installed properly; that's why inference is so slow for you. Maybe you should try to re-install auto-gptq and see if there is any improvement.
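
For example, a clean reinstall from source might look roughly like this (using the same repo as the wheel above):

pip uninstall -y auto-gptq
git clone https://github.com/PanQiWei/AutoGPTQ
cd AutoGPTQ
pip install .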

TheBloke commented 1 year ago

Yeah that's what I was looking for. That is why it's so slow @TheFaheem

Now please run this command:

python3 -c 'import torch ; import autogptq_cuda'

and show the full output

xdevfaheem commented 1 year ago
python3 -c 'import torch ; import autogptq_cuda'

Here it is

Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /usr/local/lib/python3.10/dist-packages/autogptq_cuda.cpython-310-x86_64-linux-gnu.so)
xdevfaheem commented 1 year ago

But after installing from source, here are some lines of the log:

Building wheels for collected packages: auto-gptq
  Building wheel for auto-gptq (setup.py) ... done
  Created wheel for auto-gptq: filename=auto_gptq-0.2.1-cp310-cp310-linux_x86_64.whl size=3651770 sha256=9ec79e13097b5dca6ffb53105b2d9a89d967fa7f61550d6edee881e487e36bb8
  Stored in directory: /tmp/pip-ephem-wheel-cache-jxylb8g2/wheels/8c/41/f4/48ea4848ab4977e74d11a4abbc2c42745c5b1d33f931e8cadf
Successfully built auto-gptq
Installing collected packages: auto-gptq
  Attempting uninstall: auto-gptq
    Found existing installation: auto-gptq 0.2.1+cu118
    Uninstalling auto-gptq-0.2.1+cu118:
      Successfully uninstalled auto-gptq-0.2.1+cu118
Successfully installed auto-gptq-0.2.1

After this, I ran the command python3 -c 'import torch ; import autogptq_cuda' again.

It showed no output and ran fine.

TheBloke commented 1 year ago

Yes, I was just writing instructions on a manual install.

Now please try the inference test again.

xdevfaheem commented 1 year ago

Huge Improvement. Insane!

</s> ### Instruction: What are Falcons?
### Response:Falcons are a type of bird of prey, commonly found in the wild and also kept as pets. They are known for their sharp talons and beaks, which they use to catch and kill prey. Falcons are typically small to medium-sized birds, with a wingspan of up to 5 feet. They have a streamlined body shape and long, pointed wings that allow them to fly at high speeds and make sudden turns. Falcons are found all over the world, in a variety of habitats, including deserts, forests, and grasslands. They are carnivorous and primarily eat small mammals, such as mice and rabbits.</s>
Total inference time: 12.17 s
Number of tokens generated: 256
Time per token: 0.05 s/token
Tokens per second: 21.03 token/s
xdevfaheem commented 1 year ago

Bruhh.. just compiling from source solved it! :)

xdevfaheem commented 1 year ago

It took a day to get this inference speed, tbh.

So finally, compiling from source works.

Is there any perplexity drop or accuracy drop with GPTQ quants? That is what concerns me a little bit.

xdevfaheem commented 1 year ago

Compiling from source worked because the CUDA version used by PyTorch is the same as Colab's CUDA version, so it compiled successfully.

CUDA version used by PyTorch:

Requirement already satisfied: torch in /usr/local/lib/python3.10/dist-packages (2.0.1+cu118)

Colab's CUDA version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
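
A quick way to check that the two match is something like:

python3 -c "import torch; print(torch.version.cuda)"   # CUDA version PyTorch was built against
nvcc --version                                          # CUDA toolkit version on the machine
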
geekinglcq commented 1 year ago
python3 -c 'import torch ; import autogptq_cuda'

Here it is

Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /usr/local/lib/python3.10/dist-packages/autogptq_cuda.cpython-310-x86_64-linux-gnu.so)

Hi, I have met the same problem of `GLIBC_2.32' not found, but unfortunately I cannot fix it by compiling AutoGPTQ from source, as some other error occurs. Could you tell me your PyTorch version, so that I can check whether the PyTorch version matters? Thanks~

xdevfaheem commented 1 year ago

My friend, can you compile from source and show me the error log?

geekinglcq commented 1 year ago

My friend, can you compile from source and show me the error log?

The error was: identifier "__hfma2" is undefined. I just solved the problem by adding an env setting: TORCH_CUDA_ARCH_LIST="7.5" pip install . Thanks~

xdevfaheem commented 1 year ago

Glad!