fpgaminer / GPTQ-triton

GPTQ inference Triton kernel
Apache License 2.0

num_beams > 1 sometimes breaks inference #11

Closed Qubitium closed 1 year ago

Qubitium commented 1 year ago
env:
transformers [fpga PR performance-fix branch]
pytorch 2.0.0+cu118
Nvidia 4090
Model: 30B 4bit act-order sequential (quantized using GPTQ-triton script)

num_beams = 2
length_penalty  = 1.3

On longer prompts/inputs I am encountering the following error when num_beams is set to 2. Shorter prompts appear to have no issue with multiple beams.

  File "/root/test.py", line 223, in process
    gen_output = model.generate(
  File "/root/miniconda3/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/miniconda3/lib/python3.9/site-packages/transformers/generation/utils.py", line 1585, in generate
    return self.beam_sample(
  File "/root/miniconda3/lib/python3.9/site-packages/transformers/generation/utils.py", line 3210, in beam_sample
    next_tokens = torch.multinomial(probs, num_samples=2 * num_beams)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
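
As a debugging aid, here is a minimal sketch of how the failing scores could be inspected by hooking a custom LogitsProcessor into generate(); the ReportNonFinite class below is purely illustrative, not part of GPTQ-triton or transformers:

import torch
from transformers import LogitsProcessor, LogitsProcessorList

class ReportNonFinite(LogitsProcessor):
    # Report the first generation step whose scores contain inf/nan,
    # which is what torch.multinomial later rejects.
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        if not torch.isfinite(scores).all():
            bad = (~torch.isfinite(scores)).sum().item()
            print(f"non-finite scores at sequence length {input_ids.shape[-1]}: {bad} entries")
        return scores

# Hooked into the failing call:
# gen_output = model.generate(..., logits_processor=LogitsProcessorList([ReportNonFinite()]))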
fpgaminer commented 1 year ago

Thank you for the bug report. I'll take a look to see what's going on.

fpgaminer commented 1 year ago

I'm having difficulty replicating. Would you mind sharing your prompt length and generate call? Here's my attempt:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
from gptq_triton import load_quant
from transformers import AutoTokenizer, LlamaForCausalLM
import random

model_path = 'weights/llama-7b-triton-4bit-c4-group-1-act-seq/'
model = load_quant(model_path, warmup_autotune=False)
model.eval()
model.to('cuda')
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)

target_prompt_length = 2048
prompt = ''

while True:
    prompt = prompt + ''.join(random.choice('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 .,;:!?\n') for _ in range(2048 * 10))
    # Encode and crop down
    encoded_prompt = tokenizer.encode(prompt, add_special_tokens=False, return_tensors='pt')
    if encoded_prompt.shape[1] > target_prompt_length:
        encoded_prompt = encoded_prompt[:, :target_prompt_length]
        encoded_prompt = encoded_prompt.to('cuda')
        break

output_sequences = model.generate(
    input_ids=encoded_prompt,
    max_length=128 + len(encoded_prompt[0]),
    temperature=0.7,
    num_return_sequences=1,
    num_beams=2,
    length_penalty=1.3,
    do_sample=True,
)

I ran it a few times with a couple of different prompt lengths but never got it to error out. It's possible the issue only crops up on the 30B model, which I haven't tried yet.

Qubitium commented 1 year ago

I will isolate the setting/param that is causing this on my end, using your gen code above as a baseline.

Qubitium commented 1 year ago

Findings from my iterations so far, where my generate call is exactly the same as yours (num_beams=2, temp=0.7, etc., 7 params total):

output_sequences = model.generate(
    input_ids=encoded_prompt,
    max_length=128 + len(encoded_prompt[0]),
    temperature=0.7,
    num_return_sequences=1,
    num_beams=2,
    length_penalty=1.3,
    do_sample=True,
)

Here are the differing results so far:

Errors when:

  1. max_length=512 is set: I crash constantly like before with RuntimeError: probability tensor contains either `inf`, `nan` or element < 0.
  2. max_length is unset and max_new_tokens=512 is set: no crash, but over 75% of the output is garbled junk (both call variants are sketched below, after the size breakdown).

The above 2 errors occur at the following input/output sizes:

Prompt tokens size: 156
Output tokens size (including prompt prefix):  327
New tokens: 171

No errors, normal result at the following sizes:

Prompt tokens size: 76
Output tokens size (including prompt prefix):  98
New tokens: 22
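
For reference, a sketch of the two call variants above; they differ only in how the length budget is passed, everything else matches the baseline call (same encoded_prompt as before):

# Variant 1: fixed total length budget. Crashes with the multinomial RuntimeError.
output_sequences = model.generate(
    input_ids=encoded_prompt,
    max_length=512,
    temperature=0.7,
    num_return_sequences=1,
    num_beams=2,
    length_penalty=1.3,
    do_sample=True,
)

# Variant 2: new-token budget instead of max_length. No crash, but mostly garbled output.
output_sequences = model.generate(
    input_ids=encoded_prompt,
    max_new_tokens=512,
    temperature=0.7,
    num_return_sequences=1,
    num_beams=2,
    length_penalty=1.3,
    do_sample=True,
)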

My tokenizer uses the fast implementation, but I also tested the non-fast one and nothing changes:

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, padding_side="left", add_special_tokens=False, use_fast=True)

The above uses your latest head, requantized to 30B 4-bit sequential with groupsize=128.

Edit: I am continuing to test to weed out all other possibilities. The above is just what I have found so far.

Qubitium commented 1 year ago

I have ruled out the following as the source of the issue:

  1. Tested pytorch 2.1 nightly + 2.0 stable
  2. Tested Openai/Triton 2.0.0 and 2.0.0.post1 (nightly)
  3. Tested Transformers 4.29.0.dev (head as of this post) and 4.28.1 (release)

Edit: Added the transformers head+stable test. Edit2: I have stopped testing, as I have run out of env/pkg differences that I think may affect the runtime.

Qubitium commented 1 year ago

The qwop version has a very similar batch bug that was supposedly fixed with https://github.com/qwopqwop200/GPTQ-for-LLaMa/commit/d1c6d72146af0f462993b5f380ac6362cb27fe02. I have not tested it yet, but since the triton codebase had its origin there, perhaps the bug fix can be ported over.


fpgaminer commented 1 year ago

Thank you for the details, I've got it to reproduce now. It triggers when the generated token length is >= 256, so my 128-token test didn't trigger it. It also triggers even on the 7B model, which makes testing easier.
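
For reference, here's a sketch of the tweak to my earlier script that pushes it past that threshold (the exact budget I used may have differed; same encoded_prompt as before):

output_sequences = model.generate(
    input_ids=encoded_prompt,
    max_length=256 + len(encoded_prompt[0]),  # room for 256 generated tokens, at the threshold
    temperature=0.7,
    num_return_sequences=1,
    num_beams=2,
    length_penalty=1.3,
    do_sample=True,
)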

I'll start digging and see what's going on.

fpgaminer commented 1 year ago

This occurs even with an FP16 Hugging Face LLaMA, so it appears to be an issue with the transformers library, not GPTQ-triton. Still, I'm curious enough to keep digging.
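
For reference, the FP16 control run just swaps load_quant() for a plain Hugging Face load and keeps the rest of the repro script unchanged. A minimal sketch, assuming an FP16 LLaMA checkpoint at a placeholder path:

import torch
from transformers import LlamaForCausalLM

# 'weights/llama-7b-hf/' is a placeholder; any FP16 Hugging Face LLaMA checkpoint works here.
model = LlamaForCausalLM.from_pretrained('weights/llama-7b-hf/', torch_dtype=torch.float16)
model.eval()
model.to('cuda')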

fpgaminer commented 1 year ago

I've opened a bug report at transformers with details of my analysis thus far: https://github.com/huggingface/transformers/issues/22914

Closing this issue as it isn't related specifically to GPTQ-triton.