Thank you for the bug report. I'll take a look to see what's going on.
I'm having difficulty replicating. Would you mind sharing your prompt length and generate call? Here's my attempt:
```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from gptq_triton import load_quant
from transformers import AutoTokenizer, LlamaForCausalLM
import random

model_path = 'weights/llama-7b-triton-4bit-c4-group-1-act-seq/'

model = load_quant(model_path, warmup_autotune=False)
model.eval()
model.to('cuda')

tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)

target_prompt_length = 2048
prompt = ''

while True:
    prompt = prompt + ''.join(random.choice('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 .,;:!?\n') for _ in range(2048 * 10))
    # Encode and crop down
    encoded_prompt = tokenizer.encode(prompt, add_special_tokens=False, return_tensors='pt')
    if encoded_prompt.shape[1] > target_prompt_length:
        encoded_prompt = encoded_prompt[:, :target_prompt_length]
        encoded_prompt = encoded_prompt.to('cuda')
        break

output_sequences = model.generate(
    input_ids=encoded_prompt,
    max_length=128 + len(encoded_prompt[0]),
    temperature=0.7,
    num_return_sequences=1,
    num_beams=2,
    length_penalty=1.3,
    do_sample=True,
)
```
I ran it a few times with a couple of different prompt lengths but never got it to error out. It's possible the issue only crops up on the 30B model, which I haven't tried yet.
I will isolate the setting/param that is causing this on my end, using your generation code above as the baseline.
Findings from my iterations so far, where my generate call is exactly the same as yours (num_beams=2, temperature=0.7, etc., 7 params in total):
```python
output_sequences = model.generate(
    input_ids=encoded_prompt,
    max_length=128 + len(encoded_prompt[0]),
    temperature=0.7,
    num_return_sequences=1,
    num_beams=2,
    length_penalty=1.3,
    do_sample=True,
)
```
Here are the differing results so far:

Errors when:

- With `max_length=512`, it crashes constantly like before: `RuntimeError: probability tensor contains either inf, nan or element < 0`.
- With `max_length` removed and `max_new_tokens=512` set instead, it does not crash, but over 75% of the output is garbled junk. (See the sketch after the size breakdown below.)

Both of the above errors occur at the following input/output sizes:
Prompt tokens size: 156
Output tokens size (including prompt prefix): 327
New tokens: 171
No errors, normal result at the following sizes:
Prompt tokens size: 76
Output tokens size (including prompt prefix): 98
New tokens: 22
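To make the two failing configurations concrete, here is a minimal sketch of the generate calls described above (assuming the same `model`, `tokenizer`, and `encoded_prompt` setup as in the earlier script; the `common` dict is only an illustrative shorthand):

```python
# Shared arguments, identical to the baseline generate call.
common = dict(
    input_ids=encoded_prompt,
    temperature=0.7,
    num_return_sequences=1,
    num_beams=2,
    length_penalty=1.3,
    do_sample=True,
)

# Case 1: capping the total length at 512 tokens crashes with
# "RuntimeError: probability tensor contains either inf, nan or element < 0".
try:
    out = model.generate(max_length=512, **common)
except RuntimeError as e:
    print("crash:", e)

# Case 2: dropping max_length and using max_new_tokens=512 instead does not
# crash, but most of the decoded output comes back garbled.
out = model.generate(max_new_tokens=512, **common)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```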
My tokenizer uses the fast implementation, but I also tested non-fast and nothing changes:

```python
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, padding_side="left", add_special_tokens=False, use_fast=True)
```
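A quick way to confirm the fast and slow tokenizers behave identically on the prompt is to compare their encodings directly (a sketch; `MODEL_PATH` and `prompt` are assumed to be defined as above):

```python
from transformers import AutoTokenizer

fast = AutoTokenizer.from_pretrained(MODEL_PATH, use_fast=True)
slow = AutoTokenizer.from_pretrained(MODEL_PATH, use_fast=False)

fast_ids = fast.encode(prompt, add_special_tokens=False)
slow_ids = slow.encode(prompt, add_special_tokens=False)

# If the token ids match, the fast/slow choice cannot explain the garbled output.
print("identical:", fast_ids == slow_ids)
```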
The above runs use your latest HEAD, requantized to 30B 4-bit, sequential, groupsize=128.
Edit: I am still testing to weed out all other possibilities; the above is just what I have found so far.
I have ruled out the following as the source of the issue:
Edit: Added transformers HEAD + stable test. Edit 2: I have stopped testing, as I have run out of env/package differences that I think could affect the runtime.
The qwopqwop200 version has a very similar batch bug that was supposedly fixed with https://github.com/qwopqwop200/GPTQ-for-LLaMa/commit/d1c6d72146af0f462993b5f380ac6362cb27fe02. I have not tested that fix yet, but since this Triton codebase has its origin there, perhaps the fix can be ported over.
Thank you for the details, I've been able to reproduce it now. It triggers when the generated token length is >= 256, so my 128-token test didn't hit it. It also triggers on the 7B model, which makes testing easier.
I'll start digging and see what's going on.
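One way to pin down the >= 256 threshold is to sweep `max_new_tokens` and watch for the failure (a sketch, reusing `model` and `encoded_prompt` from the reproduction script above; the sweep values are illustrative):

```python
# Sweep the number of generated tokens to find where beam sampling starts to fail.
for new_tokens in (64, 128, 192, 256, 320):
    try:
        model.generate(
            input_ids=encoded_prompt,
            max_new_tokens=new_tokens,
            temperature=0.7,
            num_beams=2,
            do_sample=True,
        )
        print(f"max_new_tokens={new_tokens}: ok")
    except RuntimeError as e:
        print(f"max_new_tokens={new_tokens}: failed ({e})")
```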
This occurs even with an FP16 Hugging Face LLaMA, so it appears to be an issue with the transformers library, not GPTQ-triton. Still, I'm curious enough to keep digging.
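For reference, the FP16 check amounts to swapping the quantized model for a stock Hugging Face LLaMA and rerunning the same beam-sampling generate call (a sketch; the weights path and prompt text are placeholders, not the exact ones used here):

```python
import torch
from transformers import AutoTokenizer, LlamaForCausalLM

# Placeholder path to unquantized FP16 LLaMA weights.
fp16_path = 'weights/llama-7b-hf/'

model = LlamaForCausalLM.from_pretrained(fp16_path, torch_dtype=torch.float16).eval().to('cuda')
tokenizer = AutoTokenizer.from_pretrained(fp16_path, use_fast=False)

encoded_prompt = tokenizer.encode("some long prompt ...", add_special_tokens=False, return_tensors='pt').to('cuda')

# The same beam-sampling call as above reproduces the failure once
# enough new tokens are generated, pointing at transformers rather than GPTQ-triton.
output_sequences = model.generate(
    input_ids=encoded_prompt,
    max_new_tokens=512,
    temperature=0.7,
    num_return_sequences=1,
    num_beams=2,
    length_penalty=1.3,
    do_sample=True,
)
```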
I've opened a bug report against transformers with details of my analysis thus far: https://github.com/huggingface/transformers/issues/22914
Closing this issue as it isn't related specifically to GPTQ-triton.
On longer prompts/inputs I am encountering the error above (`RuntimeError: probability tensor contains either inf, nan or element < 0`) when num_beams is set to 2. Shorter prompts appear to have no issue with multiple beams.