Closed: pseudotensor closed this issue 9 months ago
I understand why users might report this as an issue. Previously, in 0.1.7, we hard-coded rope_theta=10000, but in 0.1.8 we read it from the config. The model linked just happens to have rope_theta=1000000, which can make generation a bit slower.
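For context, the behavioral difference is roughly this (a minimal sketch, not the actual 0.1.7/0.1.8 code paths):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("TheBloke/openchat_3.5-16k-AWQ")

rope_theta_017 = 10000.0                                 # hard-coded in 0.1.7
rope_theta_018 = getattr(config, "rope_theta", 10000.0)  # read from the model config in 0.1.8
print(rope_theta_018)  # 1000000.0 for this 16k-context model
```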
So can you explain the ~1s delay between the first token and all subsequent tokens? It's not just that generation is slower overall; that's only one issue. The other is the large lag between the first token and all the other tokens.
I am not able to explain it without bisecting between commits, which I won't have time to do right now. A potential culprit is the change in how we handle position ids, but it's tough to say without thorough testing.
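To illustrate what I mean by position id handling, here is a generic sketch of incremental decoding (the shapes are assumptions for illustration, not our actual fused-layer code):

```python
import torch

# Prefill covers positions [0, seq_len); each decode step should then use a
# single position id, seq_len + step.
seq_len = 10  # prompt length after tokenization
prefill_position_ids = torch.arange(seq_len).unsqueeze(0)  # shape (1, seq_len)

step = 0
decode_position_ids = torch.tensor([[seq_len + step]])     # shape (1, 1)

# A regression here (e.g. recomputing RoPE over the full sequence every step,
# or an off-by-one start index) would surface as extra latency right after
# the first token, which matches the reported symptom.
```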
Understood. I haven't seen such a massive lag in other tools, and 0.1.7 doesn't exhibit it. I'll recommend users stick with 0.1.7 if they are concerned. Thanks!
Closing this issue as not planned for now; I am not having an easy time reproducing the reported delay.
Reported by user of h2oGPT: https://github.com/h2oai/h2ogpt/issues/1309
I used an edited version of the text streamer, changed only to print every token instead of waiting for a space (sketched below). You'll see it print "Wh", then wait nearly 0.5 seconds, then continue.
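Roughly, the edit looks like this (a minimal sketch subclassing transformers' TextStreamer; PerTokenStreamer is an illustrative name, not the exact code I ran):

```python
from transformers import TextStreamer

class PerTokenStreamer(TextStreamer):
    """Print every token as soon as it arrives instead of buffering until a
    word boundary, so inter-token gaps become visible."""

    def put(self, value):
        # The first call receives the prompt tokens; skip them like the base class.
        if self.skip_prompt and self.next_tokens_are_prompt:
            self.next_tokens_are_prompt = False
            return
        if len(value.shape) > 1:
            value = value[0]
        # Decode and flush each new token immediately.
        print(self.tokenizer.decode(value.tolist(), skip_special_tokens=True),
              end="", flush=True)
```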
This only occurs with 0.1.8, not with 0.1.7.
Also, the second and subsequent generations in the script run in 3.6 seconds with 0.1.8 versus 2.5 seconds with 0.1.7. So there are two problems, though perhaps the per-token delay accounts for all of the difference.
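One way to quantify the gap rather than eyeballing it (a hypothetical helper sketched here, not something h2oGPT ships):

```python
import time
from transformers import TextStreamer

class TimingStreamer(TextStreamer):
    """Record a wall-clock timestamp for every put() call so the gap between
    the first and second token can be measured directly."""

    def __init__(self, tokenizer, **kwargs):
        super().__init__(tokenizer, **kwargs)
        self.timestamps = []

    def put(self, value):
        self.timestamps.append(time.time())
        super().put(value)

# After model.generate(..., streamer=streamer) finishes:
# (the first entry corresponds to the prompt call when skip_prompt=True)
# gaps = [b - a for a, b in zip(streamer.timestamps, streamer.timestamps[1:])]
# max(gaps) should be ~0.5s on 0.1.8 per the observation above, and much smaller on 0.1.7.
```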
Script, run on 4×A10G or 4×A6000:
```python
import time

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer

quant_path = "TheBloke/openchat_3.5-16k-AWQ"

# Load model
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

prompt = "..."  # prompt text not shown in the original report
prompt_template = """GPT4 Correct User: {prompt}<|end_of_turn|>GPT4 Correct Assistant:"""

tokens = tokenizer(
    prompt_template.format(prompt=prompt),
    return_tensors='pt',
).input_ids.cuda()

# Generate output: first generation (includes warmup)
t0 = time.time()
generation_output = model.generate(
    tokens,
    streamer=streamer,
    max_new_tokens=256,
)
print("duration: %s" % (time.time() - t0), flush=True)

# Second generation, timed separately
t0 = time.time()
generation_output = model.generate(
    tokens,
    streamer=streamer,
    max_new_tokens=256,
)
print("duration: %s" % (time.time() - t0), flush=True)

time.sleep(100)
```