[Open] woaipichuli opened this issue 5 months ago
Is there an existing issue for this?
Current Behavior
With both chatglm2-6b and chatglm2-6b-int4, first-token latency grows rapidly with input length: going from an input length of 512 to 2048, first-token latency increases from 500 ms to 1.8 s.
Expected Behavior
The input (prefill) phase should be processed in parallel, so why does latency grow this steeply?
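For context, a prefill pass does process all prompt tokens in parallel, but the total work still grows with prompt length: the linear projections scale linearly in token count and self-attention quadratically. A back-of-envelope sketch (an illustration, not from this issue, assuming standard per-layer transformer FLOP counts and ChatGLM2-6B's 4096 hidden size) predicts roughly a 4x cost increase from 512 to 2048 tokens, which is in the same range as the reported 500 ms → 1.8 s:

# Rough per-layer prefill FLOPs: ~24*n*d^2 for the linear projections and MLP,
# plus ~4*n^2*d for attention scores and value mixing (illustrative estimate only).
def prefill_flops(n, d=4096):
    return 24 * n * d**2 + 4 * n**2 * d

ratio = prefill_flops(2048) / prefill_flops(512)
print(f"estimated prefill-cost ratio, 2048 vs 512 tokens: {ratio:.1f}x")  # ~4.2x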
Steps To Reproduce
import torch
from transformers import AutoTokenizer, AutoModel
from peft import PeftModel

# Load the tokenizer and base model (trust_remote_code is required for ChatGLM2).
tokenizer = AutoTokenizer.from_pretrained(base_model_name_or_path, trust_remote_code=True)
# Note: `revision` expects a branch/tag/commit string, not True; drop it or pass a string.
base_model = AutoModel.from_pretrained(base_model_name_or_path, trust_remote_code=True)
model = PeftModel.from_pretrained(base_model, peft_model_id, torch_dtype=torch.float16)

text = "测试文本"  # renamed from `str` to avoid shadowing the built-in
pt_data = tokenizer(text, return_tensors="pt", padding=True).to("cuda")

# max_length = prompt length + 1, so generate() emits exactly one new token;
# this isolates the first-token (prefill) latency. With do_sample=False decoding
# is greedy, so top_p and temperature have no effect.
gen_kwargs = {
    "max_length": pt_data["input_ids"].shape[-1] + 1,
    "num_beams": 1,
    "do_sample": False,
    "top_p": 0.8,
    "temperature": 0,
    "logits_processor": logits_processor,  # defined elsewhere in the reporter's script
}

# The keyword arguments must be unpacked; generate(pt_data, gen_kwargs) passes
# them as positional arguments and fails.
outputs = model.generate(**pt_data, **gen_kwargs)
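A minimal timing harness along these lines (a hypothetical helper, reusing `model` and `tokenizer` from above; torch.cuda.synchronize() is needed for accurate GPU timing) reproduces the measurement at each input length:

import time
import torch

def first_token_latency_ms(n_tokens):
    # Build a prompt capped at n_tokens tokens by repeating a short string.
    text = "测试" * n_tokens
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=n_tokens).to("cuda")
    torch.cuda.synchronize()
    start = time.perf_counter()
    # Generate exactly one new token to time the prefill phase alone.
    model.generate(**inputs, max_length=inputs["input_ids"].shape[-1] + 1,
                   num_beams=1, do_sample=False)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) * 1000

for n in (512, 1024, 2048):
    print(f"{n} tokens: {first_token_latency_ms(n):.0f} ms")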
Environment
Verified on both V100 and T4 GPUs.
Anything else?
No response