Closed: koking0 closed this issue 7 months ago.
You can modify the following code: https://github.com/bytedance/lightseq/blob/master/examples/inference/python/export/huggingface/hf_gpt2_export.py#L165
I'm not sure how max_step maps onto the HuggingFace behavior, but if you want to stop generation early and accept results that may not end correctly, you can change max_step in examples/inference/python/export/huggingface/hf_gpt2_export.py
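For example, here is a rough sketch of that change; the function name, argument names, and defaults are assumptions on my part, so check the call at the linked line of the script for the exact code. The key point is that max_step bounds the number of generated tokens, which is the closest equivalent to max_length in HuggingFace's model.generate:

# Hypothetical sketch of the export call near the end of hf_gpt2_export.py;
# names and values are assumptions, see the linked line for the real call.
extract_gpt_weights(
    output_lightseq_model_name,   # e.g. "lightseq_gpt2_base.hdf5"
    input_huggingface_gpt_model,  # e.g. "IDEA-CCNL/Wenzhong-GPT2-110M"
    generation_method="topk",
    topk=32,
    topp=0.9,
    eos_id=50256,
    pad_id=0,
    max_step=150,  # raise this to allow longer outputs
)

After editing max_step, rerun the export script so the regenerated HDF5 carries the new limit.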
On Fri, Jul 8, 2022 at 12:30 PM Alex_996 @.***> wrote:
I am trying to accelerate the inference of a Chinese GPT2 model with LightSeq. Although different sentences can be generated by configuring the topk and topp parameters, the output length cannot be adjusted. Can the model.sample method support a max_length parameter like HuggingFace's model.generate method?
my GPT2-HuggingFace code:
import time

from transformers import GPT2Tokenizer, GPT2LMHeadModel

hf_model_path = "IDEA-CCNL/Wenzhong-GPT2-110M"
tokenizer = GPT2Tokenizer.from_pretrained(hf_model_path)
question = "西湖的景色"
inputs = tokenizer(question, return_tensors='pt')["input_ids"]

model = GPT2LMHeadModel.from_pretrained(hf_model_path)
model.save_pretrained("./models/Wenzhong-GPT2-110M")

# inference starts
start_time = time.perf_counter()
generated_ids = model.generate(
    inputs, return_dict_in_generate=True, output_scores=True,
    max_length=150, do_sample=True, top_p=0.9,
    eos_token_id=50256, pad_token_id=0, num_return_sequences=5)
# inference ends
end_time = time.perf_counter()

for idx, sentence in enumerate(generated_ids.sequences):
    print('next sentence %d:\n' % idx,
          tokenizer.decode(sentence).split('<|endoftext|>')[0])
    print('*' * 40)
print("=" * 88)
print(f"inference time: {end_time - start_time} s")
my GPT2-LightSeq code:
import time

import torch
import lightseq.inference as lsi


def ls_gpt2(model, inputs, generation_method="topk"):
    torch.cuda.synchronize()
    start_time = time.perf_counter()
    results = None
    if generation_method == "topk" or generation_method == "topp":
        results = model.sample(inputs, topk=32, topp=0.9)
    elif generation_method == "ppl":
        results = model.ppl(inputs)[0]
    torch.cuda.synchronize()
    end_time = time.perf_counter()
    return results, end_time - start_time


def ls_generate(model, tokenizer, inputs):
    print("=========lightseq=========")
    print("lightseq generating...")
    print("inputs.size():", inputs.size())
    ls_res_ids, ls_time = ls_gpt2(model, inputs)
    ls_res = tokenizer.batch_decode(ls_res_ids, skip_special_tokens=True)
    print(f"lightseq time: {ls_time}s")
    print("lightseq results:")
    for sent in ls_res:
        print(sent)


# tokenizer and inputs are the ones built in the HuggingFace snippet above
model = lsi.Gpt("models/Wenzhong-GPT2-110M/lightseq_gpt2_base.hdf5", max_batch_size=5)
inputs = inputs.repeat(5, 1)
ls_generate(model, tokenizer, inputs)
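Putting this together with the suggestion above, a possible workflow for longer outputs would be the following sketch; it assumes max_step is baked into the exported HDF5 and that model.sample takes no runtime length argument, as discussed in this thread:

import lightseq.inference as lsi

# 1. Raise max_step (e.g. to 150) in
#    examples/inference/python/export/huggingface/hf_gpt2_export.py
#    and rerun that script to regenerate the HDF5 with the new limit.
# 2. Reload the regenerated file and sample as before; the output length
#    is now bounded by the exported max_step.
model = lsi.Gpt(
    "models/Wenzhong-GPT2-110M/lightseq_gpt2_base.hdf5", max_batch_size=5
)
ls_generate(model, tokenizer, inputs)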