bytedance / lightseq

LightSeq: A High Performance Library for Sequence Processing and Generation

How to generate longer sentences #336

Closed: koking0 closed this issue 7 months ago

koking0 commented 2 years ago

I am trying to accelerate inference of a Chinese GPT2 model with LightSeq. Although I can generate different sentences by configuring the topk and topp parameters, I cannot adjust the generation length. Can the model.sample method support a max_length parameter, similar to the max_length argument of HuggingFace's model.generate method?

my GPT2-HuggingFace code:

import time

from transformers import GPT2Tokenizer, GPT2LMHeadModel

hf_model_path = "IDEA-CCNL/Wenzhong-GPT2-110M"
tokenizer = GPT2Tokenizer.from_pretrained(hf_model_path)

question = "西湖的景色"
inputs = tokenizer(question, return_tensors='pt')["input_ids"]

model = GPT2LMHeadModel.from_pretrained(hf_model_path)
model.save_pretrained("./models/Wenzhong-GPT2-110M")
# inference starts
start_time = time.perf_counter()
generated_ids = model.generate(inputs, return_dict_in_generate=True,
                               output_scores=True, max_length=150, do_sample=True, top_p=0.9, eos_token_id=50256,
                               pad_token_id=0, num_return_sequences=5)
# inference ends
end_time = time.perf_counter()

for idx, sentence in enumerate(generated_ids.sequences):
    print('next sentence %d:\n' % idx,
          tokenizer.decode(sentence).split('<|endoftext|>')[0])
    print('*' * 40)

print("=" * 88)
print(f"inference time: {end_time - start_time} s")

my GPT2-LightSeq code:

import time

import torch
import lightseq.inference as lsi


def ls_gpt2(model, inputs, generation_method="topk"):
    torch.cuda.synchronize()
    start_time = time.perf_counter()
    results = None
    if generation_method == "topk" or generation_method == "topp":
        results = model.sample(inputs, topk=32, topp=0.9)
    elif generation_method == "ppl":
        results = model.ppl(inputs)[0]
    torch.cuda.synchronize()
    end_time = time.perf_counter()
    return results, end_time - start_time


def ls_generate(model, tokenizer, inputs):
    print("=========lightseq=========")
    print("lightseq generating...")
    print("inputs.size():", inputs.size())
    ls_res_ids, ls_time = ls_gpt2(model, inputs)
    ls_res = tokenizer.batch_decode(ls_res_ids, skip_special_tokens=True)
    print(f"lightseq time: {ls_time}s")
    print("lightseq results:")
    for sent in ls_res:
        print(sent)


# tokenizer and inputs are reused from the HuggingFace snippet above
model = lsi.Gpt("models/Wenzhong-GPT2-110M/lightseq_gpt2_base.hdf5", max_batch_size=5)
inputs = inputs.repeat(5, 1)
ls_generate(model, tokenizer, inputs)
godweiyang commented 2 years ago

You can modify the following code: https://github.com/bytedance/lightseq/blob/master/examples/inference/python/export/huggingface/hf_gpt2_export.py#L165
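
For context, that export script bakes the generation settings (including the maximum number of decoding steps) into the exported HDF5 file. A rough sketch of the relevant part, assuming the variable names used in that script (check your local copy, since the exact line numbers and names may differ):

    # in examples/inference/python/export/huggingface/hf_gpt2_export.py (sketch; names assumed)
    topk = 1
    topp = 0.75
    eos_id = 50256
    pad_id = 50257
    max_step = 150  # raise this from the default to allow longer generated sequences

    extract_gpt_weights(
        output_lightseq_model_name,
        input_huggingface_gpt_model,
        head_num=head_number,
        generation_method=generation_method,
        topk=topk,
        topp=topp,
        eos_id=eos_id,
        pad_id=pad_id,
        max_step=max_step,
    )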

Taka152 commented 2 years ago

I'm not sure what the equivalent of max_step is in HuggingFace, but if you want to stop generation and accept results that may not have ended correctly, you can change max_step in examples/inference/python/export/huggingface/hf_gpt2_export.py.
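
In other words, the maximum generation length is fixed at export time rather than passed to model.sample. A minimal sketch of the workflow, reusing the model path and sampling parameters from this issue: increase max_step in the export script, re-run it, then reload the new HDF5 and sample again:

    import lightseq.inference as lsi
    from transformers import GPT2Tokenizer

    # reload the HDF5 that was re-exported with a larger max_step (e.g. 150)
    tokenizer = GPT2Tokenizer.from_pretrained("IDEA-CCNL/Wenzhong-GPT2-110M")
    inputs = tokenizer("西湖的景色", return_tensors="pt")["input_ids"].repeat(5, 1)

    model = lsi.Gpt("models/Wenzhong-GPT2-110M/lightseq_gpt2_base.hdf5", max_batch_size=5)
    res_ids = model.sample(inputs, topk=32, topp=0.9)

    # generated sequences can now be up to max_step tokens long
    for sent in tokenizer.batch_decode(res_ids, skip_special_tokens=True):
        print(sent)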
