NetEase-FuXi / EET

Easy and Efficient Transformer : Scalable Inference Solution For Large NLP model
Apache License 2.0

gpt2 text generation pipeline batch generation output #4

Closed C00reNUT closed 2 years ago

C00reNUT commented 2 years ago

Hello, I am trying to use the text-generation pipeline from the Docker image with these parameters:

import torch
from eet import pipeline
max_batch_size = 16
data_type = torch.float16
input = "My name is Sarah and I live in London"
nlp = pipeline("text-generation", model = 'gpt2-medium', data_type = data_type, max_batch_size = max_batch_size, model_kwargs = {'nsamples':'1024', 'top_k':'40', 'temperature':'0.5', 'length':'30'})
out = nlp(input)
print(len(out))
print(out)

After execution I get the following output:

There are 0 buffer in cache vector
Request a cache of size : 8388608
There are 1 buffer in cache vector
Request a cache of size : 8388608
There are 2 buffer in cache vector
Request a cache of size : 8388608
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
There are 0 buffer in vector
Request a buffer of size : 25165824
There are 1 buffer in vector
Request a buffer of size : 8388608
There are 2 buffer in vector
Request a buffer of size : 8388608
There are 3 buffer in vector
Request a buffer of size : 8388608
There are 4 buffer in vector
Request a buffer of size : 67108864
1
[{'generated_text': "My name is Sarah and I live in London. I'm 29 and I would like to join the Army. I'd like it to be to my benefit. What was your experience of joining the Army?\n\nSarah Waugh: My first impressions"}]

That means that instead of 1024 samples (requested via the 'nsamples': '1024' parameter) I am still getting just one output. Is there something I am missing here?

dingjingzhen commented 2 years ago

There is no parameter called "nsamples". I guess you want to output 1000 different generated texts; if you want that, you can do so by setting num_return_sequences:

out = nlp(input, num_return_sequences=1000)

This approach actually expands the input text into a single batch of 1000 sequences, so you need to set max_batch_size to be greater than or equal to 1000. In practice, batch_size=1000 is rare; it is time-consuming and uses excessive memory.
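For reference, a minimal sketch of a call along those lines, reusing the pipeline setup from the original snippet (the count of 16 is only illustrative, chosen so the expanded batch stays within max_batch_size):

import torch
from eet import pipeline

# num_return_sequences is expanded into the batch dimension,
# so max_batch_size must be at least as large.
max_batch_size = 16
data_type = torch.float16

nlp = pipeline("text-generation", model='gpt2-medium', data_type=data_type, max_batch_size=max_batch_size)
out = nlp("My name is Sarah and I live in London", num_return_sequences=16)
print(len(out))  # expected: 16 generated texts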

dingjingzhen commented 2 years ago

When I tried num_return_sequences I found a bug; you can update EET to fix it. Thank you very much for using the library. If you have other needs, you can always talk to us.

C00reNUT commented 2 years ago

Thank you, num_return_sequences works now and I can generate multiple samples.

However, I still don't understand this:

This approach actually expands the input text into a single batch of 1000 sequences, so you need to set max_batch_size to be greater than or equal to 1000. In practice, batch_size=1000 is rare; it is time-consuming and uses excessive memory.

I can set max_batch_size = 10 and num_return_sequences=50, and it returns 50 results.

import torch
from eet import pipeline

max_batch_size = 10
data_type = torch.float16
input = "My name is Sarah and I live in London"
nlp = pipeline("text-generation", model = 'gpt2-medium', data_type = data_type, max_batch_size = max_batch_size)
out = nlp(input, num_return_sequences=50)
print(len(out))
print(out)

But when I use max_batch_size = 10 and num_return_sequences=500, I would expect it to return 500 examples; instead, the script crashes with this error:

import torch
from eet import pipeline

max_batch_size = 10
data_type = torch.float16
input = "My name is Sarah and I live in London"
nlp = pipeline("text-generation", model = 'gpt2-medium', data_type = data_type, max_batch_size = max_batch_size)
out = nlp(input, num_return_sequences=500)
print(len(out))
print(out)

RuntimeError                              Traceback (most recent call last)
Input In [2], in <cell line: 8>()
      6 input = "My name is Sarah and I live in London"
      7 nlp = pipeline("text-generation", model = 'gpt2-medium', data_type = data_type, max_batch_size = max_batch_size)
----> 8 out = nlp(input, num_return_sequences=500)
      9 print(len(out))
     10 print(out)

File /mnt/2287294e-32c7-437b-84bd-452a29548b1a/conda_env/EET/lib/python3.8/site-packages/eet/pipelines/text_generation.py:113, in TextGenerationPipeline.__call__(self, text_inputs, **kwargs)
    112 def __call__(self, text_inputs, **kwargs):
--> 113     return super().__call__(text_inputs, **kwargs)

File /mnt/2287294e-32c7-437b-84bd-452a29548b1a/conda_env/EET/lib/python3.8/site-packages/eet/pipelines/base.py:427, in Pipeline.__call__(self, inputs, num_workers, batch_size, *args, **kwargs)
    425     return self.iterate(inputs, preprocess_params, forward_params, postprocess_params)
    426 else:
--> 427     return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)

File /mnt/2287294e-32c7-437b-84bd-452a29548b1a/conda_env/EET/lib/python3.8/site-packages/eet/pipelines/base.py:434, in Pipeline.run_single(self, inputs, preprocess_params, forward_params, postprocess_params)
    432 def run_single(self, inputs, preprocess_params, forward_params, postprocess_params):
    433     model_inputs = self.preprocess(inputs, **preprocess_params)
--> 434     model_outputs = self.forward(model_inputs, **forward_params)
    435     outputs = self.postprocess(model_outputs, **postprocess_params)
    436     return outputs

File /mnt/2287294e-32c7-437b-84bd-452a29548b1a/conda_env/EET/lib/python3.8/site-packages/eet/pipelines/base.py:357, in Pipeline.forward(self, model_inputs, **forward_params)
    352 def forward(self, model_inputs, **forward_params):
    353     # with self.device_placement():
    354     #     inference_context = self.get_inference_context()
    355     #     with inference_context():
    356     model_inputs = self._ensure_tensor_on_device(model_inputs, device=self.device)
--> 357     model_outputs = self._forward(model_inputs, **forward_params)
    358     model_outputs = self._ensure_tensor_on_device(model_outputs, device=torch.device("cpu"))
    360     return model_outputs

File /mnt/2287294e-32c7-437b-84bd-452a29548b1a/conda_env/EET/lib/python3.8/site-packages/eet/pipelines/text_generation.py:151, in TextGenerationPipeline._forward(self, model_inputs, **generate_kwargs)
    149     in_b = input_ids.shape[0]
    150 prompt_text = model_inputs.pop("prompt_text")
--> 151 generated_sequence = self.model.generate(input_ids=input_ids, **generate_kwargs)  # BS x SL
    152 out_b = generated_sequence.shape[0]
    153 generated_sequence = generated_sequence.reshape(in_b, out_b // in_b, *generated_sequence.shape[1:])

File /mnt/2287294e-32c7-437b-84bd-452a29548b1a/conda_env/EET/lib/python3.8/site-packages/torch/autograd/grad_mode.py:27, in _DecoratorContextManager.__call__.<locals>.decorate_context(*args, **kwargs)
     24 @functools.wraps(func)
     25 def decorate_context(*args, **kwargs):
     26     with self.clone():
---> 27         return func(*args, **kwargs)

File /mnt/2287294e-32c7-437b-84bd-452a29548b1a/conda_env/EET/lib/python3.8/site-packages/eet/pipelines/generation.py:346, in GenerationMixin_EET.generate(self, inputs, max_length, min_length, do_sample, early_stopping, num_beams, temperature, top_k, top_p, typical_p, repetition_penalty, bad_words_ids, force_words_ids, bos_token_id, pad_token_id, eos_token_id, length_penalty, no_repeat_ngram_size, encoder_no_repeat_ngram_size, num_return_sequences, max_time, max_new_tokens, decoder_start_token_id, use_cache, num_beam_groups, diversity_penalty, prefix_allowed_tokens_fn, logits_processor, stopping_criteria, constraints, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, forced_bos_token_id, forced_eos_token_id, remove_invalid_values, synced_gpus, exponential_decay_length_penalty, **model_kwargs)
    338     input_ids, model_kwargs = self._expand_inputs_for_generation(
    339         input_ids,
    340         expand_size=num_return_sequences,
    341         is_encoder_decoder=self.config.is_encoder_decoder,
    342         **model_kwargs,
    343     )
    345     # 12. run sample
--> 346     return self.sample(
    347         input_ids,
    348         logits_processor=logits_processor,
    349         logits_warper=logits_warper,
    350         stopping_criteria=stopping_criteria,
    351         pad_token_id=pad_token_id,
    352         eos_token_id=eos_token_id,
    353         output_scores=output_scores,
    354         return_dict_in_generate=return_dict_in_generate,
    355         synced_gpus=synced_gpus,
    356         **model_kwargs,
    357     )
    359 elif is_beam_gen_mode:
    360     if num_return_sequences > num_beams:

File /mnt/2287294e-32c7-437b-84bd-452a29548b1a/conda_env/EET/lib/python3.8/site-packages/eet/pipelines/generation.py:789, in GenerationMixin_EET.sample(self, input_ids, logits_processor, stopping_criteria, logits_warper, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus, **model_kwargs)
    787 # sample
    788 probs = nn.functional.softmax(next_token_scores, dim=-1)
--> 789 next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
    791 # finished sentences should have their next token be a padding token
    792 if eos_token_id is not None:

RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Usually I can generate any number of samples (here controlled by num_return_sequences) as long as it is divisible by max_batch_size. That means I should be able to produce 50, 500, or 5000 samples with max_batch_size = 10; it would just take longer to generate.

Am I missing something? Is there some way to generate that many samples without running out of memory?
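One possible workaround along those lines is to issue several smaller requests and concatenate the results; a minimal sketch, assuming the same pipeline setup as above (the chunking loop is ordinary Python, not an EET feature):

import torch
from eet import pipeline

max_batch_size = 10
data_type = torch.float16
prompt = "My name is Sarah and I live in London"
total_samples = 500  # total number of generations we want

nlp = pipeline("text-generation", model='gpt2-medium', data_type=data_type, max_batch_size=max_batch_size)

# Request the samples in chunks no larger than max_batch_size,
# since each call expands num_return_sequences into a single batch.
outputs = []
while len(outputs) < total_samples:
    chunk = min(max_batch_size, total_samples - len(outputs))
    outputs.extend(nlp(prompt, num_return_sequences=chunk))

print(len(outputs))  # expected: 500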

C00reNUT commented 2 years ago

OK, I looked at the code and now I get it; it is implemented differently than I thought. Thank you for this nice library.