akhoroshev opened this issue 1 year ago
This is a known issue; please refer to https://github.com/NVIDIA/FasterTransformer/pull/487.
I used the tmp/fix_gpt_earlystop branch, but I still have the same problem.
Please provide your reproduction steps.
The issue is fixed in MR https://github.com/NVIDIA/FasterTransformer/pull/584 and merged into the main branch. Sorry for the late fix.
Hello, I wonder if #584 also applies to GPT-J?
I am testing inference with Triton Server's fastertransformer backend and a converted GPT-J model, and the latency grows in proportion to request_output_len.
If I request 30 tokens, it takes about 150 ms and the last 5 tokens are EOS tokens. If I request 330 tokens, it takes about 1400 ms and the last 305 tokens are all EOS tokens.
@byshiue thanks for fixing in parallel gpt 😌
Do gptneox and gptj have the same bug? When I tested gptneox I hit the same issue.
Do you have plans to fix them?
@byshiue Thanks for fixing the problem.
However, I found that when running Bloom on a single GPU it still does not stop after generating the EOS token; it keeps generating until the maximum length is reached.
I added a check on finished_buf_ after the "result sampling and stop check" step, and my test passed. I am not sure whether it is correct in the parallel scenario; could you please review it?
ParallelGpt.cc:
 PUSH_RANGE("result sampling and stop check");
 dynamic_decode_layer_->forward(&dynamic_decode_output_tensors, &dynamic_decode_input_tensors);
+// check whether every sequence in the sub-batch has finished
+cudaD2Hcpy(h_finished_buf_, finished_buf_, batch_size * beam_width);
+uint sum = 0;
+for (uint i = 0; i < batch_size * beam_width; i++) {
+    sum += (int)h_finished_buf_[i];
+}
+if (sum == batch_size * beam_width) {
+    subbatch_should_stop = true;
+}
 *generation_should_stop_ &= subbatch_should_stop;
 POP_RANGE;
I played with examples/cpp/gpt/gpt_example.cc and found that token generation does not stop when the first EOD token is reached. This is my gpt_config.ini; I use a gpt2 model converted into the FT format.
For example, if I set
and feed the tokens [4919, 389] as input, I receive this log output
and this result (103 sensible tokens at the beginning and 27 EOD tokens in the tail). The whole computation took 149.00 ms.
But when I set
I receive this log output
and this result (103 sensible tokens at the beginning, the same as the previous run, which is expected since the seed is fixed, and 899 EOD tokens in the tail). The whole computation took 810.24 ms.
The question is: why does generating EOD tokens consume time?