NVIDIA / FasterTransformer

Transformer related optimization, including BERT, GPT
Apache License 2.0

[bug] gptneox decoupled mode: wrong output length #704

Open RobotGF opened 1 year ago

RobotGF commented 1 year ago

When using fastertransformer_backend with decoupled mode set to True, the output differs from the output with decoupled mode False, and the reported output length is wrong.

Branch/Tag/Commit

main

Docker Image Version

triton-py3-22.12

GPU name

A40/3090

CUDA Driver

525.105.17

Reproduced Steps

gptneox fastertransformer_backend
model_transaction_policy {
  decoupled: True
}

The streamed output_length values are wrong, e.g. 48, 50, 52, 54 where 48, 49, 50, 51 is expected.
RobotGF commented 1 year ago
if (token_generated_cb_ && step + 1 < (int)max_output_seq_len) {
    setOutputTensors(output_tensors, input_tensors, max_input_length, max_output_seq_len);
    sendTensorsToFirstPipelineNode(output_tensors, input_tensors);
    if (pipeline_para_.rank_ == 0 && tensor_para_.rank_ == 0) {
        token_generated_cb_(output_tensors, token_generated_ctx_);
    }
}

In the function setOutputTensors(output_tensors, input_tensors, max_input_length, max_output_seq_len):

// add sequence_length 1 here because the sequence_length of time step t is t - 1
param.max_sequence_length_final_step = 1;

This code is correct only for non-decoupled mode; when decoupled is True, it generates the wrong output length.