NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Access to context hidden states #595

Open · akhoroshev opened this issue 10 months ago

akhoroshev commented 10 months ago

Is there a way to access the context hidden states, i.e. the tensor with shape [batch_size, max_input_token_num, hidden_size]? In FasterTransformer it was easy: at this point (after the context decoding phase) I just accessed the tensor (decoder_output_tensors["decoder_output"]).

byshiue commented 10 months ago

You could mark it as an output when you build the engine. Using GPT as an example, you could mark the hidden_states here as output.
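
For concreteness, a minimal sketch of what that could look like, modeled loosely on a GPT-style forward(); apart from mark_output itself, the surrounding names are illustrative and version dependent:

    # Hypothetical sketch of a GPT-style forward(). Only
    # Tensor.mark_output() is the actual TensorRT-LLM call at issue;
    # the rest is illustrative.
    def forward(self, input_ids, position_ids):
        hidden_states = self.transformer(input_ids, position_ids)
        lm_logits = self.lm_head(hidden_states)
        lm_logits.mark_output('logits', self._logits_dtype)  # existing output
        # Extra line: expose the final hidden states as an additional
        # engine output, so the runtime can fetch them after the build.
        hidden_states.mark_output('hidden_states', self.dtype)
        return lm_logits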

akhoroshev commented 9 months ago

Hi @byshiue!

As an experiment, I added this line here:

hidden_states.mark_output('hidden_states_output_test', self.dtype)

After that I built the engine and ran gptManagerBenchmark.
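
(As a sanity check, the plain TensorRT Python API can list the engine's I/O tensors; if the rebuild picked up the change, hidden_states_output_test should show up among the outputs. The engine filename below is hypothetical.)

    # Sketch: list every I/O tensor of the built engine with plain TensorRT.
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    with open("gpt_float16_tp1_rank0.engine", "rb") as f:  # hypothetical path
        engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())

    for i in range(engine.num_io_tensors):
        name = engine.get_tensor_name(i)
        mode = engine.get_tensor_mode(name)  # trt.TensorIOMode.INPUT / OUTPUT
        print(mode, name, engine.get_tensor_dtype(name))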

I also modified gptManagerBenchmark to print the output tensors:

    // Dump the name, shape, and memory type of every tensor
    // returned in the response.
    for (auto& tensor : response_tensors) {
        TLLM_LOG_INFO(tensor.name);
        auto shape = tensor.tensor->getShape();
        TLLM_LOG_INFO("shape");
        std::cout << "[";
        for (auto i = 0; i < shape.nbDims; i++)
            std::cout << shape.d[i] << ", ";
        std::cout << "]" << std::endl;
        TLLM_LOG_INFO("type");
        // Printed as the integer value of the MemoryType enum.
        auto type = tensor.tensor->getMemoryType();
        std::cout << static_cast<int32_t>(type) << std::endl;
    }

But I can only see the "default" output tensors in the output:

[TensorRT-LLM][INFO] output_ids
[TensorRT-LLM][INFO] shape
[1, 1, 1213, ]
[TensorRT-LLM][INFO] type
1
[TensorRT-LLM][INFO] sequence_length
[TensorRT-LLM][INFO] shape
[1, 1, ]
[TensorRT-LLM][INFO] type
1
[TensorRT-LLM][INFO] output_log_probs
[TensorRT-LLM][INFO] shape
[1, 1, 1024, ]
[TensorRT-LLM][INFO] type
1
[TensorRT-LLM][INFO] cum_log_probs
[TensorRT-LLM][INFO] shape
[1, 1, ]
[TensorRT-LLM][INFO] type
1

Is it possible to forward a "custom" output tensor through GptManager?

byshiue commented 9 months ago

It might be hard to add in the C++ runtime; you could try adding it in the Python runtime first.
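
(For reference, a rough sketch of what reading the extra output from the Python runtime could look like, using the low-level tensorrt_llm.runtime.Session wrapper. The input dict is deliberately abbreviated and the dtypes are hardcoded: a real GPT context phase needs the full, version-specific set of inputs, so this only shows the overall pattern.)

    # Hedged sketch: fetch a custom engine output via the low-level
    # Python Session wrapper. Input names/dtypes below are illustrative.
    import tensorrt as trt
    import torch
    from tensorrt_llm.runtime import Session, TensorInfo

    with open("gpt_float16_tp1_rank0.engine", "rb") as f:  # hypothetical path
        session = Session.from_serialized_engine(f.read())

    inputs = {
        "input_ids": torch.tensor([[1, 2, 3]], dtype=torch.int32, device="cuda"),
        # ... every other context-phase input the engine expects ...
    }

    # Let TensorRT infer the output shapes from the input shapes.
    out_info = session.infer_shapes(
        [TensorInfo(n, trt.DataType.INT32, tuple(t.shape)) for n, t in inputs.items()]
    )
    outputs = {
        t.name: torch.empty(tuple(t.shape), dtype=torch.float16, device="cuda")
        for t in out_info
    }

    stream = torch.cuda.Stream()
    session.run(inputs, outputs, stream.cuda_stream)
    stream.synchronize()

    # The tensor marked at build time is just another entry here.
    hidden = outputs["hidden_states_output_test"]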

deidaraho commented 3 weeks ago

It might be hard to add in the C++ runtime; you could try adding it in the Python runtime first.

I have the same question. If the Python code is modified as you suggest above and I then build the GPT model with trtllm, can the hidden_states be passed into the postprocessing part?