TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
Before the attention operation, the q, k, and v tensors are packed into one big tensor `qkv`. I would like to perform some in-place operations on q and k only.
Currently I follow the existing code and split the tensor with

```python
query, key, value = split(qkv, [self.attention_hidden_size, kv_size, kv_size], dim=2)
```

but I saw in the comment that this actually requires a memory copy:

> The slice layer selects for each dimension a start location from within the input tensor, and copies elements to the output tensor using a stride of 1 across the input tensor.

which is not ideal in my case.
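For context, here is a rough sketch of how that split is wired up in my attention block (the variable names `num_heads`, `num_kv_heads`, and `head_size` are placeholders for illustration, not the exact names in the repo):

```python
from tensorrt_llm.functional import split

def split_qkv(qkv, num_heads, num_kv_heads, head_size):
    # qkv shape: [batch, seq_len, (num_heads + 2 * num_kv_heads) * head_size]
    attention_hidden_size = num_heads * head_size
    kv_size = num_kv_heads * head_size
    # split() is lowered to TensorRT slice layers, which copy the selected
    # elements into fresh output tensors rather than returning views.
    query, key, value = split(
        qkv, [attention_hidden_size, kv_size, kv_size], dim=2)
    return query, key, value
```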
Is there a way to get the sub-tensors without a data copy, or will TensorRT-LLM optimize away the unnecessary copies?
PyTorch has `torch.narrow` for this; is there a TensorRT-LLM equivalent?
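For comparison, this is the zero-copy view behavior I am after, illustrated with `torch.narrow` in plain PyTorch (just to show the desired semantics, not TensorRT-LLM code):

```python
import torch

qkv = torch.randn(1, 4, 12)  # toy [batch, seq, hidden] tensor

# narrow() returns a view sharing storage with qkv: no data copy.
q = qkv.narrow(dim=2, start=0, length=4)
k = qkv.narrow(dim=2, start=4, length=4)

# In-place ops on the views modify the underlying qkv buffer directly.
q.mul_(2.0)
k.add_(1.0)

assert qkv[..., :4].equal(q)  # same storage, changes are visible in qkv
```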