NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

What is the best way to get a sub-tensor without data copy? #1540

Open dongluw opened 6 months ago

dongluw commented 6 months ago

Before the attention operation, the q/k/v tensors are stored as one big fused tensor qkv. I would like to do some in-place operations on q and k only.

Currently what I do is follow the code here and call `query, key, value = split(qkv, [self.attention_hidden_size, kv_size, kv_size], dim=2)` to split the tensor, but I saw in the comment that it actually requires a memory copy: "The slice layer selects for each dimension a start location from within the input tensor, and copies elements to the output tensor using a stride of 1 across the input tensor." That is not ideal in my case.

I am wondering whether there is a way to get the sub-tensor without a data copy, or whether trt-llm will optimize away the unnecessary copies.

PyTorch has torch.narrow for this; is there a tllm equivalent?
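
For reference, this is the kind of zero-copy view I have in mind, sketched with torch.narrow (the sizes below are placeholders, not the real model dimensions):

```python
import torch

# Placeholder sizes, not the actual model dimensions.
attention_hidden_size, kv_size = 8, 4
qkv = torch.randn(2, 3, attention_hidden_size + 2 * kv_size)

# narrow() returns views over qkv's storage: no data is copied, and
# in-place ops on q or k are visible in the fused tensor.
q = qkv.narrow(2, 0, attention_hidden_size)
k = qkv.narrow(2, attention_hidden_size, kv_size)
v = qkv.narrow(2, attention_hidden_size + kv_size, kv_size)

q.mul_(0.5)  # in-place update of the view also updates qkv
assert torch.equal(qkv[..., :attention_hidden_size], q)
```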

dongluw commented 5 months ago

Wondering if there is any update on this issue?

nv-guomingz commented 5 months ago

@QiJune would you please take a look at this question?

dongluw commented 5 months ago

Wondering if there is any update on this issue?

nvpohanh commented 1 week ago

I think using the Slice layer is correct. At the API/IR level it appears to "copy" the tensor, but in reality TRT tries to avoid the copy whenever possible.
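
To spell that out (a minimal sketch; the helper name and shapes are assumed, only the split call itself comes from the code quoted above): you can keep the existing split at the network-definition level, and the Slice layers it lowers to are candidates for copy elimination when TensorRT builds the engine.

```python
# Sketch only: helper name and shapes are assumed.
from tensorrt_llm.functional import split

def split_qkv(qkv, attention_hidden_size, kv_size):
    # qkv: [batch, seq, attention_hidden_size + 2 * kv_size]
    # At the network-definition level this emits Slice layers that look like
    # copies, but the engine builder tries to avoid executing the copy
    # whenever possible.
    query, key, value = split(
        qkv, [attention_hidden_size, kv_size, kv_size], dim=2)
    return query, key, value
```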

nv-guomingz commented 3 days ago

Hi @dongluw, do you still have any further issues or questions? If not, we'll close this soon.