Open dongluw opened 6 months ago
Wondering, is there any update on this issue?
@QiJune would you please take a look at this question?
Wondering, is there any update on this issue?
I think using the Slice layer is correct. At the API/IR level it appears to "copy" the tensor, but in practice TRT tries to avoid the copy whenever possible.
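A minimal sketch of that approach, assuming the `split` helper comes from `tensorrt_llm.functional` (as in the snippet quoted below); at the network-definition level this adds Slice layers, but the TensorRT builder will try not to materialize the copies:

```python
# Hedged sketch: assumes tensorrt_llm.functional.split as used in the
# question; attention_hidden_size and kv_size are placeholder values.
from tensorrt_llm.functional import split

def split_qkv(qkv, attention_hidden_size, kv_size):
    # At the API/IR level each output is produced by a Slice layer,
    # which is nominally a copy of the selected region ...
    query, key, value = split(
        qkv, [attention_hidden_size, kv_size, kv_size], dim=2)
    # ... but the TensorRT builder tries to avoid the actual copy
    # whenever the surrounding graph allows it.
    return query, key, value
```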
Hi @dongluw, do you still have any further issues or questions? If not, we'll close this issue soon.
Before the attention operation, the q, k, and v tensors are stored as one big tensor `qkv`, and I would like to do some in-place operations on q and k only. Currently what I do follows the code here:
```python
query, key, value = split(qkv, [self.attention_hidden_size, kv_size, kv_size], dim=2)
```
to split the tensor, but I saw in the comment that it actually requires a memory copy: "The slice layer selects for each dimension a start location from within the input tensor, and copies elements to the output tensor using a stride of 1 across the input tensor." This is not ideal in my case. Is there a way to get the sub-tensor without a data copy, or will trt-llm optimize away the unnecessary copies?
PyTorch has torch.narrow for this; is there a tensorrt_llm equivalent?
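For context on the `torch.narrow` comparison: in plain PyTorch, `narrow` returns a view that shares storage with its parent, so no data is copied and in-place ops on the view are visible in the original tensor. A small illustration (PyTorch only, not TRT-LLM):

```python
import torch

qkv = torch.randn(2, 4, 12)      # toy [batch, seq, 3 * hidden] layout
q = qkv.narrow(2, 0, 4)          # dim=2, start=0, length=4 -> a view

# The view shares the underlying buffer, so no copy is made and
# in-place edits to q show up in qkv.
q.zero_()
assert qkv[..., :4].abs().sum() == 0
assert q.data_ptr() == qkv.data_ptr()
```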