Ki6an opened 2 years ago
Thank you for the progress on T5! :)
I have been following your t5 file, and I found some points that can hopefully help. (I do not want to bother with another thread since t5 is not merged yet.)
When using t5-large, the tensors deviate: the assertions at https://github.com/ELS-RD/transformer-deploy/blob/b52850dce004212225edcaa7b80fccc311398038/t5.py#L62 start to fail.
The assertions pass if we use the non-optimized ONNX models, i.e. https://github.com/ELS-RD/transformer-deploy/blob/b52850dce004212225edcaa7b80fccc311398038/t5.py#L55 changed to enc_onnx = create_model_for_provider("test-enc.onnx", "CUDAExecutionProvider") and https://github.com/ELS-RD/transformer-deploy/blob/b52850dce004212225edcaa7b80fccc311398038/t5.py#L94 changed to dec_onnx = create_model_for_provider("test-dec-opt.onnx", "CUDAExecutionProvider").
If you ignore the assertions and go on with the optimized ONNX, you cannot replicate the output of vanilla PyTorch. That is expected, since the assertions were already warning us about considerable deviation from vanilla PyTorch.
The running time on a single sample with ONNX (non-optimized or optimized) is faster than vanilla PyTorch. However, when we process batches, ONNX (non-optimized or optimized) is much slower than vanilla PyTorch. If we truncate to the shortest sample, then ONNX gets faster than vanilla PyTorch again. Not sure why padding affects the running times so much (?).
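For reference, the padding effect can be checked by comparing fixed-length padding with padding only to the longest sample in the batch (a minimal sketch, assuming a Hugging Face tokenizer; the model name and inputs are just placeholders):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-large")
texts = [
    "translate English to French: hello",
    "a much longer input sentence that dominates the batch length",
]

# Pad every sample to a fixed max length: lots of padded positions for ONNX to process.
fixed = tokenizer(texts, padding="max_length", max_length=512, truncation=True, return_tensors="pt")

# Pad only to the longest sample in the batch: far fewer padded positions.
dynamic = tokenizer(texts, padding="longest", truncation=True, return_tensors="pt")

print(fixed["input_ids"].shape, dynamic["input_ids"].shape)
```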
I have not tried TRT yet so I cannot add to @Ki6an's post. I will update once I start playing with it. Thanks again for this great repo!
@Ki6an hi, I am using between 10 and 20 GB of RAM (working on an RTX 3090). I never experienced issues with batch > 1. What is the flavor (size) of the model? Have you tried to run the script from Docker? It now includes Jupyter and provides all up-to-date Nvidia dependencies, which may fix your problem.
@victox5 I am working on it. Basically, I am trying to build a tool that automatically sets the right precision on each node without following a fixed pattern. Honestly it's not easy, it raises many other issues, and the new Transformer Engine from H100 (https://blogs.nvidia.com/blog/2022/03/22/h100-transformer-engine/) announced a few hours ago is probably a better way :-) Still, I will try to push something when it starts to be useful. On my GPU, PyTorch + cache is faster than optimized ONNX around seq len 1024... and ONNX Runtime is faster than TensorRT (!!!) on long sequences. The probable solution is to support caching in ONNX, but it will increase the memory footprint by ~50% :-(
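To give an idea of the direction (just an illustration with onnxconverter-common, not the actual tool; the ONNX path and node names are placeholders): convert the graph to FP16 while keeping a block list of fragile nodes in FP32. The hard part is computing that block list automatically.

```python
import onnx
from onnxconverter_common import float16

model = onnx.load("model.onnx")  # placeholder path

# Nodes suspected to overflow/underflow in FP16 are kept in FP32.
# Finding this list automatically is the difficult part.
nodes_to_keep_fp32 = ["MatMul_123", "Pow_7"]  # placeholder node names

model_fp16 = float16.convert_float_to_float16(
    model,
    keep_io_types=True,                 # keep FP32 graph inputs/outputs
    node_block_list=nodes_to_keep_fp32, # these nodes stay in FP32
)
onnx.save(model_fp16, "model_fp16.onnx")
```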
@Ki6an the T5 support is now official, you can check the notebook. It fixes the issue of double weights. Experiments are done with batch 4 and up to 1000 tokens on T5-large. We also have a TRT script in the same folder; we keep it next to the notebook to keep the code simple.
Closing because recently merged work should have fixed this issue. Don't hesitate to reopen it @Ki6an
@pommedeterresautee thanks. Do you plan on adding support for the Triton server?
The T5 work requires good support of the If ONNX node, which has recently been added to ONNX Runtime (master branch only).
Triton support will be added when ONNX Runtime 1.12 is released (somewhere in June) and Triton ships with the ONNX Runtime 1.12 engine.
Great!
I tried T5 with cache (i.e. with past-key-values) on the Triton server. To generate every single token, the Python backend was making lots of requests to the decoder (24 pkv + 1 logits just for t5-small), which is slowing down inference... Do you have suggestions for this, or any ideas to get around this issue?
thanks
Current Triton versions (including 22.05) use a version of ONNX Runtime that does not have good support for the If node. Basically, there is a bug that moves all CUDA inputs to CPU memory and back to CUDA (because of a wrong default policy) when those tensors are only consumed by a subgraph of the If node. It works, but it's super slow. The bug is only fixed in ORT master. So until you have recompiled Triton with the latest ORT, you will see bad performance because of this bug alone. You can see this behavior through Nsight + ORT 1.11.1, for instance.
Not sure I understand what you mean by "lots of requests" on Python? It should be one per generated token, at least that's what I would expect ;-)
Great find! Thanks for fixing the bug.
Sorry for replying late on this.
As mentioned above, I'm trying to serve the T5 model from the Triton server. I have an encoder (ORT backend), a decoder (ORT backend) and ensemble_t5 (which uses a Python backend to preprocess the text and also to handle the Hugging Face API). I have converted the model with cache (i.e. with past key values), so my decoder takes 24 past-key-values, input_ids and encoder hidden states as input, and outputs 24 pkv and logits.
To generate a single token, we need to send these inputs (24 pkv + input_ids + encoder_hidden_states) to the decoder and request (24 pkv and logits) as output. As you can see, there is a lot of data movement between ensemble_t5 and the decoder, which is making the model slower. I was asking if there is a better way to handle this and make the model faster.
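(For context on where the 24 comes from: t5-small has 6 decoder blocks, and each block carries key/value tensors for both its self-attention and its cross-attention over the encoder output, so 6 * 2 * 2 = 24. A small sketch with a hypothetical naming scheme:)

```python
# t5-small: 6 decoder blocks, each with self-attention and cross-attention,
# each attention contributing a key and a value tensor -> 6 * 2 * 2 = 24 tensors.
num_decoder_blocks = 6
pkv_names = [
    f"past_key_values.{block}.{attention}.{kv}"  # hypothetical naming scheme
    for block in range(num_decoder_blocks)
    for attention in ("self", "cross")
    for kv in ("key", "value")
]
assert len(pkv_names) == 24
```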
thanks
Triton uses DLPack to pass tensors from one backend to another; it's supposed to be close to cost-free (it just wraps the tensor, there is no copy). Did you measure that the slowness is caused by these transfers between backends?
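A quick way to check would be to time the wrap itself, separately from the exec() call (a minimal sketch; inference_response is assumed to come from inference_request.exec() as in your backend code):

```python
import time

import torch
import triton_python_backend_utils as pb_utils

# inference_response is assumed to come from inference_request.exec(),
# exactly as in the backend code posted below.
start = time.perf_counter()
output = pb_utils.get_output_tensor_by_name(inference_response, "logits")
logits = torch.from_dlpack(output.to_dlpack())  # should be a zero-copy wrap
elapsed_ms = (time.perf_counter() - start) * 1e3
print(f"dlpack wrap: {elapsed_ms:.3f} ms, tensor on CPU: {output.is_cpu()}")
```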
Thanks for the response and the tip.
The execution of the ONNX model part is slow:
# BLS call from the ensemble_t5 Python backend to the ONNX decoder model
inference_request = pb_utils.InferenceRequest(
    model_name=self.model_path,
    requested_output_names=["logits"] + self.output_pkv_names,
    inputs=[input_ids, encoder_attention_mask, encoder_hidden_states]
    + input_past_key_values,
)
inference_response = inference_request.exec()
I also noticed that the following part is a little slow...
logits = T5Helper.get_output_tensors(inference_response, "logits")
list_out_pkv = [
    T5Helper.get_output_tensors(inference_response, name)
    for name in self.output_pkv_names
]
Here T5Helper is:
class T5Helper:
    @staticmethod
    def get_output_tensors(inference_response, name):
        output = pb_utils.get_output_tensor_by_name(inference_response, name)
        tensor = torch.from_dlpack(output.to_dlpack())
        return tensor.cuda()
Sorry to re-ask, just to be sure: you are using 2 decoders and 1 encoder, right?
Moreover, why do you need tensor.cuda() in get_output_tensors? It should already be on GPU.
Is FORCE_CPU_ONLY_INPUT_TENSORS set to no?
https://github.com/triton-inference-server/python_backend#input-tensor-device-placement
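If the tensor is really already on GPU, a small variation of your T5Helper makes the placement visible instead of silently copying (just a sketch, using the pb_utils.Tensor.is_cpu() helper documented in the link above):

```python
import torch
import triton_python_backend_utils as pb_utils


class T5Helper:
    @staticmethod
    def get_output_tensors(inference_response, name):
        output = pb_utils.get_output_tensor_by_name(inference_response, name)
        tensor = torch.from_dlpack(output.to_dlpack())
        if output.is_cpu():
            # The backend handed over a host tensor: this copy is the expensive part.
            print(f"[warn] {name} arrived on CPU, copying it to GPU")
            tensor = tensor.cuda()
        return tensor
```

If the log shows the tensors arriving on CPU, the copy in the if branch is where the time goes.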
Also, probably not your issue, but to improve things, a correct host RAM allocation helps: https://github.com/triton-inference-server/python_backend#managing-shared-memory
just to be sure, you are using 2 decoders and 1 encoder, right?
yes
why do you need tensor.cuda() in get_output_tensors ? It should already be on GPU.
for some reason, it's not placing the output tensors directly on GPU even though I set FORCE_CPU_ONLY_INPUT_TENSORS -> no in the ensemble_t5 python backend.
This slowdown occurs when I place the whole pipeline (i.e. tokens and model) on the GPU.
I tried keeping the model and tokens on the CPU, but this time the input part, i.e.
input_ids = pb_utils.Tensor.from_dlpack("input_ids", torch.to_dlpack(input_ids))
encoder_attention_mask = pb_utils.Tensor.from_dlpack(
    "encoder_attention_mask", torch.to_dlpack(attention_mask)
)
encoder_hidden_states = pb_utils.Tensor.from_dlpack(
    "encoder_hidden_states", torch.to_dlpack(encoder_output)
)
flat_past_key_values = functools.reduce(operator.iconcat, past_key_values, [])
input_past_key_values = [
    pb_utils.Tensor.from_dlpack(name, torch.to_dlpack(tensor))
    for name, tensor in zip(self.input_pkv_names, flat_past_key_values)
]
is slow (almost half as slow as before, but in an overall speed comparison with PyTorch it's no improvement).
I'm always keeping --shm-size=1g when running the container.
Ok, I think you have found the culprit: if the tensor is provided on CPU, there is no way to get low latency.
Would it be possible for you to share a minimal reproduction case (including Triton config files)? I can try to test it and dig on my side.
Maybe something super simple without any decoder: just retrieve the output of an ONNX model in Python and log some info (like the tensor device via pb_utils.Tensor.is_cpu()).
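Something along these lines would be enough, a rough sketch of a Python backend model.py (the model name "my-onnx-model" and the tensor names are placeholders, not the real ones from your setup):

```python
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            input_ids = pb_utils.get_input_tensor_by_name(request, "input_ids")
            print(f"input_ids on CPU: {input_ids.is_cpu()}")

            # BLS call to a single ONNX model; "my-onnx-model" and the tensor
            # names are placeholders for your own config.
            infer_request = pb_utils.InferenceRequest(
                model_name="my-onnx-model",
                requested_output_names=["output"],
                inputs=[input_ids],
            )
            infer_response = infer_request.exec()
            output = pb_utils.get_output_tensor_by_name(infer_response, "output")
            print(f"output on CPU: {output.is_cpu()}")

            responses.append(pb_utils.InferenceResponse(output_tensors=[output]))
        return responses
```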
Also, I would check if there is something in the Python code that does a to("cpu"); for instance, if you have built your own custom model, you can double-check that it's moved to the right device: https://github.com/ELS-RD/transformer-deploy/blob/main/src/transformer_deploy/utils/generative_model.py#L94
FYI, Triton 22.07 has been released. It fixes a bug where ORT tensors were always put in host memory (plus it's built with ORT 1.12.0, which also had its own memory placement bug).
Updated code of this repo (there are some subtleties to manage, not just an update of the docker image):
https://github.com/ELS-RD/transformer-deploy/pull/116
It's in review, so I can't guarantee it works for everything yet.
Let us know if it helps with your issue.
hey, first of all, thanks for creating this amazing library!
I'm following your T5 implementation with TRT: https://github.com/ELS-RD/transformer-deploy/blob/b52850dce004212225edcaa7b80fccc311398038/t5.py#L222
And I'm trying to convert the ONNX version of the T5 model to a TensorRT engine using your build_engine method: https://github.com/ELS-RD/transformer-deploy/blob/1f2d2c1d8d0239fca7679f8c550a954ea1445cfa/src/transformer_deploy/backends/trt_utils.py#L64
It works fine for a batch size of 1, but for batch size > 1 it takes much longer to build (almost an hour just for the t5-small encoder), and even after that it does not build the model successfully and fails with an error.
Some system info, if that helps:
trt+cuda - 8.2.1-1+cuda11.4
os - ubuntu 20.04.3
gpu - T4 with 15GB memory
The errors say I need more GPU memory. I was wondering how much GPU memory you used for a batch size of 5? Or maybe I'm missing something?
I would really appreciate any help, thank you!
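In case it helps to pin down what I might be missing: my understanding is that the main knobs for batch > 1 are the builder workspace size and the min/opt/max shapes of the optimization profile. A rough sketch with the raw TensorRT Python API (8.2-style; the paths, tensor names and shapes are placeholders, this is not your actual build_engine):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("t5-small-encoder.onnx", "rb") as f:  # placeholder path
    assert parser.parse(f.read()), "ONNX parsing failed"

config = builder.create_builder_config()
config.max_workspace_size = 4 << 30  # 4 GiB, the TRT 8.2 way of capping builder memory

# One profile covering batch 1..5 and the sequence lengths actually used
# ("input_ids"/"attention_mask" and the shapes are placeholders).
profile = builder.create_optimization_profile()
profile.set_shape("input_ids", (1, 1), (5, 128), (5, 512))       # min, opt, max
profile.set_shape("attention_mask", (1, 1), (5, 128), (5, 512))  # min, opt, max
config.add_optimization_profile(profile)

serialized_engine = builder.build_serialized_network(network, config)
with open("t5-small-encoder.plan", "wb") as f:
    f.write(serialized_engine)
```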