ELS-RD / transformer-deploy

Efficient, scalable and enterprise-grade CPU/GPU inference server for 🤗 Hugging Face transformer models 🚀
https://els-rd.github.io/transformer-deploy/
Apache License 2.0

Out of memory error for batch size greater than 1 for T5 models. #60

Open · Ki6an opened 2 years ago

Ki6an commented 2 years ago

hey, first of all, thanks for creating this amazing library!

I'm following your T5 implementation with TRT: https://github.com/ELS-RD/transformer-deploy/blob/b52850dce004212225edcaa7b80fccc311398038/t5.py#L222

And I'm trying to convert the ONNX version of the T5 model to a TensorRT engine using your build_engine method: https://github.com/ELS-RD/transformer-deploy/blob/1f2d2c1d8d0239fca7679f8c550a954ea1445cfa/src/transformer_deploy/backends/trt_utils.py#L64

It works fine for a batch size of 1, but for batch sizes > 1 it takes much longer to build (almost an hour just for the t5-small encoder), and even then the engine does not build successfully; I get the following error:

[03/18/2022-12:51:55] [TRT] [E] 2: [virtualMemoryBuffer.cpp::resizePhysical::161] Error Code 2: OutOfMemory (no further information)
[03/18/2022-12:51:55] [TRT] [E] 2: [virtualMemoryBuffer.cpp::resizePhysical::161] Error Code 2: OutOfMemory (no further information)
[03/18/2022-12:51:55] [TRT] [E] 10: [optimizer.cpp::computeCosts::2011] Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[encoder.embed_tokens.weight...Mul_406]}.)
[03/18/2022-12:51:55] [TRT] [E] 2: [builder.cpp::buildSerializedNetwork::609] Error Code 2: Internal Error (Assertion enginePtr != nullptr failed. )
Traceback (most recent call last):
  File "export_onnx_to_trt.py", line 100, in <module>
    build_t5_engine(onnx_encoder_path, trt_encoder_path, [input_id_shape])
  File "export_onnx_to_trt.py", line 86, in build_t5_engine
    engine: ICudaEngine = build_engine(
  File "/app/utils.py", line 209, in build_engine
    engine: ICudaEngine = runtime.deserialize_cuda_engine(trt_engine)
TypeError: deserialize_cuda_engine(): incompatible function arguments. The following argument types are supported:
    1. (self: tensorrt.tensorrt.Runtime, serialized_engine: buffer) -> tensorrt.tensorrt.ICudaEngine

Invoked with: <tensorrt.tensorrt.Runtime object at 0x7f380bbf8930>, None

Some system info, in case it helps:

The errors say I need more GPU memory. I was wondering how much GPU memory you used for a batch size of 5? Or maybe I'm missing something?
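
For reference (this is not my exact script), a bare TensorRT build with an explicit optimization profile and a workspace limit looks roughly like the sketch below, using the standard tensorrt Python API; the input names and shape bounds are placeholders:

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)
with open("t5-small-encoder.onnx", "rb") as f:
    assert parser.parse(f.read()), parser.get_error(0)

config = builder.create_builder_config()
# give the builder enough scratch memory (8 GiB here); on recent TensorRT versions,
# use config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 8 << 30) instead
config.max_workspace_size = 8 << 30

# dynamic shapes are (batch, sequence length); the "opt" shape is the one profiled
profile = builder.create_optimization_profile()
profile.set_shape("input_ids", min=(1, 1), opt=(5, 128), max=(5, 512))
profile.set_shape("attention_mask", min=(1, 1), opt=(5, 128), max=(5, 512))
config.add_optimization_profile(profile)

# build_serialized_network returns None on failure, which is exactly what produces
# the deserialize_cuda_engine(None) TypeError in the traceback above
serialized = builder.build_serialized_network(network, config)
assert serialized is not None, "engine build failed"
engine = trt.Runtime(TRT_LOGGER).deserialize_cuda_engine(serialized)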

I would really appreciate any help, thank you!

victox5 commented 2 years ago

Thank you for the progress on T5! :)

I have been following your t5 file, and I found some points that will hopefully help. (I don't want to open another thread since t5 is not merged yet.)

I have not tried TRT yet so I cannot add to @Ki6an's post. I will update once I start playing with it. Thanks again for this great repo!

pommedeterresautee commented 2 years ago

Hi @Ki6an, I am using between 10 and 20 GB of GPU memory (working on an RTX 3090). I have never experienced issues with batch > 1. What is the flavor (size) of the model? Have you tried running the script from Docker? The image now includes Jupyter and provides up-to-date Nvidia dependencies, which may fix your problem.

@victox5 I am working on it. Basically, I am trying to build a tool that automatically sets the right precision on each node, without following a fixed pattern. Honestly it's not easy, it raises many other issues, and the new Transformer Engine of the H100 (https://blogs.nvidia.com/blog/2022/03/22/h100-transformer-engine/), announced a few hours ago, is probably a better way :-) Still, I will try to push something once it starts to be useful. On my GPU, PyTorch + cache is faster than optimized ONNX around seq len 1024... and ONNX Runtime is faster than TensorRT (!!!) on long sequences. The probable solution is to support caching in ONNX, but it will increase the memory footprint by 50% :-(
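
For the curious, the mechanics of per-node precision with the TensorRT Python API boil down to something like the sketch below (the layer-name filter is only a placeholder; the hard part is choosing automatically which nodes must stay in FP32):

import tensorrt as trt

def keep_sensitive_layers_fp32(network: trt.INetworkDefinition, config: trt.IBuilderConfig) -> None:
    # build in mixed precision, but force TensorRT to honour the per-layer constraints below
    config.set_flag(trt.BuilderFlag.FP16)
    config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        # placeholder heuristic: keep numerically sensitive nodes in full precision
        if "Softmax" in layer.name or "Pow" in layer.name:
            layer.precision = trt.float32
            for j in range(layer.num_outputs):
                layer.set_output_type(j, trt.float32)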

pommedeterresautee commented 2 years ago

@Ki6an T5 support is now official, you can check the notebook. It fixes the issue of duplicated weights. Experiments are done with batch 4 and up to 1000 tokens on T5-large. We also have a TRT script in the same folder; we keep it next to the notebook to keep the code simple.

pommedeterresautee commented 2 years ago

Closing, because recently merged work should have fixed this issue. Don't hesitate to reopen it @Ki6an.

Ki6an commented 2 years ago

@pommedeterresautee thanks! Do you plan on adding support for the Triton server?

pommedeterresautee commented 2 years ago

The T5 work requires good support of the ONNX If node, which has only recently been added to ONNX Runtime (master branch only). Triton support will be added when ONNX Runtime 1.12 is released (sometime in June) and a Triton build with the ONNX Runtime 1.12 engine is available.

Ki6an commented 2 years ago

Great! I tried T5 with cache (i.e. with past key values) on the Triton server. For every generated token, the Python backend was making lots of requests to the decoder (24 PKV + 1 logits just for t5-small), which is slowing down inference... Do you have suggestions for this, or any ideas to get around this issue? Thanks.

pommedeterresautee commented 2 years ago

The current Triton versions (including 22.05) use a version of ONNX Runtime which does not have good support for the If node. Basically, there is a bug which moves all CUDA inputs to CPU memory and back to CUDA (because of a wrong default placement policy) when those tensors are only consumed by a subgraph of the If node. It works, but it's super slow. The bug is only fixed in master of ORT. So until you have recompiled Triton with the latest ORT version, you will see bad performance because of this bug alone. You can observe this behavior with Nsight + ORT 1.11.1, for instance.

I'm not sure I understand what you mean by "lots of requests" on the Python side? It should be one per generated token, at least that's what I would expect ;-)

Ki6an commented 2 years ago

Great find! Thanks for fixing the bug.

Sorry for replying late on this.

As mentioned above, I'm trying to serve the T5 model from the Triton server. I have an encoder (ORT backend), a decoder (ORT backend) and ensemble_t5 (which uses a Python backend to preprocess the text and also to handle the Hugging Face API). I have converted the model with cache (i.e. with past key values), so my decoder takes 24 past key values, input_ids and encoder hidden states as input, and outputs 24 PKV and logits.

To generate a single token, we need to send these inputs (24 PKV + input_ids + encoder_hidden_states) to the decoder and request (24 PKV and logits) as output. As you can see, there is a lot of data movement between ensemble_t5 and the decoder, which makes the model slower. I was asking whether there is a better way to handle this and make the model fast.

thanks

pommedeterresautee commented 2 years ago

Triton uses dlpack to pass tensors from one backend to another; it's supposed to be close to cost-free (it just wraps the tensor, there is no copy). Did you measure that the slowness is caused by these transfers between backends?

Ki6an commented 2 years ago

Thanks for the response and the tip.

The execution of the ONNX model part is slow:

# BLS call from the Python backend to the ONNX decoder model
inference_request = pb_utils.InferenceRequest(
    model_name=self.model_path,
    requested_output_names=["logits"] + self.output_pkv_names,
    inputs=[input_ids, encoder_attention_mask, encoder_hidden_states]
    + input_past_key_values,
)
inference_response = inference_request.exec()

I also noticed that the following part is a little slow as well:

  # unwrap the response outputs (logits + 24 past key values) into torch tensors
  logits = T5Helper.get_output_tensors(inference_response, "logits")
  list_out_pkv = [
      T5Helper.get_output_tensors(inference_response, name)
      for name in self.output_pkv_names
  ]

Here T5Helper is:

class T5Helper:
    @staticmethod
    def get_output_tensors(inference_response, name):
        output = pb_utils.get_output_tensor_by_name(inference_response, name)
        # wrap the Triton tensor as a torch tensor via dlpack (no copy)...
        tensor = torch.from_dlpack(output.to_dlpack())
        # ...then move it to the GPU (a real host-to-device copy if the tensor is on CPU)
        return tensor.cuda()
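
For reference, a minimal way to time the two steps separately (a sketch reusing the variables from the snippets above; the synchronize calls make sure pending GPU work is actually counted):

import time

import torch

torch.cuda.synchronize()
start = time.perf_counter()
inference_response = inference_request.exec()  # BLS call to the decoder
torch.cuda.synchronize()
after_exec = time.perf_counter()
logits = T5Helper.get_output_tensors(inference_response, "logits")
torch.cuda.synchronize()
after_unwrap = time.perf_counter()
print(
    f"exec: {(after_exec - start) * 1e3:.2f} ms, "
    f"unwrap + copy: {(after_unwrap - after_exec) * 1e3:.2f} ms",
    flush=True,
)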

pommedeterresautee commented 2 years ago

Sorry to reask, just to be sure, you are using 2 decoders and 1 encoder, right?

Moreover, why do you need tensor.cuda() in get_output_tensors? It should already be on the GPU.

Is FORCE_CPU_ONLY_INPUT_TENSORS set to no? https://github.com/triton-inference-server/python_backend#input-tensor-device-placement
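
As a sanity check, a variant of the helper that only copies when needed would also make the placement visible (a sketch based on your snippet above; is_cpu() tells you where the tensor actually landed):

class T5Helper:
    @staticmethod
    def get_output_tensors(inference_response, name):
        output = pb_utils.get_output_tensor_by_name(inference_response, name)
        if output.is_cpu():
            # the tensor came back in host memory (wrong placement or ORT bug):
            # the .cuda() call below is a real host-to-device copy
            print(f"{name} returned on CPU", flush=True)
            return torch.from_dlpack(output.to_dlpack()).cuda()
        # already on the GPU: the dlpack wrap is zero-copy
        return torch.from_dlpack(output.to_dlpack())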

Also, probably not your issue, but to improve things, a correct host RAM allocation helps: https://github.com/triton-inference-server/python_backend#managing-shared-memory

Ki6an commented 2 years ago

just to be sure, you are using 2 decoders and 1 encoder, right?

yes

why do you need tensor.cuda() in get_output_tensors? It should already be on the GPU.

For some reason, it's not placing the output tensors directly on the GPU, even though I set FORCE_CPU_ONLY_INPUT_TENSORS -> no in the ensemble_t5 Python backend.

This slowdown occurs when I place the whole pipeline (i.e. tokens and model) on the GPU.

I tried keeping the model and tokens on the CPU, but this time the input part, i.e.

  # wrap the torch tensors as Triton tensors via dlpack
  input_ids = pb_utils.Tensor.from_dlpack("input_ids", torch.to_dlpack(input_ids))
  encoder_attention_mask = pb_utils.Tensor.from_dlpack(
      "encoder_attention_mask", torch.to_dlpack(attention_mask)
  )
  encoder_hidden_states = pb_utils.Tensor.from_dlpack(
      "encoder_hidden_states", torch.to_dlpack(encoder_output)
  )

  flat_past_key_values = functools.reduce(operator.iconcat, past_key_values, [])

  input_past_key_values = [
      pb_utils.Tensor.from_dlpack(name, torch.to_dlpack(tensor))
      for name, tensor in zip(self.input_pkv_names, flat_past_key_values)
  ]

is slow (roughly half as slow as the GPU case, but in an overall speed comparison with PyTorch there is no improvement).

I always pass --shm-size=1g when running the container.

pommedeterresautee commented 2 years ago

OK, I think you have found the culprit: if the tensor is provided on CPU, there is no way to get low latency. Would it be possible for you to share a minimal reproduction case (including the Triton config files)? I can try to test it / dig on my side. Maybe something super simple without any decoder, which just retrieves the output of an ONNX model in the Python backend and logs some info (like the tensor device, via pb_utils.Tensor.is_cpu()).
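
Something along these lines would already be enough (a rough sketch of a Python backend model.py; the model name and the tensor names are placeholders):

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            input_ids = pb_utils.get_input_tensor_by_name(request, "input_ids")
            print("input_ids on CPU:", input_ids.is_cpu(), flush=True)

            # BLS call to any model served by the ONNX Runtime backend
            infer_request = pb_utils.InferenceRequest(
                model_name="t5-decoder",
                requested_output_names=["logits"],
                inputs=[input_ids],
            )
            infer_response = infer_request.exec()

            logits = pb_utils.get_output_tensor_by_name(infer_response, "logits")
            print("logits on CPU:", logits.is_cpu(), flush=True)

            responses.append(pb_utils.InferenceResponse(output_tensors=[logits]))
        return responses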

Also, I would check whether there is something in the Python code which does a to("cpu"); for instance, if you have built your own custom model, you can double-check that it is moved to the right device: https://github.com/ELS-RD/transformer-deploy/blob/main/src/transformer_deploy/utils/generative_model.py#L94

pommedeterresautee commented 2 years ago

FYI, Triton 22.07 has been released. It fixes a bug where ORT tensors were always put in host memory (plus it's built with ORT 1.12.0, which also has its own memory placement bug).

Updated code for this repo (there are some subtleties to manage, not just an update of the Docker image):

https://github.com/ELS-RD/transformer-deploy/pull/116

It's in review, so I can't guarantee it works for everything yet.

Let us know if it helps regarding your issue.