NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

T5 Onnx to TensorRT conversion Error: Internal Error (encoder_hidden_states: for dimension number 2 in profile 0 does not match network definition (got min=2048, opt=2048, max=2048), expected min=opt=max=1024).) #2814

Open varunnathan opened 1 year ago

varunnathan commented 1 year ago

Description

I followed https://github.com/NVIDIA/TensorRT/blob/main/demo/HuggingFace/notebooks/t5.ipynb to convert a custom fine-tuned T5-large model into a TensorRT engine.

Points to note about the T5-large model:

- Fine-tuned with a max_sequence_length of 2048

Error obtained while running the conversion script with a max_sequence_length of 2048: "Error Code 4: Internal Error (encoder_hidden_states: for dimension number 2 in profile 0 does not match network definition (got min=2048, opt=2048, max=2048), expected min=opt=max=1024).)"

Points to note about the conversion script:

- Works as expected when I use a max_sequence_length of 1024

Environment

TensorRT Version: 8.6.0.12
NVIDIA GPU: T4-16GB
NVIDIA Driver Version: 515.65.01
CUDA Version: 12.0
CUDNN Version: 8.08
Operating System: Ubuntu 20.04.5 LTS
Python Version (if applicable): 3.8.10
Tensorflow Version (if applicable):
PyTorch Version (if applicable): 1.10.2+cu113
Baremetal or Container (if so, version):

Relevant Files

Attached is the script I am using to convert my custom fine-tuned t5-large model into TensorRT format; it is based on https://github.com/NVIDIA/TensorRT/blob/main/demo/HuggingFace/notebooks/t5.ipynb. t5_inference_with_tensorrt.py.zip

Steps To Reproduce

1. I create a SageMaker notebook instance with 16 GB of GPU memory.
2. Within the instance, I clone the TensorRT repo (https://github.com/NVIDIA/TensorRT.git) and run step 1 in the "Downloading TensorRT Build" section.
3. I build the TensorRT OSS container with `./docker/build.sh --file docker/ubuntu-20.04.Dockerfile --tag tensorrt-ubuntu20.04-cuda12.0`.
4. Then I launch the container with `./docker/launch.sh --tag tensorrt-ubuntu20.04-cuda12.0 --gpus all`.
5. I install the requirements from the TensorRT/demo/HuggingFace directory with `pip3 install -r requirements.txt`.
6. I then copy the model files from the instance into the docker container.
7. I make the following changes to the "TensorRT/demo/HuggingFace/T5/T5ModelConfig.py" file:

   ```python
   MAX_SEQUENCE_LENGTH = {
       TARGET_MODELS[0]: 512,
       TARGET_MODELS[1]: 768,
       TARGET_MODELS[2]: 2048,  # 1024 -> 2048
       TARGET_MODELS[3]: 1024,
       TARGET_MODELS[4]: 1024,
   }
   MAX_OUTPUT_LENGTH = {
       TARGET_MODELS[0]: 512,
       TARGET_MODELS[1]: 768,
       TARGET_MODELS[2]: 512,  # 1024 -> 512
       TARGET_MODELS[3]: 1024,
       TARGET_MODELS[4]: 1024,
   }
   ```

8. Then I step into an ipython environment and run the steps in the attached script (t5_inference_with_tensorrt.py).
9. I get the following error at the TRT decoder engine creation step (see the profile sketch after this list):

   ```python
   t5_trt_decoder_engine = T5DecoderONNXFile(
       os.path.join(onnx_model_path, decoder_onnx_model_fpath), metadata
   ).as_trt_engine(
       decoder_engine_name,
       profiles=[decoder_profile],
       preview_features=preview_features,
   )
   ```
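For reference, the profile construction in the attached script looks roughly like the following. This is a minimal sketch based on the t5.ipynb pattern, not a verbatim copy; `batch_size` and the exact min/opt/max splits here are illustrative assumptions.

```python
# Minimal sketch of the decoder optimization profile, assuming it follows the
# t5.ipynb pattern; batch_size and the min/opt/max splits are placeholder assumptions.
from polygraphy.backend.trt import Profile

batch_size = 1
max_sequence_length = 2048  # changed from 1024 via T5ModelConfig.py above

decoder_profile = Profile().add(
    "input_ids",
    min=(batch_size, 1),
    opt=(batch_size, max_sequence_length // 2),
    max=(batch_size, max_sequence_length),
).add(
    # Dimension 2 here must match the hidden size baked into the ONNX network
    # (d_model = 1024 for t5-large). Deriving it from max_sequence_length only
    # works while the two values coincide (1024 for t5-large); at 2048 it produces
    # the "got min=2048 ... expected min=opt=max=1024" error below.
    "encoder_hidden_states",
    min=(batch_size, 1, max_sequence_length),
    opt=(batch_size, max_sequence_length // 2, max_sequence_length),
    max=(batch_size, max_sequence_length, max_sequence_length),
)
```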

Error Message:

```
[E] 4: [network.cpp::validate::3084] Error Code 4: Internal Error (encoder_hidden_states: for dimension number 2 in profile 0 does not match network definition (got min=2048, opt=2048, max=2048), expected min=opt=max=1024).)
[!] Invalid Engine. Please ensure the engine was built correctly
```

```
PolygraphyException                       Traceback (most recent call last)
Cell In[157], line 1
----> 1 trt_engine = engine_from_network(network_definition, config=trt_inference_config)

File <string>:3, in engine_from_network(network, config, save_timing_cache)

File /usr/local/lib/python3.8/dist-packages/polygraphy/backend/base/loader.py:42, in BaseLoader.__call__(self, *args, **kwargs)
     36 """
     37 Invokes the loader by forwarding arguments to call_impl.
     38
     39 Note: call_impl should not be called directly - use this function instead.
     40 """
     41 __doc__ = self.call_impl.__doc__
---> 42 return self.call_impl(*args, **kwargs)

File /usr/local/lib/python3.8/dist-packages/polygraphy/backend/trt/loader.py:530, in EngineFromNetwork.call_impl(self)
    524 """
    525 Returns:
    526     trt.ICudaEngine: The engine that was created.
    527 """
    528 # We do not invoke super().call_impl here because we would otherwise be responsible
    529 # for freeing it's return values.
--> 530 return engine_from_bytes(super().call_impl)

File <string>:3, in engine_from_bytes(serialized_engine)

File /usr/local/lib/python3.8/dist-packages/polygraphy/backend/base/loader.py:42, in BaseLoader.__call__(self, *args, **kwargs)
     36 """
     37 Invokes the loader by forwarding arguments to call_impl.
     38
     39 Note: call_impl should not be called directly - use this function instead.
     40 """
     41 __doc__ = self.call_impl.__doc__
---> 42 return self.call_impl(*args, **kwargs)

File /usr/local/lib/python3.8/dist-packages/polygraphy/backend/trt/loader.py:554, in EngineFromBytes.call_impl(self)
    549 def call_impl(self):
    550     """
    551     Returns:
    552         trt.ICudaEngine: The deserialized engine.
    553     """
--> 554     buffer, owns_buffer = util.invoke_if_callable(self._serialized_engine)
    556     trt.init_libnvinfer_plugins(trt_util.get_trt_logger(), "")
    557     with contextlib.ExitStack() as stack, trt.Runtime(trt_util.get_trt_logger()) as runtime:

File /usr/local/lib/python3.8/dist-packages/polygraphy/util/util.py:661, in invoke_if_callable(func, *args, **kwargs)
    656 """
    657 Attempts to invoke a function with arguments. If func is not callable, then returns func.
    658 The second return value of this function indicates whether the argument was a callable.
    659 """
    660 if callable(func):
--> 661     ret = func(*args, **kwargs)
    662     return ret, True
    663 return func, False

File /usr/local/lib/python3.8/dist-packages/polygraphy/backend/trt/loader.py:488, in EngineBytesFromNetwork.call_impl(self)
    485 end_time = time.time()
    487 if not engine_bytes:
--> 488     G_LOGGER.critical("Invalid Engine. Please ensure the engine was built correctly")
    490 G_LOGGER.finish(f"Finished engine building in {end_time - start_time:.3f} seconds")
    492 if self.timing_cache_path:

File /usr/local/lib/python3.8/dist-packages/polygraphy/logger/logger.py:597, in Logger.critical(self, message)
    594 self.log(message, Logger.CRITICAL, stack_depth=3)
    595 from polygraphy.exception import PolygraphyException
--> 597 raise PolygraphyException(message) from None

PolygraphyException: Invalid Engine. Please ensure the engine was built correctly
```

My understanding of the error: the network definition is created from the ONNX model, which has 1024 as its third dimension (-1, -1, 1024), whereas I specify 2048 as the sequence length during profile creation, which is then fed into the ONNX -> TRT engine creation step. What I don't understand is why the ONNX model conversion step does not take this into account. Am I required to change anything in the HF model's config?
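One way to confirm the first half of this is to read the input shapes straight out of the exported decoder ONNX file. A minimal sketch (the path below is a placeholder for my exported decoder model):

```python
# Minimal sketch: confirm that dimension 2 of encoder_hidden_states is baked into
# the ONNX graph as 1024 (d_model), while the batch and sequence dims stay dynamic.
import onnx

model = onnx.load("decoder.onnx")  # placeholder path for the exported decoder
for inp in model.graph.input:
    if inp.name == "encoder_hidden_states":
        dims = [
            d.dim_value if d.HasField("dim_value") else -1  # -1 marks a dynamic dim
            for d in inp.type.tensor_type.shape.dim
        ]
        print(inp.name, dims)  # expected: encoder_hidden_states [-1, -1, 1024]
```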

zerollzeng commented 1 year ago

@nvpohanh ^ ^

zerollzeng commented 1 year ago

We follow the HuggingFace config: https://huggingface.co/t5-large/blob/main/config.json#L15. Maybe create a new model instead of modifying the existing one?

varunnathan commented 1 year ago

Thanks for your reply @zerollzeng. The model I am trying to convert is fine-tuned from https://huggingface.co/t5-large with a max_seq_length of 2048, and it works with a sequence length of 2048 at inference time. However, model.config.n_positions = 512.

nvluxiaoz commented 1 year ago

Hello @varunnathan, so far we do not support customized models that are not from HuggingFace. In our code, https://github.com/NVIDIA/TensorRT/blob/release/8.6/demo/HuggingFace/T5/trt.py#L133, we use the HuggingFace model config. The issue is here: https://huggingface.co/t5-large/blob/main/config.json#L7. d_model = 1024, so our TRT profile cannot be extended to 2048 unless you change that field. As @zerollzeng said, you may need to create a new model with an updated HF config instead of using the existing ones.
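To make the distinction concrete, here is a small sketch (assuming the `transformers` package is installed) showing the two config fields in play; `d_model` is the hidden size that fixes dimension 2 of `encoder_hidden_states` and is independent of the sequence length the model was fine-tuned with:

```python
# Minimal sketch: d_model (hidden size) vs. n_positions in the stock t5-large config.
from transformers import T5Config

config = T5Config.from_pretrained("t5-large")
print(config.d_model)      # 1024 -> fixes dimension 2 of encoder_hidden_states
print(config.n_positions)  # 512  -> nominal sequence length in the config
```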

varunnathan commented 1 year ago

Thanks for your suggestion @nvluxiaoz. The issue is that when the HF model is fine-tuned, its config isn't updated. Let me see if I can make the conversion work for this model by updating the value of the d_model key in its config.