NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
Apache License 2.0

How to Convert Very large onnx model (macaw-11b - 40GB) into trt model? #1937

Closed: sanxchep closed this issue 2 years ago

sanxchep commented 2 years ago

I have an environment running Python 3.8 in NVIDIA's official Docker container and have converted the initial Macaw 11B model to ONNX format. But when I try to load it and convert it to a TRT model (code below):

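```python
import os

# onnx_model_path, tensorrt_model_path, the *_fpath names, and metadata
# are defined elsewhere (not shown in the post).

# Build the TensorRT engine for the T5 encoder from its ONNX file.
t5_trt_encoder_engine = T5EncoderONNXFile(
    os.path.join(onnx_model_path, encoder_onnx_model_fpath), metadata
).as_trt_engine(os.path.join(tensorrt_model_path, encoder_onnx_model_fpath) + ".engine")

# Build the TensorRT engine for the T5 decoder from its ONNX file.
t5_trt_decoder_engine = T5DecoderONNXFile(
    os.path.join(onnx_model_path, decoder_onnx_model_fpath), metadata
).as_trt_engine(os.path.join(tensorrt_model_path, decoder_onnx_model_fpath) + ".engine")
```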

It shows the error:

```
[04/19/2022-08:26:38] [TRT] [W] TensorRT was linked against cuBLAS/cuBLASLt 11.6.5 but loaded cuBLAS/cuBLASLt 11.6.1
[04/19/2022-08:38:03] [TRT] [W] Skipping tactic 0 due to Myelin error: CUDA error 2 for 1468006400-byte allocation.
[04/19/2022-08:38:03] [TRT] [E] 10: [optimizer.cpp::computeCosts::2011] Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[(Unnamed Layer* 13) [Constant] + (Unnamed Layer* 14) [Shuffle]...Mul_1395]}.)
[04/19/2022-08:38:03] [TRT] [E] 2: [builder.cpp::buildSerializedNetwork::609] Error Code 2: Internal Error (Assertion enginePtr != nullptr failed. )
```

We are running on a 4-GPU instance (K80s) with 64 GB of total GPU memory. But when we checked the usage, only one GPU (id 0) showed memory usage of 87%, with no usage or memory consumption on the rest. Is there a way to properly parallelize it across multiple GPUs, the way normal torch imports do?
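
For readers following along, the numbers in this post can be put side by side with a quick back-of-the-envelope check. This is only a rough sketch: it assumes the 64 GB is split evenly across the four GPUs and treats "GB" as GiB, neither of which the post states. It also counts weights only, ignoring builder workspace and activations. CUDA error 2 in the log is `cudaErrorMemoryAllocation`, i.e. an out-of-memory failure on the device.

```python
# Rough memory budget check using the figures quoted in the post.
GIB = 1024 ** 3
MIB = 1024 ** 2

model_bytes = 40 * GIB        # model size from the issue title (~40 GB)
total_gpu_bytes = 64 * GIB    # total GPU memory quoted in the post
num_gpus = 4

# Assumption (not stated in the post): memory is split evenly across GPUs.
per_gpu_bytes = total_gpu_bytes // num_gpus

# Size of the allocation that failed, taken from the "CUDA error 2" log line.
failed_alloc_bytes = 1_468_006_400


def fits_on_one_gpu(model_bytes: int, per_gpu_bytes: int) -> bool:
    """True if the weights alone would fit on a single device."""
    return model_bytes <= per_gpu_bytes


def min_gpus_for_weights(model_bytes: int, per_gpu_bytes: int) -> int:
    """Smallest device count whose combined memory holds the weights."""
    return -(-model_bytes // per_gpu_bytes)  # ceiling division


if __name__ == "__main__":
    print("per-GPU memory (GiB):", per_gpu_bytes / GIB)
    print("fits on one GPU:", fits_on_one_gpu(model_bytes, per_gpu_bytes))
    print("min GPUs for weights:", min_gpus_for_weights(model_bytes, per_gpu_bytes))
    print("failed allocation (MiB):", failed_alloc_bytes / MIB)
```

Under these assumptions the weights alone exceed what a single GPU holds, which is consistent with the build running out of memory while using only GPU 0.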

ttyio commented 2 years ago

@sanxchep, currently we don't have native multi-GPU support. It's the user's responsibility to manage multiple GPUs, split the model, and run the subgraphs in a pipeline. In this respect TensorRT is no different from any other CUDA application.
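
To make "split the model and run the subgraphs in a pipeline" a little more concrete, here is a minimal sketch of the planning step only. It is not a TensorRT API: `partition_layers` is a hypothetical helper, the per-layer sizes are made-up numbers, and the greedy strategy is just one simple way to group consecutive layers so that each group fits a per-device memory budget. Each resulting stage would then be exported and built as its own engine on its own GPU, with the output of one stage fed to the next.

```python
from typing import List


def partition_layers(layer_sizes: List[int], budget: int) -> List[List[int]]:
    """Greedily group consecutive layers into pipeline stages.

    layer_sizes: memory footprint of each layer, in any consistent unit.
    budget: memory available per device, in the same unit.
    Returns a list of stages, each a list of layer indices.
    """
    stages: List[List[int]] = []
    current: List[int] = []
    used = 0
    for idx, size in enumerate(layer_sizes):
        if size > budget:
            raise ValueError(f"layer {idx} ({size}) exceeds the per-device budget ({budget})")
        # Start a new stage when adding this layer would overflow the budget.
        if current and used + size > budget:
            stages.append(current)
            current, used = [], 0
        current.append(idx)
        used += size
    if current:
        stages.append(current)
    return stages


if __name__ == "__main__":
    # Hypothetical example: eight blocks of 5 GB each, 16 GB per device.
    print(partition_layers([5] * 8, budget=16))
```

A real split also has to leave headroom for activations and builder workspace, so the budget passed in should be well below the physical memory of the device.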

sanxchep commented 2 years ago

@ttyio If this is the case, can you direct me towards helpful documentation to do the same? I've noticed the TRT docs aren't structured properly (an opinion, or maybe I don't have enough technical knowledge). So any links would be helpful.

Or else, are there any other implementations under TensorRT that have loaded a large transformer model?

ttyio commented 2 years ago

@sanxchep Yes, we are working on improving the documentation ;-(

Currently we have a single-GPU T5 demo at https://github.com/NVIDIA/TensorRT/tree/main/demo/HuggingFace/T5, but sorry, there is no demo showing how to pipeline- or tensor-parallelize the model yet.

Native support for multi-node, multi-GPU is planned. Until it is supported, it may be worth trying https://github.com/NVIDIA/FasterTransformer, which supports multiple GPUs. Thanks
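
For anyone unfamiliar with the "tensor parallel" option mentioned in this comment, the idea can be sketched in plain Python: instead of assigning whole layers to different GPUs (pipeline parallelism), a single weight matrix is split by columns, each shard's product is computed separately, and the partial outputs are concatenated. The helper names below are hypothetical and nothing here uses TensorRT or FasterTransformer. In a real setup each shard would live on its own device.

```python
from typing import List

Matrix = List[List[int]]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Plain matrix product: x is (m x k), w is (k x n)."""
    n = len(w[0])
    return [[sum(row[t] * w[t][j] for t in range(len(w))) for j in range(n)] for row in x]


def split_columns(w: Matrix, parts: int) -> List[Matrix]:
    """Split w into `parts` column shards of near-equal width."""
    n = len(w[0])
    bounds = [n * p // parts for p in range(parts + 1)]
    return [[row[bounds[p]:bounds[p + 1]] for row in w] for p in range(parts)]


def column_parallel_matmul(x: Matrix, w: Matrix, parts: int) -> Matrix:
    """Compute x @ w shard by shard, then concatenate the column blocks."""
    partials = [matmul(x, shard) for shard in split_columns(w, parts)]
    # Row i of the result is row i of every partial output, joined in order.
    return [[v for partial in partials for v in partial[i]] for i in range(len(x))]


if __name__ == "__main__":
    x = [[1, 2], [3, 4]]
    w = [[1, 0, 2, 1], [0, 1, 1, 3]]
    assert column_parallel_matmul(x, w, parts=2) == matmul(x, w)
    print(column_parallel_matmul(x, w, parts=2))
```

The sharded result matches the unsharded product, which is why each device only needs to hold its own slice of the weights.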

sanxchep commented 2 years ago

@ttyio I understand, thanks for the help! Let me see if I can churn up something!