NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

How to Convert Very large onnx model (macaw-11b - 40GB) into trt model? #1937

Closed · sanxchep closed this 2 years ago

sanxchep commented 2 years ago

I have an environment running Python 3.8 in NVIDIA's official Docker image and have converted the initial macaw-11b model to ONNX format. But when I try to load it and convert it to a TRT engine (code below):

`t5_trt_encoder_engine = T5EncoderONNXFile(os.path.join(onnx_model_path, encoder_onnx_model_fpath), metadata).as_trt_engine(os.path.join(tensorrt_model_path, encoder_onnx_model_fpath) + ".engine")`

`t5_trt_decoder_engine = T5DecoderONNXFile(os.path.join(onnx_model_path, decoder_onnx_model_fpath), metadata).as_trt_engine(os.path.join(tensorrt_model_path, decoder_onnx_model_fpath) + ".engine")`

It shows the following errors:

[04/19/2022-08:26:38] [TRT] [W] TensorRT was linked against cuBLAS/cuBLASLt 11.6.5 but loaded cuBLAS/cuBLASLt 11.6.1
[04/19/2022-08:38:03] [TRT] [W] Skipping tactic 0 due to Myelin error: CUDA error 2 for 1468006400-byte allocation.
[04/19/2022-08:38:03] [TRT] [E] 10: [optimizer.cpp::computeCosts::2011] Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[(Unnamed Layer* 13) [Constant] + (Unnamed Layer* 14) [Shuffle]...Mul_1395]}.)
[04/19/2022-08:38:03] [TRT] [E] 2: [builder.cpp::buildSerializedNetwork::609] Error Code 2: Internal Error (Assertion enginePtr != nullptr failed.)

We are running on a 4-GPU instance (K80s) with 64 GB of total GPU memory. But when we checked usage, only one GPU (id 0) was at 87% memory utilization, with no usage or memory consumption on the rest. Is there a way to properly parallelize this across multiple GPUs, the way a normal torch model can be?

ttyio commented 2 years ago

@sanxchep, currently we don't have native multi-GPU support. It's the user's responsibility to manage multiple GPUs, split the model, and run the subgraphs as a pipeline. In this respect, TensorRT is no different from any other CUDA application.
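
As a rough sketch of what "split the model and run the subgraphs as a pipeline" can look like in practice (not an official TensorRT multi-GPU API): each sub-model's serialized engine is deserialized with a different CUDA device made current, and the stages are then chained at inference time. The `pycuda` usage, the `load_engine_on_gpu` helper, and the engine file names below are illustrative, not from the T5 demo.

```python
import pycuda.driver as cuda
import tensorrt as trt

cuda.init()
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def load_engine_on_gpu(engine_path, gpu_id):
    """Deserialize a serialized TensorRT engine with the given GPU made current."""
    dev_ctx = cuda.Device(gpu_id).make_context()  # push a CUDA context for this GPU
    try:
        with open(engine_path, "rb") as f:
            runtime = trt.Runtime(TRT_LOGGER)
            engine = runtime.deserialize_cuda_engine(f.read())
        exec_ctx = engine.create_execution_context()
        return engine, exec_ctx, dev_ctx
    finally:
        dev_ctx.pop()  # restore the previous context; call dev_ctx.push() again before running

# Illustrative file names: encoder engine on GPU 0, decoder engine on GPU 1.
encoder = load_engine_on_gpu("t5-encoder.engine", 0)
decoder = load_engine_on_gpu("t5-decoder.engine", 1)

# At inference time: push GPU 0's context, run the encoder, copy its outputs
# to GPU 1, push GPU 1's context, run the decoder -- i.e. a two-stage pipeline.
```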

sanxchep commented 2 years ago

> @sanxchep, currently we don't have native multi-GPU support. It's the user's responsibility to manage multiple GPUs, split the model, and run the subgraphs as a pipeline. In this respect, TensorRT is no different from any other CUDA application.

@ttyio If this is the case, can you point me to helpful documentation for doing this? I've noticed the TRT docs aren't structured very well (an opinion, or maybe I don't have enough technical knowledge), so any links would be helpful.

Alternatively, is there any other implementation under TensorRT that can load a large transformer model?

ttyio commented 2 years ago

@sanxchep yes, we are working on improving the documentation ;-(

Currently we have a single-GPU T5 demo at https://github.com/NVIDIA/TensorRT/tree/main/demo/HuggingFace/T5, but sorry, there is no demo yet showing how to pipeline- or tensor-parallelize the model.
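
If the goal is just to make use of more than one GPU during engine building, one generic workaround (standard CUDA behavior, not something specific to TensorRT or the demo) is sketched below: run each build in its own process with a different device made visible via `CUDA_VISIBLE_DEVICES`. Note this only places the separate encoder and decoder builds on different GPUs; it does not split a single engine across devices, so each engine must still fit on one GPU.

```python
import os

# CUDA_VISIBLE_DEVICES is a standard CUDA environment variable; it must be set
# before CUDA (and therefore TensorRT) is initialized in this process.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # e.g. build the decoder engine on physical GPU 1

# ... then import the demo's T5DecoderONNXFile (as in the snippet above) and run:
# T5DecoderONNXFile(
#     os.path.join(onnx_model_path, decoder_onnx_model_fpath), metadata
# ).as_trt_engine(os.path.join(tensorrt_model_path, decoder_onnx_model_fpath) + ".engine")
```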

Native support for multi-node, multi-GPU is on the roadmap. Until it is supported, it may be worth trying https://github.com/NVIDIA/FasterTransformer, which supports multiple GPUs. Thanks

sanxchep commented 2 years ago

@ttyio I understand, thanks for the help! Let me see if I can churn something up!