Open Civitasv opened 1 year ago
I'm using TVM's BYOC approach to integrate TVM and TensorRT, but I observe a significant performance gap between them.
First, I benchmark the VAE decoder model with TensorRT directly; the performance summary is as follows:
```
[07/21/2023-17:37:24] [I] === Performance summary ===
[07/21/2023-17:37:24] [I] Throughput: 29.5276 qps
[07/21/2023-17:37:24] [I] Latency: min = 32.5517 ms, max = 41.0814 ms, mean = 33.5797 ms, median = 33.2942 ms, percentile(90%) = 34.4155 ms, percentile(95%) = 34.7879 ms, percentile(99%) = 41.0814 ms
[07/21/2023-17:37:24] [I] Enqueue Time: min = 1.86743 ms, max = 3.68506 ms, mean = 3.11519 ms, median = 3.28667 ms, percentile(90%) = 3.34058 ms, percentile(95%) = 3.36389 ms, percentile(99%) = 3.68506 ms
[07/21/2023-17:37:24] [I] H2D Latency: min = 0.0218506 ms, max = 0.0581055 ms, mean = 0.0320333 ms, median = 0.0335693 ms, percentile(90%) = 0.0351562 ms, percentile(95%) = 0.0424805 ms, percentile(99%) = 0.0581055 ms
[07/21/2023-17:37:24] [I] GPU Compute Time: min = 32.4423 ms, max = 40.9753 ms, mean = 33.4738 ms, median = 33.1878 ms, percentile(90%) = 34.3101 ms, percentile(95%) = 34.689 ms, percentile(99%) = 40.9753 ms
[07/21/2023-17:37:24] [I] D2H Latency: min = 0.0646973 ms, max = 0.0771484 ms, mean = 0.0739099 ms, median = 0.0740356 ms, percentile(90%) = 0.0749512 ms, percentile(95%) = 0.0751953 ms, percentile(99%) = 0.0771484 ms
```
You can see the mean latency is about 34 ms.
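For clarity on how I read the log, a small illustrative snippet that pulls the mean GPU compute time out of the trtexec summary (the line content is copied from the log above):

```python
import re

# Extract the mean GPU compute time from a trtexec summary line
# (text copied from the performance summary above).
line = ("[I] GPU Compute Time: min = 32.4423 ms, max = 40.9753 ms, "
        "mean = 33.4738 ms, median = 33.1878 ms")
mean_ms = float(re.search(r"mean = ([\d.]+) ms", line).group(1))
print(mean_ms)  # 33.4738
```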
But when I use TVM's BYOC strategy with the following script:
```python
from tvm.relay.op.contrib import tensorrt

...

mod = tensorrt.partition_for_tensorrt(mod, params, target)
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="cuda", params=params)

# load parameters
dev = tvm.cuda(0)
module_exec = runtime.GraphModule(lib["default"](dev))
print(module_exec.benchmark(dev, number=1, repeat=10))
```
it shows:
```
Execution time summary:
 mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)
 101.8611    101.3974       103.1465     101.3486     0.6918
```
You can see the mean latency is about 101 ms.
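Putting the two mean latencies side by side (numbers taken from the summaries above), the BYOC path is roughly 3x slower:

```python
# Mean latencies reported above: trtexec vs. TVM BYOC.
trt_mean_ms = 33.5797
byoc_mean_ms = 101.8611

slowdown = byoc_mean_ms / trt_mean_ms
print(f"{slowdown:.2f}x")  # 3.03x
```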
And I can confirm the TensorRT BYOC path is active; the runtime module structure looks like:
```
Runtime module structure:
Module(const_loader, 3d6d3e8)
|- Module(llvm, d523f68)
|- Module(tensorrt, d71b768)
|- Module(tensorrt, d7425f8)
|- Module(tensorrt, 36c1758)
|- Module(tensorrt, 4ad32c8)
|- Module(tensorrt, 3a77638)
|- Module(tensorrt, 4aed8f8)
|- Module(tensorrt, 3a38aa8)
|- Module(tensorrt, 4a83748)
|- Module(tensorrt, dcb4bf8)
|- Module(tensorrt, 3eab758)
|- Module(tensorrt, d712598)
|- Module(tensorrt, dc46f48)
|- Module(tensorrt, dc79f88)
|- Module(tensorrt, dc4e5e8)
|- Module(tensorrt, dca49c8)
|- Module(tensorrt, da5cee8)
|- Module(tensorrt, 3e32938)
|- Module(tensorrt, db029b8)
|- Module(tensorrt, dd11098)
|- Module(tensorrt, dcd3938)
|- Module(tensorrt, da4cd08)
|- Module(tensorrt, daa4368)
|- Module(tensorrt, 3df0448)
|- Module(tensorrt, 3dd8858)
|- Module(tensorrt, db07f08)
|- Module(tensorrt, dcffc58)
|- Module(tensorrt, d859fd8)
|- Module(tensorrt, 3eba938)
|- Module(tensorrt, 3ddfc88)
|- Module(tensorrt, 3e04dd8)
|- Module(tensorrt, d737348)
|- Module(tensorrt, dcd4ce8)
|- Module(tensorrt, da84218)
|- Module(tensorrt, da9c6c8)
|- Module(tensorrt, dcab6d8)
|- Module(tensorrt, da73148)
|- Module(tensorrt, daebc08)
|- Module(tensorrt, 3e448f8)
|- Module(tensorrt, dc61778)
|- Module(tensorrt, dacb768)
|- Module(tensorrt, dd3e3d8)
|- Module(tensorrt, daabff8)
```
And the operator profile looks like:
```
Total number of operators: 85
Detail breakdown
I.GlobalVar("tvmgen_default_tensorrt_main_172"): 1
Op(nn.instance_norm): 30
I.GlobalVar("tvmgen_default_tensorrt_main_162"): 1
I.GlobalVar("tvmgen_default_tensorrt_main_152"): 1
I.GlobalVar("tvmgen_default_tensorrt_main_142"): 1
I.GlobalVar("tvmgen_default_tensorrt_main_139"): 1
Op(image.resize2d): 3
I.GlobalVar("tvmgen_default_tensorrt_main_129"): 1
I.GlobalVar("tvmgen_default_tensorrt_main_119"): 1
I.GlobalVar("tvmgen_default_tensorrt_main_109"): 1
I.GlobalVar("tvmgen_default_tensorrt_main_106"): 1
I.GlobalVar("tvmgen_default_tensorrt_main_96"): 1
I.GlobalVar("tvmgen_default_tensorrt_main_86"): 1
I.GlobalVar("tvmgen_default_tensorrt_main_76"): 1
I.GlobalVar("tvmgen_default_tensorrt_main_74"): 1
I.GlobalVar("tvmgen_default_tensorrt_main_64"): 1
I.GlobalVar("tvmgen_default_tensorrt_main_54"): 1
I.GlobalVar("tvmgen_default_tensorrt_main_44"): 1
I.GlobalVar("tvmgen_default_tensorrt_main_34"): 1
I.GlobalVar("tvmgen_default_tensorrt_main_0"): 1
Op(reshape): 8
I.GlobalVar("tvmgen_default_tensorrt_main_30"): 1
I.GlobalVar("tvmgen_default_tensorrt_main_26"): 1
Op(cast): 2
I.GlobalVar("tvmgen_default_tensorrt_main_25"): 1
I.GlobalVar("tvmgen_default_tensorrt_main_1"): 1
I.GlobalVar("tvmgen_default_tensorrt_main_18"): 1
I.GlobalVar("tvmgen_default_tensorrt_main_15"): 1
I.GlobalVar("tvmgen_default_tensorrt_main_5"): 1
I.GlobalVar("tvmgen_default_tensorrt_main_2"): 1
I.GlobalVar("tvmgen_default_tensorrt_main_6"): 1
I.GlobalVar("tvmgen_default_tensorrt_main_21"): 1
I.GlobalVar("tvmgen_default_tensorrt_main_28"): 1
I.GlobalVar("tvmgen_default_tensorrt_main_35"): 1
I.GlobalVar("tvmgen_default_tensorrt_main_45"): 1
I.GlobalVar("tvmgen_default_tensorrt_main_55"): 1
I.GlobalVar("tvmgen_default_tensorrt_main_65"): 1
I.GlobalVar("tvmgen_default_tensorrt_main_77"): 1
I.GlobalVar("tvmgen_default_tensorrt_main_87"): 1
I.GlobalVar("tvmgen_default_tensorrt_main_97"): 1
I.GlobalVar("tvmgen_default_tensorrt_main_110"): 1
I.GlobalVar("tvmgen_default_tensorrt_main_120"): 1
I.GlobalVar("tvmgen_default_tensorrt_main_130"): 1
I.GlobalVar("tvmgen_default_tensorrt_main_143"): 1
I.GlobalVar("tvmgen_default_tensorrt_main_153"): 1
I.GlobalVar("tvmgen_default_tensorrt_main_163"): 1
TensorRT subgraph #: 42
```
I would expect the performance to be roughly equal, or even better with TVM.
The performance gap between TensorRT and TVM's BYOC is huge.
- OS: Ubuntu 20.04
- TVM: latest unity branch
I'm using the vae_decoder ONNX model.
cc @billishyahao @shingjan
@Civitasv
Any update on this?
I have noticed that as the model size increases, the number of subgraphs also increases, and the gap between standalone TensorRT and TVM's BYOC grows accordingly. Of course, I might be using it incorrectly.
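To illustrate why the subgraph count might matter, here is a back-of-the-envelope model (the per-subgraph overhead value is purely hypothetical, not measured): if each TensorRT subgraph pays a fixed host-side launch/copy cost, total latency grows linearly with the number of subgraphs.

```python
def modeled_latency_ms(compute_ms, n_subgraphs, overhead_ms):
    """Total latency if each subgraph adds a fixed host-side overhead."""
    return compute_ms + n_subgraphs * overhead_ms

# With the ~34 ms pure-TensorRT compute time and the 42 subgraphs seen
# above, a hypothetical ~1.6 ms per-subgraph overhead would roughly
# reproduce the ~101 ms observed via BYOC.
print(modeled_latency_ms(34.0, 42, 1.6))  # roughly 101 ms
```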