apache / tvm

Open deep learning compiler stack for cpu, gpu and specialized accelerators
https://tvm.apache.org/
Apache License 2.0

[Bug] [Relay] [BYOC] [Tensorrt] Significant performance gap observed between direct TensorRT usage and TVM's BYOC approach. #15379

Open Civitasv opened 1 year ago

Civitasv commented 1 year ago

I'm using TVM's BYOC approach to integrate TensorRT. However, I observe a significant performance gap between the two.

First, I am using the VAE decoder model. When running it through TensorRT directly, the performance summary is as follows:

[07/21/2023-17:37:24] [I] === Performance summary ===
[07/21/2023-17:37:24] [I] Throughput: 29.5276 qps
[07/21/2023-17:37:24] [I] Latency: min = 32.5517 ms, max = 41.0814 ms, mean = 33.5797 ms, median = 33.2942 ms, percentile(90%) = 34.4155 ms, percentile(95%) = 34.7879 ms, percentile(99%) = 41.0814 ms
[07/21/2023-17:37:24] [I] Enqueue Time: min = 1.86743 ms, max = 3.68506 ms, mean = 3.11519 ms, median = 3.28667 ms, percentile(90%) = 3.34058 ms, percentile(95%) = 3.36389 ms, percentile(99%) = 3.68506 ms
[07/21/2023-17:37:24] [I] H2D Latency: min = 0.0218506 ms, max = 0.0581055 ms, mean = 0.0320333 ms, median = 0.0335693 ms, percentile(90%) = 0.0351562 ms, percentile(95%) = 0.0424805 ms, percentile(99%) = 0.0581055 ms
[07/21/2023-17:37:24] [I] GPU Compute Time: min = 32.4423 ms, max = 40.9753 ms, mean = 33.4738 ms, median = 33.1878 ms, percentile(90%) = 34.3101 ms, percentile(95%) = 34.689 ms, percentile(99%) = 40.9753 ms
[07/21/2023-17:37:24] [I] D2H Latency: min = 0.0646973 ms, max = 0.0771484 ms, mean = 0.0739099 ms, median = 0.0740356 ms, percentile(90%) = 0.0749512 ms, percentile(95%) = 0.0751953 ms, percentile(99%) = 0.0771484 ms

You can see the mean latency is about 33.6 ms.

But when I use TVM's BYOC strategy with the following script:

import tvm
from tvm import relay
from tvm.relay.op.contrib import tensorrt
from tvm.contrib import graph_executor as runtime

...
# partition supported subgraphs for TensorRT; the rest is compiled for CUDA
mod = tensorrt.partition_for_tensorrt(mod, params, target)

with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="cuda", params=params)

# load parameters and create the graph executor on the GPU
dev = tvm.cuda(0)
module_exec = runtime.GraphModule(lib["default"](dev))

print(module_exec.benchmark(dev, number=1, repeat=10))

the benchmark reports:

Execution time summary:
 mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)
  101.8611     101.3974     103.1465     101.3486      0.6918

You can see the mean latency is about 101.9 ms.

I can confirm that the TensorRT BYOC path is active; the runtime module structure looks like this:

Runtime module structure:
         Module(const_loader, 3d6d3e8)
          |- Module(llvm, d523f68)
          |- Module(tensorrt, d71b768)
          |- Module(tensorrt, d7425f8)
          |- Module(tensorrt, 36c1758)
          |- Module(tensorrt, 4ad32c8)
          |- Module(tensorrt, 3a77638)
          |- Module(tensorrt, 4aed8f8)
          |- Module(tensorrt, 3a38aa8)
          |- Module(tensorrt, 4a83748)
          |- Module(tensorrt, dcb4bf8)
          |- Module(tensorrt, 3eab758)
          |- Module(tensorrt, d712598)
          |- Module(tensorrt, dc46f48)
          |- Module(tensorrt, dc79f88)
          |- Module(tensorrt, dc4e5e8)
          |- Module(tensorrt, dca49c8)
          |- Module(tensorrt, da5cee8)
          |- Module(tensorrt, 3e32938)
          |- Module(tensorrt, db029b8)
          |- Module(tensorrt, dd11098)
          |- Module(tensorrt, dcd3938)
          |- Module(tensorrt, da4cd08)
          |- Module(tensorrt, daa4368)
          |- Module(tensorrt, 3df0448)
          |- Module(tensorrt, 3dd8858)
          |- Module(tensorrt, db07f08)
          |- Module(tensorrt, dcffc58)
          |- Module(tensorrt, d859fd8)
          |- Module(tensorrt, 3eba938)
          |- Module(tensorrt, 3ddfc88)
          |- Module(tensorrt, 3e04dd8)
          |- Module(tensorrt, d737348)
          |- Module(tensorrt, dcd4ce8)
          |- Module(tensorrt, da84218)
          |- Module(tensorrt, da9c6c8)
          |- Module(tensorrt, dcab6d8)
          |- Module(tensorrt, da73148)
          |- Module(tensorrt, daebc08)
          |- Module(tensorrt, 3e448f8)
          |- Module(tensorrt, dc61778)
          |- Module(tensorrt, dacb768)
          |- Module(tensorrt, dd3e3d8)
          |- Module(tensorrt, daabff8)

The operator profile looks like this:

Total number of operators: 85
Detail breakdown
        I.GlobalVar("tvmgen_default_tensorrt_main_172"): 1
        Op(nn.instance_norm): 30
        I.GlobalVar("tvmgen_default_tensorrt_main_162"): 1
        I.GlobalVar("tvmgen_default_tensorrt_main_152"): 1
        I.GlobalVar("tvmgen_default_tensorrt_main_142"): 1
        I.GlobalVar("tvmgen_default_tensorrt_main_139"): 1
        Op(image.resize2d): 3
        I.GlobalVar("tvmgen_default_tensorrt_main_129"): 1
        I.GlobalVar("tvmgen_default_tensorrt_main_119"): 1
        I.GlobalVar("tvmgen_default_tensorrt_main_109"): 1
        I.GlobalVar("tvmgen_default_tensorrt_main_106"): 1
        I.GlobalVar("tvmgen_default_tensorrt_main_96"): 1
        I.GlobalVar("tvmgen_default_tensorrt_main_86"): 1
        I.GlobalVar("tvmgen_default_tensorrt_main_76"): 1
        I.GlobalVar("tvmgen_default_tensorrt_main_74"): 1
        I.GlobalVar("tvmgen_default_tensorrt_main_64"): 1
        I.GlobalVar("tvmgen_default_tensorrt_main_54"): 1
        I.GlobalVar("tvmgen_default_tensorrt_main_44"): 1
        I.GlobalVar("tvmgen_default_tensorrt_main_34"): 1
        I.GlobalVar("tvmgen_default_tensorrt_main_0"): 1
        Op(reshape): 8
        I.GlobalVar("tvmgen_default_tensorrt_main_30"): 1
        I.GlobalVar("tvmgen_default_tensorrt_main_26"): 1
        Op(cast): 2
        I.GlobalVar("tvmgen_default_tensorrt_main_25"): 1
        I.GlobalVar("tvmgen_default_tensorrt_main_1"): 1
        I.GlobalVar("tvmgen_default_tensorrt_main_18"): 1
        I.GlobalVar("tvmgen_default_tensorrt_main_15"): 1
        I.GlobalVar("tvmgen_default_tensorrt_main_5"): 1
        I.GlobalVar("tvmgen_default_tensorrt_main_2"): 1
        I.GlobalVar("tvmgen_default_tensorrt_main_6"): 1
        I.GlobalVar("tvmgen_default_tensorrt_main_21"): 1
        I.GlobalVar("tvmgen_default_tensorrt_main_28"): 1
        I.GlobalVar("tvmgen_default_tensorrt_main_35"): 1
        I.GlobalVar("tvmgen_default_tensorrt_main_45"): 1
        I.GlobalVar("tvmgen_default_tensorrt_main_55"): 1
        I.GlobalVar("tvmgen_default_tensorrt_main_65"): 1
        I.GlobalVar("tvmgen_default_tensorrt_main_77"): 1
        I.GlobalVar("tvmgen_default_tensorrt_main_87"): 1
        I.GlobalVar("tvmgen_default_tensorrt_main_97"): 1
        I.GlobalVar("tvmgen_default_tensorrt_main_110"): 1
        I.GlobalVar("tvmgen_default_tensorrt_main_120"): 1
        I.GlobalVar("tvmgen_default_tensorrt_main_130"): 1
        I.GlobalVar("tvmgen_default_tensorrt_main_143"): 1
        I.GlobalVar("tvmgen_default_tensorrt_main_153"): 1
        I.GlobalVar("tvmgen_default_tensorrt_main_163"): 1
TensorRT subgraph #: 42
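
The breakdown above hints at where the fragmentation comes from: every operator the partitioner leaves behind (here `nn.instance_norm`, `image.resize2d`, `reshape`, and `cast`) splits the graph and forces a hand-off between the TVM/CUDA runtime and a TensorRT engine. A quick tally, using only the numbers from the profile above, shows the counts are consistent:

```python
# Operators the partitioner left on the TVM/CUDA side, per the profile above
fallback_ops = {
    "nn.instance_norm": 30,
    "image.resize2d": 3,
    "reshape": 8,
    "cast": 2,
}
trt_subgraphs = 42  # "TensorRT subgraph #: 42"

fallback_total = sum(fallback_ops.values())
print(fallback_total)                  # 43 operators not offloaded
print(fallback_total + trt_subgraphs)  # 85, matching "Total number of operators: 85"
```

Each fallback operator sitting between offloaded regions is a potential partition boundary, which lines up with the 42 TensorRT subgraphs reported.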

Expected behavior

I would expect the performance to be roughly equal, or even for TVM to be better.

Actual behavior

The performance gap between direct TensorRT and TVM's BYOC is large (~34 ms vs. ~102 ms mean latency).

Environment

OS: Ubuntu 20.04
TVM: latest unity branch

Steps to reproduce

I'm using the vae_decoder ONNX model.

Triage

cc @billishyahao @shingjan

twmht commented 1 year ago

@Civitasv

Any update on this?

Civitasv commented 1 year ago

I have noticed that as the model size increases, the number of subgraphs also increases, and the gap between pure TensorRT and TVM's BYOC grows with it. Of course, I might be using it incorrectly.
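
A rough back-of-envelope estimate (my own arithmetic, not from any profiler) is consistent with this: if the entire gap between the two mean latencies reported above were attributed to the 42 subgraph transitions, each hand-off would cost on the order of 1.6 ms:

```python
# Mean latencies reported earlier in this issue
trt_mean_ms = 33.58    # TensorRT directly (trtexec summary)
byoc_mean_ms = 101.86  # TVM BYOC (module_exec.benchmark)
trt_subgraphs = 42

gap_ms = byoc_mean_ms - trt_mean_ms
per_subgraph_ms = gap_ms / trt_subgraphs
print(f"gap: {gap_ms:.2f} ms, ~{per_subgraph_ms:.2f} ms per subgraph hand-off")
```

If that is what is happening, reducing the number of partitions, for example by legalizing or eliminating the unsupported operators before partitioning so they no longer split the graph, should close most of the gap.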