cvg / LightGlue

LightGlue: Local Feature Matching at Light Speed (ICCV 2023)
Apache License 2.0

Add TensorRT/OpenVINO to README.md #35

Closed fabio-sim closed 1 year ago

fabio-sim commented 1 year ago

Hi @Phil26AT @Skydes ,

I've managed to make a working TensorRT-compatible version of LightGlue, and fortunately, OpenVINO support came out of the box!

This PR adds that info to the README in case anyone would like to deploy using the aforementioned formats :)
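
For anyone curious what "out of the box" means here: the exported ONNX model can be read and compiled by OpenVINO directly. A minimal sketch below; the file name, input count, and shapes are placeholders, so check the actual exported graph for the real input signature:

```python
# Minimal OpenVINO inference sketch (placeholder file name and shapes).
import numpy as np
from openvino.runtime import Core

core = Core()
model = core.read_model("weights/superpoint_lightglue.onnx")  # placeholder path
compiled = core.compile_model(model, device_name="CPU")

# Dummy keypoints/descriptors; a real pipeline feeds the extractor's outputs.
kpts0 = np.random.rand(1, 512, 2).astype(np.float32)
kpts1 = np.random.rand(1, 512, 2).astype(np.float32)
desc0 = np.random.rand(1, 512, 256).astype(np.float32)
desc1 = np.random.rand(1, 512, 256).astype(np.float32)

# Inputs are passed positionally in graph order; adjust to the exported
# model's actual inputs.
result = compiled([kpts0, kpts1, desc0, desc1])
```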

sarlinpe commented 1 year ago

Very cool, thank you! We are very interested in a version that is easy to deploy and integrate into a C++-only environment, but that is also fast (i.e., with the adaptive mechanisms enabled). To get there, I have a few questions:

  1. According to this plot, the ONNX export is slower than the original LightGlue for >=1k keypoints, even with MP and Flash. Is it because ONNX doesn't support early stopping and adaptive pruning?
  2. Does jitting support dynamic control flows? I see that you're exporting with tracing, which might not support control flows, but scripting might? If so, could you also benchmark a jitted model without ONNX export? (A rough benchmarking sketch follows after this list.)
  3. Can you run LightGlue on TensorRT without going through ONNX? If so, could it help leverage dynamic control flows for an even faster model?
  4. Did you try to run torch.compile on the model as an additional data point?
  5. If we can get the dynamic control flow to work, we are interested in merging your changes upstream and maintaining them. What would be the minimal changes to the model required to get jit scripting or compilation to work?
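
For concreteness, something along these lines is the kind of data point I mean for 2. and 4. The import, input layout, and timing loop below are just a sketch, not a claim about how the exported models are benchmarked:

```python
# Rough timing sketch for eager vs. torch.compile (and, once the forward pass
# is scriptable, torch.jit.script) on the same inputs. Shapes, image size,
# and iteration counts are arbitrary placeholders.
import time
import torch
from lightglue import LightGlue

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
matcher = LightGlue(features="superpoint").eval().to(device)
size = torch.tensor([[640.0, 480.0]], device=device)

def random_feats():
    # Dummy SuperPoint-like features: pixel keypoints + 256-d descriptors.
    return {
        "keypoints": torch.rand(1, 1024, 2, device=device) * size,
        "descriptors": torch.rand(1, 1024, 256, device=device),
        "image_size": size,
    }

data = {"image0": random_feats(), "image1": random_feats()}

def bench(fn, iters=50, warmup=5):
    for _ in range(warmup):
        fn()
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if device.type == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1e3  # ms per call

with torch.no_grad():
    print(f"eager:    {bench(lambda: matcher(data)):.1f} ms")
    compiled = torch.compile(matcher)  # dynamo will graph-break on dynamic control flow
    print(f"compiled: {bench(lambda: compiled(data)):.1f} ms")
    # scripted = torch.jit.script(matcher)  # needs a script-friendly forward first
```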

Thanks a lot for taking the time!

fabio-sim commented 1 year ago

  1. You're correct that ONNX trace export does not support adaptive mechanisms (dynamic control flow like if-s, break-s, etc.), but the benchmarks were all on non-adaptive versions (i.e., no early stopping or adaptive pruning). I would venture a guess that the ONNX versions are slower at higher keypoint numbers due to differing operator implementations.
  2. The code path you linked (the --safe flag option) was really an experiment to test whether ONNX script export works. Normally, the default path is to let torch.onnx.export call jit.trace for ONNX trace export. Under script export, however, conditionals are exported to If operator nodes that lead to different subgraphs (creating a new subgraph for every possible branch), resulting in quite a messy ONNX graph, to say the least (a lot of warnings emitted at runtime); a toy sketch contrasting the two export paths follows after this list. I've yet to test if an early exit (break/return) is scriptable, though. When you say you would like to benchmark a jitted model without ONNX export, do you mean to benchmark TorchScript?
  3. I tried using polygraphy to directly convert the ONNX model to a TensorRT engine (see this issue https://github.com/fabio-sim/LightGlue-ONNX/issues/16 for details), but it looks impossible at the moment. I haven't gone the FX graph Torch-TensorRT route yet, but it also seems to be unsupported. In general, dynamic control flow is extremely difficult to support.
  4. I haven't had any time to test torch.compile yet, but my understanding is that it's mostly relevant for speeding up eager PyTorch execution and shouldn't have any effect during export?
  5. I think whether that's even possible is still up in the air at the moment. Basically, in order to have a shot at exporting via jit script, the forward pass of a module that contains dynamic control flow should look something like this, as well as satisfy some other signature constraints (e.g., no dict inputs, if I recall correctly); a rough sketch of such a forward pass follows after this list. Jit script is also particularly nit-picky that type hints match the actual runtime types in every called function, since it inspects the source code directly.
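
To make the trace vs. script distinction in point 2 concrete, here's a self-contained toy example (deliberately not LightGlue itself) of the two export paths:

```python
# Toy module with a data-dependent branch, analogous to an adaptive mechanism.
import torch

class EarlyExitToy(torch.nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if bool(x.mean() > 0):  # data-dependent conditional
            return x * 2.0
        return x - 1.0

model = EarlyExitToy().eval()
dummy = torch.rand(1, 8)

# Trace-based export (the default path): torch.onnx.export runs jit.trace
# internally, so only the branch taken by `dummy` ends up in the graph
# (with a TracerWarning about the tensor-to-bool conversion).
torch.onnx.export(model, (dummy,), "toy_trace.onnx")

# Script-based export: passing a ScriptModule keeps the conditional, which
# becomes an ONNX `If` node with one subgraph per branch.
torch.onnx.export(torch.jit.script(model), (dummy,), "toy_script.onnx")
```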
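
And for point 5, a rough sketch (a hypothetical toy module, not LightGlue's actual code) of the shape jit script is happy with, assuming a recent PyTorch where break is supported inside scripted loops: flat tensor arguments instead of dicts, type hints that match the runtime types, and the early exit written as a plain break in a range loop.

```python
# Hypothetical script-friendly module: typed tensor inputs/outputs, attributes
# with plain Python types, and a data-dependent early exit in a range loop.
from typing import Tuple

import torch
from torch import Tensor, nn

class AdaptiveToy(nn.Module):
    def __init__(self, dim: int = 64, n_layers: int = 4, threshold: float = 0.95):
        super().__init__()
        self.layer = nn.Linear(dim, dim)  # one shared block, for brevity
        self.confidence = nn.Linear(dim, 1)
        self.n_layers = n_layers
        self.threshold = threshold

    def forward(self, desc0: Tensor, desc1: Tensor) -> Tuple[Tensor, Tensor]:
        for _ in range(self.n_layers):
            desc0 = torch.relu(self.layer(desc0))
            desc1 = torch.relu(self.layer(desc1))
            conf = torch.sigmoid(self.confidence(desc0)).mean()
            if bool(conf > self.threshold):  # data-dependent stop condition
                break
        return desc0, desc1

scripted = torch.jit.script(AdaptiveToy())  # compiles the control flow
out0, out1 = scripted(torch.rand(1, 512, 64), torch.rand(1, 512, 64))
```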

Anyway, I'll give scripting a try to see if the adaptive mechanisms can be exported at all, but for the moment it looks like one must go through ONNXRuntime's TensorrtExecutionProvider. Thanks and I hope you find these answers helpful!
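
For reference, the TensorrtExecutionProvider route just means listing the providers in priority order when creating the session; file name and input names below are placeholders:

```python
# Minimal ONNX Runtime sketch using the TensorRT execution provider, with CUDA
# and CPU as fallbacks for any unsupported nodes.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "weights/superpoint_lightglue.onnx",  # placeholder path
    providers=[
        "TensorrtExecutionProvider",  # tried first
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ],
)

# Dummy inputs; real deployments feed keypoints/descriptors from the extractor,
# and the names must match the exported graph's input names.
inputs = {
    "kpts0": np.random.rand(1, 512, 2).astype(np.float32),
    "kpts1": np.random.rand(1, 512, 2).astype(np.float32),
    "desc0": np.random.rand(1, 512, 256).astype(np.float32),
    "desc1": np.random.rand(1, 512, 256).astype(np.float32),
}
outputs = session.run(None, inputs)  # None = fetch all graph outputs
```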

sarlinpe commented 1 year ago

Thank you very much for this very insightful reply!

  1. IIUC the only dynamic control flow in LightGlue is enabled by adaptive depth:
     https://github.com/cvg/LightGlue/blob/fe7fb4fa0cffec65e33bf4c2f62a863d5b03433a/lightglue/lightglue.py#L398-L399
     The adaptive width introduces dynamic shapes (via masking) but there should not be any dynamic if involved - only static conditions. This condition could be removed if it is detected as dynamic:
     https://github.com/cvg/LightGlue/blob/fe7fb4fa0cffec65e33bf4c2f62a863d5b03433a/lightglue/lightglue.py#L406-L407
  2. Yes, you could try to script the model and run it as is, without any export. AFAIK this is not as powerful as torch.compile (which uses Triton for graph fusion), but it can still fuse simpler sub-graphs (elementwise ops).
  3. In the future, there will be torch.export to export a compiled graph for execution in other environments. TorchScript is no longer maintained and will be deprecated once torch.export becomes mature.

From official communications, it's unclear whether torch.dynamo supports, or will later support, dynamic control flow. If not, we could instead export and compile a sub-graph for each layer and keep the early-stopping logic in the parent scope. This adds a synchronization point between layers but would still benefit from optimizations within each layer.
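
A rough sketch of that idea, with stand-in modules rather than the real LightGlue blocks: each layer is compiled as its own graph, while the loop, the confidence check, and the early exit stay in eager Python (the .item() call is the per-layer synchronization point).

```python
# Stand-in modules (not LightGlue's API): per-layer compiled sub-graphs with
# the early-stopping logic kept in the parent (eager) scope.
import torch
from torch import Tensor, nn

class ToyLayer(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, desc0: Tensor, desc1: Tensor):
        return desc0 + self.mlp(desc0), desc1 + self.mlp(desc1)

device = "cuda" if torch.cuda.is_available() else "cpu"
dim, n_layers, threshold = 256, 9, 0.95
layers = nn.ModuleList(ToyLayer(dim) for _ in range(n_layers)).to(device).eval()
confidence = nn.Linear(dim, 1).to(device).eval()

# Compile each layer independently; everything inside a layer can be fused.
compiled_layers = [torch.compile(layer) for layer in layers]

desc0 = torch.rand(1, 1024, dim, device=device)
desc1 = torch.rand(1, 1024, dim, device=device)

with torch.no_grad():
    for layer in compiled_layers:
        desc0, desc1 = layer(desc0, desc1)
        conf = torch.sigmoid(confidence(desc0)).mean()
        if conf.item() > threshold:  # sync point between layers
            break
```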