VivekPanyam / carton

Run any ML model from any programming language.
https://carton.run
Apache License 2.0

ONNX support #165

Open · VivekPanyam opened 11 months ago

VivekPanyam commented 11 months ago

There are many different ways of running an ONNX model from Rust:

tract

"Tiny, no-nonsense, self-contained, Tensorflow and ONNX inference".

Notes:

wonnx

"A WebGPU-accelerated ONNX inference run-time written 100% in Rust, ready for native and the web"

Notes:

ort

"A Rust wrapper for ONNX Runtime"

Notes:
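For a rough sense of the API, loading a model with ort looks something like this. This is a minimal sketch against the ort 1.x API discussed in this thread; names and signatures differ in other releases, so treat it as illustrative:

```rust
// A minimal sketch against the ort 1.x API; illustrative only, since names
// and signatures differ between ort releases.
use std::sync::Arc;
use ort::{Environment, SessionBuilder};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // One environment is shared across sessions.
    let environment = Arc::new(Environment::builder().with_name("onnx-runner").build()?);

    // Load a model from disk; execution providers would also be configured here.
    let session = SessionBuilder::new(&environment)?
        .with_model_from_file("model.onnx")?;

    // Inputs are passed as `ort::Value`s; constructing them from ndarray is
    // version-specific, so inference is elided here:
    // let outputs = session.run(inputs)?;
    let _ = session;
    Ok(())
}
```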

If we're going to have one "official" ONNX runner, it should probably use ort. Unfortunately, since ort doesn't have WASM support, we need another solution for running from WASM environments.

This could be:

@kali @pixelspark @decahedron1 If you get a chance, I'd really appreciate any thoughts you have on the above. Thank you!

decahedron1 commented 11 months ago

FWIW, I recently pushed ort v1.15.5 which adds support for WASM.

VivekPanyam commented 11 months ago

Oh wow, that's great! I see you landed https://github.com/pykeio/ort/commit/092907a5d2ddd7ec4c9340d2cddc04cb293e003d a bit after I created this issue :)

I'm working on getting the GPT2 example working from WASM and I'll comment with how it goes!

Is there a WebGPU or WebGL execution provider btw?

The ONNX Runtime website says:

> you have the option to use webgl or webgpu for GPU processing, and WebAssembly (wasm, alias to cpu) for CPU processing. All ONNX operators are supported by WASM but only a subset are currently supported by WebGL and WebGPU.

decahedron1 commented 11 months ago

> you have the option to use webgl or webgpu for GPU processing, and WebAssembly (wasm, alias to cpu) for CPU processing. All ONNX operators are supported by WASM but only a subset are currently supported by WebGL and WebGPU.

I couldn't find any documentation on how to actually use either backend. I think it may be automatically available just by compiling with --use_jsep but I'm not sure. I'll keep looking into it.

VivekPanyam commented 11 months ago

Thanks! It looks like a WASM build with 1.15.5 fails:

  1. close_lib_handle is not defined for WASM

https://github.com/pykeio/ort/blob/bca00dc96d8e6fd047fa44ebd5c5287517ed0af1/src/session.rs#L759-L767

  2. std::os::unix::ffi::OsStrExt doesn't exist on WASM

https://github.com/pykeio/ort/blob/bca00dc96d8e6fd047fa44ebd5c5287517ed0af1/src/session.rs#L5-L6

I am using wasm32-unknown-unknown though. I don't believe using wasm32-wasi would fix it, but I noticed you're building with Emscripten. Does that work?
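For reference, the usual way to make items like these compile on wasm32 is to gate them behind `cfg` attributes. Here's a hypothetical sketch (not ort's actual code) of what that gating could look like:

```rust
// Hypothetical sketch, not ort's actual code: gating platform-specific items
// so the crate still compiles for wasm32 targets.
#[cfg(unix)]
#[allow(unused_imports)]
use std::os::unix::ffi::OsStrExt;

#[cfg(not(target_arch = "wasm32"))]
fn close_lib_handle(handle: *mut std::ffi::c_void) {
    // On native targets, release the dynamically loaded library handle here.
    let _ = handle;
}

#[cfg(target_arch = "wasm32")]
fn close_lib_handle(_handle: *mut std::ffi::c_void) {
    // Nothing to release on WASM; ONNX Runtime is statically linked there.
}
```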

decahedron1 commented 11 months ago

I got a simple MNIST test working on wasm32-unknown-emscripten. Since ONNX Runtime itself is compiled with Emscripten, I don't believe it would work on wasm32-unknown-unknown either way.

VivekPanyam commented 11 months ago

@decahedron1 Could you post your test code somewhere, please?

The Emscripten thing makes sense. Even if we compiled the rest of the code without Emscripten, we'd still need all the Emscripten runtime components to actually make ONNX Runtime itself work.

@katopz I know we spoke in https://github.com/VivekPanyam/carton/issues/159#issuecomment-1740248225 about you exploring wonnx and WASM, but would you be open to trying to get this working with ort?

Ideally, we'd first test that ort works from WASM (with and without WebGPU) and then we can build a basic ONNX runner that supports Linux, macOS and WASM.

decahedron1 commented 11 months ago

@VivekPanyam Certainly: https://github.com/decahedron1/carton-ort-wasm-example

It seems like WebGPU support with Microsoft's ONNX Runtime would be much more difficult than I was anticipating: you'd have to somehow include their JavaScript code (slightly more info in the PR: https://github.com/microsoft/onnxruntime/pull/14579) and connect it to the proper places, which I'm not sure is even possible with --build_wasm_static_lib. So wonnx might be worth exploring for GPU acceleration on the web.

VivekPanyam commented 11 months ago

Thank you! I'll check it out.

Okay, so then I think we have a few potential solutions:

1. wonnx on platforms with WebGPU available and ort everywhere else.

Straightforward, but could cause issues if a model works with ort but fails with wonnx (or vice versa).

2. Use wonnx everywhere (if it can also run without GPUs).

This provides a consistent user experience.

I think we'd need to explore inference performance and supported operators vs ort if we decide to look into the second approach.

3. Integrate all three runtimes into a single runner

Another approach is to integrate all three runtimes into a single runner and allow users to do the following:

I think there might be a way to do option 3 with a clean user experience, but we'd have to be careful about the default logic. It would be confusing to users (and could break things) if we changed the default implementation-selection logic after the runner was released.

Maybe a hybrid of 1 and 3 would work and users can decide to use WebGPU or not at inference time.

Proposal

I think we should start by implementing a runner that uses ort everywhere it's supported. We can then add in WebGPU support with wonnx and make it an explicit opt-in at inference time.

So it'll always use ort (and the "official" ONNX Runtime) unless you explicitly tell it to use wonnx with WebGPU.

And if we want to, there's nothing stopping us from extending that to tract. ort is always the default and everything else is an explicit opt-in.
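To make the opt-in concrete, it could look something like the sketch below (purely hypothetical names; nothing like this exists in Carton yet):

```rust
/// Hypothetical sketch of the proposed backend selection. `OnnxBackend` and
/// its variants are illustrative only; the real runner API would be designed
/// separately.
#[derive(Clone, Copy, Debug, Default)]
pub enum OnnxBackend {
    /// The default everywhere it's supported: ort (the official ONNX Runtime).
    #[default]
    Ort,
    /// Explicit opt-in at inference time: wonnx with WebGPU.
    WonnxWebGpu,
    /// Possible future explicit opt-in.
    Tract,
}
```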

Thoughts?

Also @decahedron1, would you be open to building/helping build a runner for Carton that uses ort?

If so, @katopz could continue exploring wonnx.

katopz commented 11 months ago

Will do; wonnx and ort are on my waiting list. Anyway, yesterday I tried to explore/build/compile the native/WASM examples from https://github.com/huggingface/candle (yes, I'm still evaluating things here). I'd like to know your thoughts on the candle approach.

pixelspark commented 11 months ago

@VivekPanyam I generally agree with your assessment. wonnx is an option if you are looking for a relatively lightweight (and Rust-native) way to run ONNX models on a GPU. In essence, wonnx translates ONNX models to WGSL shaders and executes them using wgpu on the GPU.

I have no experience with CPU-based implementations of WebGPU, apart from the fact that we use it in CI to run some tests. wonnx can't run on the web in WASM if the browser does not offer (or has disabled) WebGPU support. On the web, wgpu is merely a passthrough layer to the underlying browser-implemented WebGPU API (which in Firefox again is based on wgpu by the way!).

An important thing to consider is support for ops, which differs between the engines; wonnx certainly does not support all ops. Additionally, wonnx uses 'ahead of time' compilation of shaders (which means all shapes need to be known in advance; there is shape-inference functionality for this), so certain ops with dynamic shapes are not supported and will be very hard to support in the future.
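For anyone following along, running a model with wonnx looks roughly like this. This is a sketch based on wonnx's published examples; exact module paths and types may differ by version:

```rust
// A rough sketch based on wonnx's published examples; exact paths/types may
// differ between wonnx versions.
use std::collections::HashMap;
use wonnx::utils::InputTensor;

async fn run() -> Result<(), Box<dyn std::error::Error>> {
    // Shaders are compiled ahead of time here, so all shapes must be known
    // (or inferable) when the session is created.
    let session = wonnx::Session::from_path("model.onnx").await?;

    let input = vec![0.0f32; 28 * 28];
    let mut inputs = HashMap::new();
    inputs.insert("input".to_string(), InputTensor::from(input.as_slice()));

    // Executes the precompiled shaders via wgpu (the browser's WebGPU on web).
    let outputs = session.run(&inputs).await?;
    println!("{outputs:?}");
    Ok(())
}
```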

VivekPanyam commented 11 months ago

@pixelspark That makes sense. So explicit opt-in is probably a safe bet (as long as we can design that in a way that isn't confusing to users).

@pixelspark @decahedron1 Thank you both for taking the time to provide your thoughts!

VivekPanyam commented 11 months ago

> Will do; wonnx and ort are on my waiting list. Anyway, yesterday I tried to explore/build/compile the native/WASM examples from https://github.com/huggingface/candle (yes, I'm still evaluating things here). I'd like to know your thoughts on the candle approach.

@katopz see #164

In general, please try to keep issues focused on their original topic. For more open ended conversations, consider creating a discussion. Thanks!

VivekPanyam commented 11 months ago

@katopz do you want to build an ONNX runner using ort (and then we can add wonnx and WebGPU support once the runner is working)?

katopz commented 11 months ago

Sorry to say, but not anytime soon, because:

  1. I still have no idea how to accomplish that yet.
  2. I will get fired in the next 2 months, so no time for hobby projects yet. 🫠

In the meantime you can assign that task to anyone.

mstfbl commented 11 months ago

@VivekPanyam I also agree with your assessment: use pykeio/ort by default and have wonnx as an explicit opt-in. A case could be made for defaulting to wonnx with WebGPU, pending the performance seen in experiments comparing the two Rust ONNX wrappers.

I also agree with @pixelspark's comment on considering support for ops. It's more than reasonable to assume ONNX Runtime supports all standard operator kernels, and there seems to be support for contrib and custom ops as well, but I'd be careful starting out. At the moment pykeio/ort seems to target ONNX Runtime v1.15.1 whereas ONNX Runtime's latest version is v1.16.0, so for a given custom op it's worth verifying its support in pykeio/ort first.

VivekPanyam commented 11 months ago

@katopz I'm sorry to hear that. I hope things work out in a way you'd like them to.

VivekPanyam commented 11 months ago

@mstfbl Makes sense, thanks!

If anyone is interested in implementing a runner with ort, feel free to comment below :)