comp-imaging / ProxImaL

A domain-specific language for image optimization.
MIT License
Baremetal only version? #57

Closed by soufianekhiat 1 year ago

soufianekhiat commented 3 years ago

Since part of the core is based on C++ and Halide, is there a way to reproduce the results of the paper using only C++?

SteveDiamond commented 3 years ago

I don't know about that. You could ask Felix Heide (at Princeton now). He generated most of the paper results.

soufianekhiat commented 3 years ago

I could reframe my question as: could we use ProxImaL without the Python interface, or do some parts live only on the Python side?

jrk commented 3 years ago

Ah, I think that's a very different question — the Halide code generally lives under the C++ (giving implementations of some of the core operations), with Python above that. I think much of the fundamental methods' implementation lives at the Python level so it is not practical to cut that out, but @SteveDiamond is the right person for this query.

jrk commented 3 years ago

Can you say anything about your use case and motivation?

soufianekhiat commented 3 years ago

We have various use cases/motivations.

What form does this x86 code take? What if I target ARM, such as Apple's M1? There are two steps: describing the problem, then compiling it to obtain the solver. In my opinion the generated solver should be portable to any target.

Which is the strategy used by Taichi:

Since Halide can compute the adjoint of any compiled DAG, it gives us "stronger" power for optimization. For the solver, we can write one schedule per platform or use the auto-scheduler, because CUDA is only one target and OpenCL can be used for non-NVIDIA GPUs.

This is mostly a personal opinion: Halide does a great job on its core feature. Being C++ (with Python bindings) is what is responsible for its success at Adobe and Google, which implement an algorithm in Halide and then schedule (or generate) it for each target.

We could achieve the same pattern with ProxImaL. In an ideal world, that would look like:

  1. Describe the Problem [ProxImaL]
  2. Compile to have the solver as Halide Kernel (C++ or Python). [ProxImaL ->Halide]
  3. Use Halide to build a scheduler or auto_schedule for {CUDA, OpenCL, x86, ...} [Halide]

I totally understand it's not trivial; this is just my opinion on the question.

antonysigma commented 3 years ago

I am glad @jrk can chime in to help explore the next phase of ProxImaL. Before I comment, I have a disclaimer: I recently ported the project to the Nvidia Jetson embedded device https://github.com/comp-imaging/ProxImaL/pull/54.

Let me see if I understand @soufianekhiat 's use-case:

High-level requirement:

If we want to embed ProxImaL in a binary where Python is not an option. Python became a standard for the prototyping stage, but in various industries it's a "no go" for the industrialization/release stage.

I suggest that the title of the GitHub issue be changed to "Baremetal only version?" or, to be more precise, "Request to generate ProxImaL runtime code in a baremetal environment".

(Toolchain / System) level specification

Which is the strategy used by Taichi...

Since Halide can compute the adjoint of any compiled DAG, it gives us "stronger" power for optimization.

For the solver, we can write one schedule per platform or use the auto-scheduler, because CUDA is only one target and OpenCL can be used for non-NVIDIA GPUs.

In my humble opinion, your prototyping workflow is more similar to micro-TVM, which is currently in active development. They deconstructed the Halide toolchain and then reassembled the low-level arithmetic operations to generate LLVM bitcode that can be cross-compiled for a baremetal environment. https://tvm.apache.org/docs/microtvm/index.html

Hardware level specification

In the paper we have: "The proximal and linear operators are evaluated using a combination of NumPy and Halide-generated parallel and vectorized x86 code. We are planning to extend our framework to compile into Halide-generated GPU code." What form does this x86 code take? What if I target ARM, such as Apple's M1?

Could you clarify which specific device you are trying to run ProxImaL-generated code on? How much memory does it have? How wide are the underlying SIMD instructions?

For your reference, my own use case is to generate FlexISP-style runtime code on an Nvidia Jetson System-on-board. Its CPU execution environment comes with sufficient RAM and storage to run the Python / toolchain side of ProxImaL, while the CUDA code generated by Halide / PyCUDA runs on (baremetal) GPU devices.

Component level specification

We could achieve the same pattern with ProxImaL. In an ideal world, that would look like:

  1. Describe the Problem [ProxImaL]
  2. Compile to have the solver as Halide Kernel (C++ or Python). [ProxImaL ->Halide]
  3. Use Halide to build a scheduler or auto_schedule for {CUDA, OpenCL, x86, ...} [Halide]

I suggest that we fork out a separate Github issue to discuss this. @jrk and @SteveDiamond can chime in as well.

My first thought is that ProxImaL, just like FlexISP, generates a cyclic compute graph. The absence of cycles in Halide pipelines is a fundamental strength when customizing the compute schedule (or a weakness in your use case?).

As the authors suggested in the original article, I think the ProxImaL project can still grow by

  1. fusing more proximal.linops.* nodes in the proximal.CompGraph into a single Halide pipeline; and
  2. fusing the individual proximal update steps in ADMM into a Halide pipeline.

But eventually the Halide-generated code will have to hand control back to the ADMM while-loop to complete the "cycle". Whether the while-loop has to be coded in C++ or can remain in Python, I do not know.
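To make the "hand back control" point concrete, here is a minimal sketch in plain Python (no Halide, no NumPy) of an ADMM loop for the toy problem min 0.5*||x - b||^2 + lam*||z||_1 subject to x = z. The function names and the `prox_step` parameter are hypothetical and not part of ProxImaL; `prox_step` stands in for any acyclic kernel, such as a Halide-generated one, while the outer loop that closes the cycle stays in the host language.

```python
def soft_threshold(values, t):
    """Proximal operator of t * |.|, applied elementwise."""
    out = []
    for v in values:
        if v > t:
            out.append(v - t)
        elif v < -t:
            out.append(v + t)
        else:
            out.append(0.0)
    return out


def admm_l1_denoise(b, lam, rho=1.0, iters=100, prox_step=soft_threshold):
    """Host-side ADMM loop; each call inside the loop body is acyclic.

    prox_step is a placeholder (assumption) for a compiled kernel.
    """
    n = len(b)
    x = [0.0] * n
    z = [0.0] * n
    u = [0.0] * n
    for _ in range(iters):
        # x-update: closed-form minimizer of the quadratic term.
        x = [(bi + rho * (zi - ui)) / (1.0 + rho)
             for bi, zi, ui in zip(b, z, u)]
        # z-update: proximal step, the part a generated kernel could own.
        z = prox_step([xi + ui for xi, ui in zip(x, u)], lam / rho)
        # Dual update: control is back in the host loop here.
        u = [ui + xi - zi for ui, xi, zi in zip(u, x, z)]
    return z


if __name__ == "__main__":
    print(admm_l1_denoise([3.0, -0.5, 1.2], lam=1.0))
```

For this toy problem the iterates approach `soft_threshold(b, lam)`, which gives a simple way to check that the loop is wired correctly. The same loop structure could be written in C++ around ahead-of-time compiled kernels.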

soufianekhiat commented 3 years ago

I am glad @jrk can chime in to help explore the next phase of ProxImaL. Before I comment, I have a disclaimer: I recently ported the project to the Nvidia Jetson embedded device #54.

I followed this merge, which rekindled my interest in ProxImaL (:

(Toolchain / System) level specification

In my humble opinion, I think your prototyping workflow is more similar to the micro-TVM, currently in active development. They de-constructed the Halide-toolchain, and then reassembled the low-level arithmetic operations to generate LLVM Byte-code to cross-compile on baremetal environment. https://tvm.apache.org/docs/microtvm/index.html

I don't know; I wasn't aware of micro-TVM.

Hardware level specification

Could you clarify which specific device you are trying to run ProxImaL-generated code on? How much memory does it have? How wide are the underlying SIMD instructions?

For your reference, my own use case is to generate FlexISP-style runtime code on an Nvidia Jetson System-on-board. Its CPU execution environment comes with sufficient RAM and storage to run the Python / toolchain side of ProxImaL, while the CUDA code generated by Halide / PyCUDA runs on (baremetal) GPU devices.

To elaborate a bit on that: the reason for not using Python is not only related to memory. Some targets (legally) forbid the use of interpreted code (could that be fixed with Cython?). And on a game console with an AMD GPU, for instance, CUDA is not usable. It's hard for me to defend using Python with a call stack 400 functions deep just to multiply two matrices (with TensorFlow/NumPy). In a comparison with Eigen, C++ came out 400x faster (on one given benchmark). Sometimes we work in a context where everything (CPU, GPU, memory) is used at 99%, so any gain is welcome.

With Halide, by contrast, after writing a Func we can write a schedule and then compile_to_file, to a binary, to LLVM, and so on. That allows us to link it into any target, now that LLVM is pretty standard for shaders with SPIR-V. We can target {CUDA, OpenCL, DirectX} with the same schedule to get a compute shader that is piped into the rest of the pipeline. Another example is mobile with ARM: Halide allows us to compile by specifying a target with a given instruction set.

My first thought is that ProxImaL, just like FlexISP, generates a cyclic compute graph. The absence of cycles in Halide pipelines is a fundamental strength when customizing the compute schedule (or a weakness in your use case?).

Cyclic or not, I think we can live with it in Halide. In the context of an iterative solver, the cyclic part is manageable. We have an implementation of an LSTM in Halide.

antonysigma commented 2 years ago

I could reframe my question as: Could we use ProxImaL without the python interface? Or some stuff live only on the python side? From @soufianekhiat .

If the problem is convex in nature, @SteveDiamond's most recent work on CVXPYgen could be a better option. The generated C solver may not include any proximal / ADMM algorithms, nor any Halide code, but it should fit the needs of small-scale image restoration.

soufianekhiat commented 2 years ago

To clarify the needs: that's interesting, but I'm looking for a pipeline without Python at all. Let's say an artist builds a computational graph with a node-based editor; on my side I solve the problem they want (Halide auto-schedule before processing), or solve it directly with an optimization algorithm. I would like to add the features that ProxImaL offers, but based only on C++.

Full C++ with a Python wrapper is, in my opinion, the way to go (Halide, TensorFlow, ...) and the main reason for the success of those APIs in various industries. Various use cases simply forbid Python for various reasons (energy consumption, forbidden by design by the platform owner, safety, ...). I love Python for the prototyping phase, but when we want to release a product we need to get our hands dirty with C++ (: [Side note: I know some industries allow Python in the final product.]

SteveDiamond commented 2 years ago

This is a helpful clarification! The output of CVXPYgen does not involve any Python code though, so it may still be useful. Python is only involved in code generation.
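As an illustration of "Python only at code-generation time", here is a toy sketch of a Python function that emits a standalone C solver as a string. This is not how CVXPYgen works internally; the function name, the emitted C layout, and the fixed-iteration ADMM for an l1 denoising problem are all assumptions chosen to keep the example small.

```python
def generate_c_solver(lam, rho, iters, name="l1_denoise"):
    """Return C source for a fixed-iteration ADMM solver (hypothetical layout).

    Problem constants are baked in at generation time, so the emitted C
    file has no dependency on Python.
    """
    rho = float(rho)
    thresh = float(lam) / rho
    return f"""#include <stddef.h>

static float soft(float v, float t) {{
    return v > t ? v - t : (v < -t ? v + t : 0.0f);
}}

/* x, z, u: caller-provided work buffers of length n; result is left in z. */
void {name}(const float *b, float *x, float *z, float *u, size_t n) {{
    for (size_t i = 0; i < n; ++i) {{ x[i] = z[i] = u[i] = 0.0f; }}
    for (int k = 0; k < {int(iters)}; ++k) {{
        for (size_t i = 0; i < n; ++i) {{
            x[i] = (b[i] + {rho!r}f * (z[i] - u[i])) / (1.0f + {rho!r}f);
            z[i] = soft(x[i] + u[i], {thresh!r}f);
            u[i] += x[i] - z[i];
        }}
    }}
}}
"""


if __name__ == "__main__":
    # Write the generated source next to the script; compile it separately.
    with open("l1_denoise.c", "w") as f:
        f.write(generate_c_solver(lam=1.0, rho=2.0, iters=50))
```

The emitted file can then be compiled by whatever toolchain the target platform allows. The limitation raised in the thread still applies: the problem must be known when the generator runs.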

soufianekhiat commented 2 years ago

In fact, CVXPYgen itself needs Python, so unless I know the problem to solve beforehand, I need to ship Python and a C++ compiler (build DLL + hot reload) with the tool. Currently I'm building everything programmatically with Halide + JIT.

antonysigma commented 2 years ago

Got it. Porting ProxImaL or CVXPYgen to a pure C++ implementation would take a mammoth refactoring effort with a low return on investment. Have you considered deploying the pipeline using the Canadian-cross build process?

That is, create ProxImaL-Gen or CVXPYgen to run on the PC (Windows or Linux) and export Halide code, then compile the Halide AOT-backed generator. Execute the generator app to export the static library containing the ADMM solver pipeline. Lastly, statically or dynamically link the library into your (potentially RISC-V or 64-bit ARM) runtime environment.
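The "execute the generator app" step can be sketched as a small build-host helper that assembles one generator invocation per target. The binary path and generator name below are hypothetical. The flags follow the Halide generator command-line convention as I understand it (`-g`, `-f`, `-o`, `-e`, `target=`), and emit-option names such as `c_header` have varied between Halide releases, so treat them as assumptions to check against your Halide version.

```python
def generator_commands(generator_bin, generator_name, out_dir, targets):
    """Build one argv list per Halide target string.

    generator_bin and generator_name are placeholders for an
    ahead-of-time generator executable compiled on the build host.
    """
    cmds = []
    for target in targets:
        # Give each target its own function name so that several
        # variants can be linked into the same runtime binary.
        func_name = f"{generator_name}_{target.replace('-', '_')}"
        cmds.append([
            generator_bin,
            "-g", generator_name,
            "-f", func_name,
            "-o", out_dir,
            "-e", "static_library,c_header",  # assumed emit-option names
            f"target={target}",
        ])
    return cmds


if __name__ == "__main__":
    for cmd in generator_commands(
        "./admm.generator", "admm_step", "build",
        ["x86-64-linux-avx2", "arm-64-linux", "x86-64-linux-cuda"],
    ):
        print(" ".join(cmd))
```

Each resulting static library and header can then be linked into the C++ runtime for its platform, with no Python present on the device.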

I don't speak for @SteveDiamond, the project owner. But from my past experience, he is willing to accept PRs, whether they are orphaned working code in Halide, C++, or Python. I suggest @soufianekhiat start by porting https://github.com/comp-imaging/ProxImaL/blob/master/proximal/algorithms/linearized_admm.py to Halide. Other contributors will follow suit, I believe.

soufianekhiat commented 1 year ago

I wasn't aware of the term "Canadian-cross build process". I'll do my best to contribute, and I'll take a look at linearized_admm. Thanks.