AriMKatz opened this issue 4 years ago
Happy to help give pointers if you want to hack on any of these things. Something in the ONNX/FluxJS/deployment bucket would be easy to get started with. WebAssembly.jl is solid and would probably make Mjolnir->WASM quite easy.
That all sounds quite excellent; I'm very excited about this work and what it bodes for my being able to use Julia at work :)
To start, I'd like to explore emitting code for resource-constrained systems. My initial inclination is that the easiest first target would be TensorFlow Lite, which already handles things like quantization, and potentially even TF Lite for Microcontrollers to reach even lighter targets like https://www.youtube.com/watch?v=HzCRZsGJLbI. Another possible target is https://github.com/google/iree.
Though, to what extent would it be a good idea to skip all that and just work on emitting slim C code? Especially since I'm not sure yet whether TF Lite for Microcontrollers allows the use of custom ops.
I'm going to have to do a bit more digging to sharpen this, but these are my initial thoughts.
Edit: I don't want to get ahead of myself, though. Perhaps just focusing on basic TF Lite for now would be best, though I'd need to be able to integrate custom ops.
Another question I need to explore is at what point in the stack quantization needs to happen: https://blog.tensorflow.org/2020/04/quantization-aware-training-with-tensorflow-model-optimization-toolkit.html
It's a little clumsy right now, but here's how you can get a graph for the forward pass of a simple model, ready to deploy:
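Something along these lines (a minimal sketch; it assumes Mjolnir's `@trace` macro works on a call the way the README shows, and the exact invocation for Flux models may differ):

```julia
using Flux, Mjolnir

# A small Flux model whose forward pass we want as a static graph.
m = Chain(Dense(10, 5, relu), Dense(5, 2))
x = rand(Float32, 10)

# Partially evaluate the forward pass into a typed IR graph.
# Assumption: Mjolnir's @trace accepts a concrete call like this.
ir = Mjolnir.@trace m(x)
```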
Turning this into a graph for whatever framework, or even C code, should be pretty straightforward.
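For instance, a backend could just walk the statements of the traced IR. This is a hypothetical skeleton: the IRTools-style iteration over the IR is an assumption, and `emit_op!` and `graph` are made-up placeholders for a real target, not Mjolnir APIs.

```julia
using IRTools

# Hypothetical skeleton: visit each SSA statement in the traced IR and
# hand every function call off to a backend-specific emitter.
function lower!(graph, ir)
  for (v, st) in ir
    if st.expr isa Expr && st.expr.head == :call
      emit_op!(graph, v, st.expr.args...)  # map the call to a target op
    end
  end
  return graph
end
```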
> to what extent would it be a good idea to skip all that and just work on emitting slim C code?
I think this could be a nice approach; the main potential problem is that we support broadcasting/mapping arbitrary functions. That's hard to do in C, but it might be possible with a templated C++ library like Eigen. XLA can do it too, so perhaps TF Lite can as well. The other option is to only support built-in activation functions.
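To make the distinction concrete (a hypothetical illustration, not code from this thread):

```julia
using Flux

# A built-in activation broadcasts a known function, so a C runtime can
# dispatch to a fixed, precompiled kernel:
Dense(10, 5, relu)

# An arbitrary closure requires compiling user code into the broadcast,
# which plain C can't express (templated C++, or XLA, can):
Dense(10, 5, x -> tanh(x) + 0.1f0 * x)
```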
Theoretically, I think you may even be able to just get XLA to dump object code, but I've no idea how hard that is in practice.
> at what point in the stack quantization needs to happen
This is a good question that I'm not sure of either. AIUI you can potentially do quantisation (and similar things like weight pruning) before training or after it, as a deployment optimisation. It feels like that could be a fairly straightforward API in Flux (basically an fmap to convert the weights, plus making sure we support low precision in the AD), but I'm not sure whether these techniques ever take advantage of the network structure somehow.
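A rough sketch of the fmap idea (the Float16 conversion here is just a stand-in for precision reduction; real integer quantisation also needs scale/zero-point bookkeeping, which this omits):

```julia
using Flux  # fmap comes from Functors.jl and is re-exported by Flux

m = Chain(Dense(10, 5, relu), Dense(5, 2))

# Stand-in for quantisation: drop every floating-point array to Float16,
# leaving everything else in the model untouched.
lower(x::AbstractArray{<:AbstractFloat}) = Float16.(x)
lower(x) = x

m16 = fmap(lower, m)
```

The same fmap pattern would presumably cover weight pruning too, by zeroing small entries instead of converting element types.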
Hello Mike,
In the spirit of your README, I'm wondering to what extent this package can, or is intended to, address some common pain points aside from speeding up Flux/Zygote:
For those that apply, are they planned roadmap items, and if not, how much additional work would they require?
Thanks