FluxML / ONNX.jl

Read ONNX graphs in Julia

Faster batch size = 1 inference #77

Open Moelf opened 1 year ago

Moelf commented 1 year ago

There are certain applications that need to do one inference at a time, for example when analyzing large datasets: https://indico.bnl.gov/event/15089/contributions/68235/attachments/43511/73312/Moneta-ROOT-FutureAnalysis.pdf#page=25

How can we make it faster and allocate less? I'd be happy to work on it.

dfdx commented 1 year ago

Could you please clarify what exactly needs to be faster? If you have an ONNX graph, or a Julia graph that can be exported to ONNX, then the simplest way to execute it faster seems to be to run it with ONNXRunTime. Is there a reason that won't work?

Moelf commented 1 year ago

ONNXRuntime is designed to run on big batches; look at the slides.

dfdx commented 1 year ago

Sorry, I still don't understand your proposal/request. This package is about conversion between Julia functions (mostly NNlib/Flux) and the ONNX format. If you want an ONNX graph to run faster for single-item batches, then it might be better to post this issue in ONNXRuntime's repo (or the repo of some other ONNX engine). If you want it to run faster on the Julia side, i.e. faster Umlaut.play!(tape), then we need at least some use cases to measure the performance and understand the bottlenecks.

Moelf commented 1 year ago

Yes, I want faster Ghost.play!() for batch size = 1.

For example, can we pre-allocate memory?
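
As a rough illustration of the kind of buffer reuse I have in mind (a minimal sketch with a hand-rolled dense layer; PreallocDense and all sizes are made up for the example, not ONNX.jl API):

```julia
using LinearAlgebra: mul!
using NNlib: relu

# Sketch: keep an output buffer alive across calls instead of
# allocating a fresh array on every inference.
struct PreallocDense{TW,Tb,Ty}
    W::TW
    b::Tb
    y::Ty  # preallocated output buffer
end

PreallocDense(W, b) = PreallocDense(W, b, similar(b))

function (d::PreallocDense)(x::AbstractVector)
    mul!(d.y, d.W, x)         # in-place matvec, no allocation
    d.y .= relu.(d.y .+ d.b)  # fused broadcast into the same buffer
    return d.y
end

W, b = randn(Float32, 128, 64), randn(Float32, 128)
layer = PreallocDense(W, b)
x = randn(Float32, 64)
layer(x)  # repeated calls reuse the same buffer
```

Doing something like this for a whole tape would presumably mean allocating one buffer per op up front.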

Moelf commented 1 year ago

ONNXRuntime's repo

The slides [inline benchmark image omitted] explicitly show SOFIE beating ONNXRuntime. But yes, I understand "we don't have much motivation to make Julia faster than ONNXRuntime" is a valid response; if so, I can close the issue.

ToucheSir commented 1 year ago

Reading through the documentation on SOFIE, it seems like they generate custom C++ code for every model. ONNX.jl used to do this (with Julia code), but it's really cumbersome and difficult to integrate into a normal workflow. I doubt the overhead from having to "interpret" ops instead of running source code is that high though, doubly so with the tape-based approach this package uses.

Personally, I would be interested in where some of the bottlenecks are for inference with Julia DL libraries. ONNX.jl uses a number of functions from NNlib, so improvements would likely be made there and then help out other packages as well.
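
Something like the following (shapes picked arbitrarily) would be a starting point for measuring where time and allocations go at batch size 1 in the NNlib ops ONNX.jl dispatches to:

```julia
using BenchmarkTools
using NNlib

# Arbitrary batch-size-1 shapes, just to see where time and
# allocations go in the default NNlib kernels.
x = rand(Float32, 32, 32, 16, 1)  # WHCN layout, batch = 1
w = rand(Float32, 3, 3, 16, 32)   # 3x3 conv, 16 -> 32 channels

@btime NNlib.conv($x, $w)         # default im2col-based convolution
@btime NNlib.maxpool($x, (2, 2))  # default pooling kernel
```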

Moelf commented 1 year ago

like they generate custom C++ code for every model.

Indeed, I don't know why they re-invent stuff like that; the NN primitives are just what they are. I can only guess (uneducated) that the overhead is mainly the tape playing (and that the allocation pattern of this pipeline is optimized for bigger batches?).

ToucheSir commented 1 year ago

Many of the default kernels in NNlib are not terribly optimized and could definitely be improved upon. NNPack used to provide optimized versions of a couple, but that was dropped at some point for correctness reasons, as I understand it. There was also an attempt to write optimized kernels in pure Julia, but I believe that stalled due to lack of time and not having a way to work around LoopVectorization.jl's latency penalty.

Moelf commented 1 year ago

I thought the non-CUDA part of NNlib was already in Julia, but I guess I'm wrong, huh.

dfdx commented 1 year ago

Compiling a tape is the easy part. You can already compile it into native Julia code using Umlaut.compile(tape), or you can generate custom Julia, C, CUDA, or whatever other code.
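
For example (with the model path and input shape as placeholders):

```julia
using ONNX, Umlaut

x = rand(Float32, 224, 224, 3, 1)  # placeholder input
tape = ONNX.load("model.onnx", x)  # import the graph as an Umlaut tape

y = Umlaut.play!(tape, x)          # interpreted: walks the tape op by op

f = Umlaut.compile(tape)           # compiled: a plain Julia function with
                                   # no per-op interpretation overhead
                                   # (its calling convention mirrors the
                                   # tape's inputs)
```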

The hard part is deciding which optimizations to apply. Some use cases will benefit from graph structure optimization (which ONNXRuntime is remarkably good at, by the way). Others can be accelerated by kernel fusion. Yet others need buffers and in-place operations. Perhaps the best way to attack this issue is to collect a set of real-life use cases and start experimenting.

(As a side note, I recently switched this package from Ghost to Umlaut due to a weird dependency compatibility issue. Umlaut is a drop-in replacement, though, so you can safely ignore the difference.)

ToucheSir commented 1 year ago

I thought the non-CUDA part of NNlib was already in Julia

It is, but not all of the algorithms used are optimal. For example, the default conv algorithm (im2col) isn't very memory-efficient, and pooling runs single-threaded.
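
For example (arbitrary shapes): even if you preallocate the output, the im2col path still allocates its column workspace on every call, as far as I can tell:

```julia
using NNlib
using BenchmarkTools

x = rand(Float32, 64, 64, 8, 1)
w = rand(Float32, 3, 3, 8, 16)
cdims = DenseConvDims(x, w)
y = similar(x, NNlib.output_size(cdims)..., 16, 1)  # preallocated output

@btime NNlib.conv!($y, $x, $w, $cdims)  # still reports allocations from
                                        # the internal im2col workspace
```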

Moelf commented 1 year ago

https://indico.cern.ch/event/1176076/contributions/4939648/attachments/2474114/4245117/SOFIE%40ICHEP.pdf#page=11

More slides for future reference. I can do a benchmark later comparing ONNX.jl to ONNXRuntime (a skeleton is sketched below), but the slides claim to be faster than ONNXRuntime, so...
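
Something along these lines, with ONNXRunTime.jl as the baseline (the model path and input name are placeholders):

```julia
using BenchmarkTools
using ONNX, Umlaut
import ONNXRunTime

x = rand(Float32, 224, 224, 3, 1)  # placeholder input

tape = ONNX.load("model.onnx", x)
@btime Umlaut.play!($tape, $x)     # ONNX.jl, interpreted tape

ort = ONNXRunTime.load_inference("model.onnx")
input = Dict("input" => x)         # key must match the model's input name
@btime $ort($input)                # ONNXRuntime baseline
```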