prigoyal opened this issue 6 years ago
This is related to support for explicit loops in the frontend. Supporting variable sizes (i.e. symbolic shapes, as in TVM) in general is probably a substantial change, but allowing explicit loops, declared in frontend code, over a single variable-sized dimension would be very similar to the existing support for variable batch sizes and would cover many NLP use cases (like RNNs/QRNNs).
Technically, this is simpler than it looks. Most of the compilation flow should support this transparently (Halide and polyhedral passes). For example, polyhedral scheduling is meant to operate on symbolic parameters and we have an option to substitute them with inferred numerical values before or _after_ the scheduling itself. We even have tests that emit parametric code.
However, this will degrade performance. Simply put, the more information we have about an operation, the deeper we can analyze it and the better it can be optimized. So I'd argue for generating code that is as specialized as possible.
The main problem with RNNs now would be their outer sequentiality. But this is mostly orthogonal to variable sizes.
An RNN kernel would look something like this:
def elman_rnn(float(T,B,Ci) input, float(B,Co) h0, float(Ci,Co) i2h, float(Co,Co) h2h) -> (hidden) {
    for t in T {
        if t == 0 { hidden(t,b,co) +=! h2h(ci,co) * h0(b,ci) }
        else      { hidden(t,b,co) +=! h2h(ci,co) * hidden(t-1,b,ci) }
        hidden(t,b,co) += i2h(ci,co) * input(t,b,ci)
    }
}
which does indeed seem pretty annoying to support, well beyond just the variable T. I was also wrong: TC doesn't appear to currently support optimizing for a variable batch size.
A QRNN kernel would also have these issues, just without the reduction inside the loop.
Indeed, the "imperative syntax" proposed earlier in the TC context is not yet implemented in the language, and generating efficient code for it is annoying to support. The "imperative loop" is outer-sequential, and, assuming the frontend supported that syntax, the current compilation pass would just map the computation to a single block. Even a naïve mapping to more than one block would require emitting global synchronizations, which we currently cannot do. But this seems orthogonal to variable batch sizes.
Turning on parametric batch sizes is a small change. In general, TC looks at the actual sizes of the supplied tensors and infers numerical values for all symbolic parameters. These values are then substituted in some optimization passes. Disabling this substitution altogether looks trivial; disabling it for a specific parameter, however, requires the user to somehow tell us which one.
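For concreteness, here is a minimal sketch of how that inference and substitution looks from the user's side (a sketch against the PyTorch frontend; the exact call signatures are from memory and may differ slightly, and the matmul kernel is just an illustration):

import torch
import tensor_comprehensions as tc

lang = """
def matmul(float(M, K) A, float(K, N) B) -> (C) {
    C(m, n) +=! A(m, kk) * B(kk, n)
}
"""
matmul = tc.define(lang, name="matmul")

# M, K, N are symbolic in the TC definition, but at call time TC reads the actual
# tensor shapes, substitutes the inferred numerical values, and JIT-compiles a
# kernel specialized for (3, 4, 5).
a, b = torch.randn(3, 4).cuda(), torch.randn(4, 5).cuda()
c = matmul(a, b)

# A different shape triggers a separate compilation (and cache entry), because the
# substitution currently applies to all parameters; there is no per-parameter
# switch to keep, say, M symbolic.
a2, b2 = torch.randn(7, 4).cuda(), torch.randn(4, 5).cuda()
c2 = matmul(a2, b2)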
Let me first share some context about parametric sizes and then a short-term "solution" that should work in practice.
Solving the general problem is complicated: it requires inferring proper parameter regions and devising a proper strategy for each region, and making that push-button and "just work" is a longer-term goal. Additionally, emitting symbolic code involves inefficiencies (control flow that would just disappear with JIT'ing, or missing information about parameter ranges that messes up internal heuristics).
The current approach works because it is pragmatic and goes for the lowest-hanging fruit; however, it suffers from needing to autotune for the various sizes if one isn't careful. Compilation given fixed options, on the other hand, is not a big deal: it already happens under the hood all the time when SASS is emitted from PTX (the first time you run a kernel; the result then gets cached to disk).
One simple way to circumvent the autotuning pain on the user side is to just reuse the options found by an autotuning run when you change sizes. This would give the same type of behavior one would get from parametric codegen: (a) with parametric codegen, the options don't change and the code is compiled, cached and autotuned only once; (b) with options reuse, the options don't change and the code is autotuned only once, but it is compiled and cached for each new size. This should still significantly improve the user experience.
So for a short-term improvement in the workflow I would recommend:
1. autotune once for a representative set of sizes
2. reuse the best options found in 1 when the sizes change
2b. alternatively use the options in 1 as a starting point for calling a small autotune run with few generations/candidates to do some quick minimal exploration
3. compile
4. run

Hiding the above from the user amounts to solving the general problem, but I think the faster workflow outlined in 1-4 is very easy to set up and should improve things significantly. @jekbradbury's example about RNNs is different; we have similar things in the works, but only at a conceptual stage atm.
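A rough sketch of that 1-4 workflow with the PyTorch bindings could look as follows (treat the autotune/cache argument names as approximations of the current API rather than a definitive recipe; the matmul kernel and cache name are made up for illustration):

import torch
import tensor_comprehensions as tc

lang = """
def matmul(float(M, K) A, float(K, N) B) -> (C) {
    C(m, n) +=! A(m, kk) * B(kk, n)
}
"""
matmul = tc.define(lang, name="matmul")

# 1. autotune once for a representative set of sizes and cache the best options
a, b = torch.randn(128, 64).cuda(), torch.randn(64, 256).cuda()
matmul.autotune(a, b, cache="matmul_opts", generations=5, pop_size=20)

# 2.-4. when the sizes change, reuse the cached options; compilation for the new
#       sizes still happens once under the hood, then the kernel just runs
a2, b2 = torch.randn(200, 64).cuda(), torch.randn(64, 256).cuda()
c2 = matmul(a2, b2, cache="matmul_opts")

# 2b. alternatively, seed a very short autotuning run for the new sizes from the
#     cached options (few generations, small population) before running
matmul.autotune(a2, b2, cache="matmul_opts", generations=2, pop_size=10)
c2 = matmul(a2, b2, cache="matmul_opts")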
@Roadwalker does the above sound like a reasonable solution?
@jekbradbury @seongwook-ham see if #225 starts addressing your needs. The code is still JIT'ed and will take a few seconds for each new parameter size, but autotuning results can easily be reused.
If you have real-world parametric needs where you see many different values for one parameter (i.e. > 100), then we can probably support 1-3 symbolic parameters in the relatively short term, but it would be great to work from a concrete example.
RNN loops are still outside of the scope at this point though.
Is there a way for us to support variable tensor sizes? Right now, if the tensor size changes, we have to recompile and cache. But often in computer vision and NLP, people have models where the tensor size for a layer changes at every step of training.
for example:
should the user be writing this for all input sizes, compiling and caching every time?
This is a big use case for computer vision and NLP models, and having to compile at every step is going to slow things down quite a bit.
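Concretely, the pattern looks something like this (a sketch with made-up sequence lengths and a toy ReLU kernel; call signatures approximate the PyTorch frontend):

import torch
import tensor_comprehensions as tc

lang = """
def relu3d(float(T, B, C) I) -> (O) {
    O(t, b, c) = fmax(I(t, b, c), 0)
}
"""
relu3d = tc.define(lang, name="relu3d")

# The sequence length T differs from batch to batch; every new value of T currently
# triggers a fresh JIT compilation (and cache entry) before the kernel can run.
for T in (27, 31, 45, 27, 52):           # made-up lengths from successive batches
    x = torch.randn(T, 32, 512).cuda()
    y = relu3d(x)                        # only the repeated T=27 reuses a cached kernel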