prigoyal opened this issue 6 years ago
This is related to support for explicit loops in the frontend. Supporting variable sizes (i.e. symbolic shapes, as in TVM) in general is probably a substantial change, but allowing explicit loops, declared in frontend code, over a single variable-sized dimension would be very similar to the existing support for variable batch sizes and would cover many NLP use cases (like RNNs/QRNNs).
Technically, this is simpler than it looks. Most of the compilation flow should support this transparently (Halide and polyhedral passes). For example, polyhedral scheduling is meant to operate on symbolic parameters and we have an option to substitute them with inferred numerical values before or _after_ the scheduling itself. We even have tests that emit parametric code.
However, this will degrade performance. Simply put, the more information we have about an operation, the deeper we can analyze it and the better it can be optimized. So I'd argue for generating code that is as specialized as possible.
The main problem with RNNs now would be their outer sequentiality. But this is mostly orthogonal to variable sizes.
An RNN kernel would look something like this:
def elman_rnn(float(T,B,Ci) input, float(B,Co) h0, float(Ci,Co) i2h, float(Co,Co) h2h) -> (hidden) {
    for t in T {
        if t == 0 { hidden(t,b,co) +=! h2h(ci,co) * h0(b,ci) }
        else      { hidden(t,b,co) +=! h2h(ci,co) * hidden(t-1,b,ci) }
        hidden(t,b,co) += i2h(ci,co) * input(t,b,ci)
    }
}
which does indeed seem pretty annoying to support, well beyond just the variable T. I was also wrong: TC doesn't appear to currently support optimizing for a variable batch size.
A QRNN kernel would also have these issues, just without the reduction inside the loop.
Indeed, the "imperative syntax" proposed earlier in the TC context is not yet implemented in the language, and generating efficient code for it is annoying to support. The "imperative loop" is outer-sequential, and, assuming the frontend supported that syntax, the current compilation pass would just map the computation to a single block. Even a naïve mapping to more than one block would require emitting global synchronizations, which we currently cannot do. But this seems orthogonal to variable batch sizes.
Turning on parametric batch sizes is a small change. In general, TC looks at the actual sizes of the supplied tensors and infers numerical values for all symbolic parameters. These values are then substituted in some optimization passes. Disabling this substitution altogether looks trivial; disabling it for a specific parameter, however, requires the user to somehow tell us which one.
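For concreteness, here is a minimal sketch of how that inference and substitution looks from the user's side (a sketch against the PyTorch frontend; the exact call signatures are from memory and may differ slightly, and the matmul kernel is just an illustration):

import torch
import tensor_comprehensions as tc

lang = """
def matmul(float(M, K) A, float(K, N) B) -> (C) {
    C(m, n) +=! A(m, kk) * B(kk, n)
}
"""
matmul = tc.define(lang, name="matmul")

# M, K, N are symbolic in the TC definition, but at call time TC reads the actual
# tensor shapes, substitutes the inferred numerical values, and JIT-compiles a
# kernel specialized for (3, 4, 5).
a, b = torch.randn(3, 4).cuda(), torch.randn(4, 5).cuda()
c = matmul(a, b)

# A different shape triggers a separate compilation (and cache entry), because the
# substitution currently applies to all parameters; there is no per-parameter
# switch to keep, say, M symbolic.
a2, b2 = torch.randn(7, 4).cuda(), torch.randn(4, 5).cuda()
c2 = matmul(a2, b2)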
Let me first share some context about parametric sizes and then a short-term "solution" that should work in practice.
Solving the general problem is complicated: it requires inferring proper parameter regions and devising a proper strategy for each region, and making that push-button and "just work" is a longer-term goal. Additionally, emitting symbolic code involves inefficiencies (control flow that would just disappear with JIT'ing, or missing information about parameter ranges that messes up internal heuristics).
The current approach works because it is pragmatic and goes for the lowest-hanging fruit; however, it suffers from needing to autotune for the various sizes if one isn't careful. Compilation given fixed options, on the other hand, is not a big deal: it already happens under the hood all the time when SASS is emitted from PTX (the first time you run a kernel; the result then gets cached to disk).
One simple way to circumvent the autotuning pain on the user side is to just reuse the options found by an autotuning run when you change sizes. This would give the same type of behavior one would get from parametric codegen: (a) with parametric codegen, the options don't change and the code is compiled, cached and autotuned only once; (b) with options reuse, the options don't change and the code is autotuned only once, but it is compiled and cached for each new size. This should still significantly improve the user experience.
So for a short-term improvement in the workflow I would recommend:
1. autotune once for a representative set of sizes
2. reuse the best options found in 1 when the sizes change
2b. alternatively use the options in 1 as a starting point for calling a small autotune run with few generations/candidates to do some quick minimal exploration
3. compile
4. run

Hiding the above from the user amounts to solving the general problem, but I think the faster workflow outlined in 1-4 is very easy to set up and should improve things significantly. @jekbradbury's example about RNNs is different; we have similar things in the works, but only at a conceptual stage atm.
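A rough sketch of that 1-4 workflow with the PyTorch bindings could look as follows (treat the autotune/cache argument names as approximations of the current API rather than a definitive recipe; the matmul kernel and cache name are made up for illustration):

import torch
import tensor_comprehensions as tc

lang = """
def matmul(float(M, K) A, float(K, N) B) -> (C) {
    C(m, n) +=! A(m, kk) * B(kk, n)
}
"""
matmul = tc.define(lang, name="matmul")

# 1. autotune once for a representative set of sizes and cache the best options
a, b = torch.randn(128, 64).cuda(), torch.randn(64, 256).cuda()
matmul.autotune(a, b, cache="matmul_opts", generations=5, pop_size=20)

# 2.-4. when the sizes change, reuse the cached options; compilation for the new
#       sizes still happens once under the hood, then the kernel just runs
a2, b2 = torch.randn(200, 64).cuda(), torch.randn(64, 256).cuda()
c2 = matmul(a2, b2, cache="matmul_opts")

# 2b. alternatively, seed a very short autotuning run for the new sizes from the
#     cached options (few generations, small population) before running
matmul.autotune(a2, b2, cache="matmul_opts", generations=2, pop_size=10)
c2 = matmul(a2, b2, cache="matmul_opts")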
@Roadwalker does the above sound like a reasonable solution?
@jekbradbury @seongwook-ham see if #225 starts addressing your needs. The code is still JIT'ed and will take a few seconds for each new parameter size, but autotuning results can easily be reused.
If you have real-world parametric needs where you see many different values for one parameter (i.e. > 100), then we can probably support 1-3 symbolic parameters in the relatively short term, but it would be great to work from a concrete example.
RNN loops are still outside of the scope at this point though.
Is there a way for us to support variable tensor sizes? Right now, if the tensor size changes, we have to recompile and cache. But often in computer vision and NLP, people have models where the tensor size for a layer changes at every step of training.
for example:
should the user be writing this for all input sizes, compiling and caching every time?
This is a big use case for computer vision and NLP models, and having to compile at every step is going to slow things down quite a bit.
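Concretely, the pattern looks something like this (a sketch with made-up sequence lengths and a toy ReLU kernel; call signatures approximate the PyTorch frontend):

import torch
import tensor_comprehensions as tc

lang = """
def relu3d(float(T, B, C) I) -> (O) {
    O(t, b, c) = fmax(I(t, b, c), 0)
}
"""
relu3d = tc.define(lang, name="relu3d")

# The sequence length T differs from batch to batch; every new value of T currently
# triggers a fresh JIT compilation (and cache entry) before the kernel can run.
for T in (27, 31, 45, 27, 52):           # made-up lengths from successive batches
    x = torch.randn(T, 32, 512).cuda()
    y = relu3d(x)                        # only the repeated T=27 reuses a cached kernel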