facebookresearch / TensorComprehensions

A domain specific language to express machine learning workloads.
https://facebookresearch.github.io/TensorComprehensions/
Apache License 2.0

Allow using scalars for bounds inference #73

Open prigoyal opened 6 years ago

prigoyal commented 6 years ago

We can't use scalar inputs in the bounds inference right now. So for example:

LANG="""
def avgpool(float(B, C, H, W) input, float kH, float kW, float sH, float sW) -> (output) {
    output(b, c, h, w) += input(b, c, h * sH + kh, w * sW + kw) where kh in 0:kH, kw in 0:kW
}
"""

LANG="""
def avgpool(float(B, C, H, W) input, float kH, float kW) -> (output) {
    output(b, c, h, w) += input(b, c, h + kh, w + kw) where kh in 0:kH, kw in 0:kW
}
"""

avgpool = tc.define(LANG, name="avgpool")
inp = torch.ones(1, 1, 4, 4).cuda()
kH = torch.randn(1).fill_(2.0).cuda()
kW = torch.randn(1).fill_(2.0).cuda()
sH = torch.randn(1).fill_(1.0).cuda()
sW = torch.randn(1).fill_(1.0).cuda()
out = avgpool(inp, kH, kW, sH, sW)

This will fail.

The workaround right now is to do proper substitution for those scalars in the TC before passing them to the backend.
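
A minimal sketch of that substitution workaround, assuming a simple placeholder convention (none of this is TC API; it is plain string templating done before the string reaches the backend):

```python
# Hedged sketch of the current workaround: substitute concrete values for the
# scalar parameters into the TC string before handing it to tc.define.
# The placeholder syntax (<name>) and the helper are illustrative only.
AVGPOOL_TEMPLATE = """
def avgpool(float(B, C, H, W) input) -> (output) {
    output(b, c, h, w) += input(b, c, h * <sH> + kh, w * <sW> + kw)
        where kh in 0:<kH>, kw in 0:<kW>
}
"""

def instantiate(template, **params):
    """Textually replace each <name> placeholder with its concrete value."""
    for name, value in params.items():
        template = template.replace("<{}>".format(name), str(value))
    return template

lang = instantiate(AVGPOOL_TEMPLATE, kH=2, kW=2, sH=1, sW=1)
# lang is now a fully concrete TC (e.g. "kh in 0:2"), which bounds
# inference can handle; it can then be passed to tc.define as usual.
```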

abadams commented 6 years ago

Those inputs are 1-D, not scalar. I think you want:

def avgpool(float(B, C, H, W) input, float kH, float kW, float sH, float sW) -> (output) {
    output(b, c, h, w) += input(b, c, h * sH + kh, w * sW + kw) where kh in 0:kH, kw in 0:kW
}

def avgpool(float(B, C, H, W) input, float kH, float kW) -> (output) {
    output(b, c, h, w) += input(b, c, h + kh, w + kw) where kh in 0:kH, kw in 0:kW
}
prigoyal commented 6 years ago

yep, thanks for catching, updating the description :)

prigoyal commented 6 years ago

For more context, copy-pasting some information from our internal discussion on this:

Andrew: "The issue is that it requires us to do compute outside of the scope where we allocate outputs. E.g. we need to load from input tensors before we allocate outputs. But input tensors may only exist in device memory, etc..."

Andrew: "It requires a change to our model of computation."

Andrew: "One potential approach is to special-case zero-dimensional tensors. But that probably requires a bunch of pytorch plumbing changes to treat them as actual scalars instead of zero-d tensors and it would be disappointing/surprising if you can use t() in a where clause but not t(3) E.g. you may want to pass the pooling shape as a 1-d tensor with two entries"

jekbradbury commented 6 years ago

Those PyTorch plumbing changes have mostly already happened and 0.4 will have first-class scalars

abadams commented 6 years ago

By pytorch plumbing changes I meant changes within TC's flow from inputs into the kernel more generally. Even with those plumbing changes, I'd consider a type-system difference between t() and t(3) to be a language wart. I think we just need to accept that bounds evaluation will potentially need to load values from tensors as well as inspect their shapes (which is what it currently does).

ftynse commented 6 years ago

I think that allowing inference to read values of the inputs, and all the related changes to the execution model, is an overcomplicated solution to a rather simple problem, and passing things like pooling sizes as values is not the right way to go. In polyhedral terminology, these things are parameters: they are scalars that remain constant throughout the execution. I don't see any other case where we would need to read tensor values (and specialize the TC for each different value) in order to optimize efficiently.

The actual problem is that we want to avoid writing a separate TC for each of avgpool_3x3, avgpool_3x5, avgpool_5x3 etc. It is a well-known problem in languages, and there are different solutions: preprocessor (pytorch bindings do this), templates, generics...

If we consider a templated definition with stupid syntax

def avgpool_<kH>_<kW>(float(B, C, H, W) input) -> (output) {
    output(b, c, h, w) += input(b, c, h + kh, w + kw) where kh in 0:$<kH>, kw in 0:$<kW>
}

it could be instantiated by calling avgpool_3_3 or avgpool_5_5, where TC semantic analysis would substitute actual 3 or 5 values in the code before passing it to range inference.

This approach offers better separation of concerns. Range inference does not need to care about the values of inputs; it is already complex enough without making its outputs depend on the values.

Also, if these parameters were passed as values, the same function would produce different-sized (read: different-typed, for PL folks) tensors depending on the values of its inputs. A function should have a single clear signature. By inserting "template parameters" into the name, we can actually differentiate functions that produce tensors of different shapes, and each shape is dependent only on input shapes and not on input values, which seems like a much saner choice.
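
One possible instantiation mechanism for such a templated definition, sketched in Python (the name-parsing rule and the helper are hypothetical, not part of TC):

```python
# Sketch of "template parameters in the name": calling avgpool_3_3 strips the
# trailing integers off the name and substitutes them into the templated body
# before range inference ever sees it. Everything here is illustrative.
import re

TEMPLATE = (
    "def avgpool_<kH>_<kW>(float(B, C, H, W) input) -> (output) {\n"
    "    output(b, c, h, w) += input(b, c, h + kh, w + kw)"
    " where kh in 0:$<kH>, kw in 0:$<kW>\n"
    "}"
)

def instantiate_from_name(name, template, param_names=("kH", "kW")):
    """avgpool_3_3 -> substitute kH=3, kW=3 into the templated body."""
    values = re.findall(r"_(\d+)", name)
    body = template
    for pname, value in zip(param_names, values):
        body = body.replace("$<{}>".format(pname), value)  # where-clause uses
        body = body.replace("<{}>".format(pname), value)   # name placeholder
    return body

lang = instantiate_from_name("avgpool_3_3", TEMPLATE)
# lang now defines avgpool_3_3 with fully concrete ranges (kh in 0:3).
```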

nicolasvasilache commented 6 years ago

I have read the previous comments and I would like to add some insights about the difference between what we could do and what we should do. The following is a discussion from first principles about some of the requirements we have, with hints at the underlying mathematical reasons that are too long to explain fully here. The discussion follows a structured bottom-up approach going through the 3 Intermediate Representations (IRs) that we use. Always remember that we are treading a fine line between runtime behavior and compile-time behavior that we can decide to shift but we must do so carefully:

  1. at the time of processing in polyhedral IR, we *must* have linear affine accesses of the form Sum(constexpr_int index) (this does NOT mean the input TC must have linear affine accesses, but that is out of the scope of this discussion). Without linear affine accesses at the time of polyhedral analysis, many bad things happen, among which: dependence analysis becomes more complex, scheduling must be conservative and will lose (a lot of) parallelism, and local memory sizes become unknown, which forces terrible alternatives, plus a plethora of other things too long to mention. One potential solution is multi-versioning; let's not go there at this level of abstraction, it will blow up faster than you can imagine. Another solution is to infer and provide a context under which we can statically prove the properties we need; this is doable but much easier said than done, and it has to permeate the whole stack. Regardless of versioning or context inference, many of the remaining HPC compilation issues still remain. These are cold hard mathematical issues; you cannot negotiate with math. As a consequence, and as @ftynse said, the current sane choice is for the polyhedral IR to receive a fully static string with everything instantiated. We can relax some constraints on adding parameters in very specific, well-understood cases, but this is far outside the scope of this discussion;
  2. as a consequence, at the time of processing with Halide IR, we need to receive a program representation and emit a fully constexpr'ed polyhedral IR. At this point you should note that the whole discussion is really about what should come into the Halide IR. At this level of abstraction, both @abadams's suggestion and what we do today (which is captured in @ftynse's suggestion) are valid. We could switch to a view of the world where the Halide IR receives parameters and reads values from tensors / function parameters, but it is opening a can of worms that we need to be aware of:
    1. Halide becomes a critical path in the runtime flow: each time we receive a TC with such parameters, we need to read the values at runtime and generate constexpr'ed polyhedral IR. Different values will trigger new compilations, so Halide now becomes the level that must talk to the compilation cache. This is a major change that I don't think we are ready for at this point.
    2. Following from the previous point, how are we going to deal with uncheckedRun? We have multiple use cases in the 10-100 microsecond latency regime; we have been very careful to avoid (CPU-side) mallocs and to guard against any system call. If you think this is premature optimization, realize that 3FCRelu now runs in 19us for our use case of interest, and we are not even on Volta yet... Any decision to put Halide in the critical path of uncheckedRun should be preceded by a careful overhead analysis. I am very interested in seeing those numbers and would be glad to be proven wrong if my conservativeness is too strict here.
  3. which brings me to the lang / TreeRef IR level: we will have "good" ways of writing TC and "bad" ones. If you want perf, you should template (and we should support a low-overhead mode, probably without going through Halide, to access cached kernels); if you don't care about perf, you can pass parameters. Are we ready for that level of generality? This also reflects the issue of separation of concerns that @ftynse mentioned.

I have left many arguments out for the sake of readability. Please realize that no language decision happens in a vacuum, and that we should carefully lay out the issues at the polyhedral IR level because that is where optimizations happen for now. As we move towards more Halide transformations and automation at that level, things may shift. Let's not fool ourselves: these are hard issues that one cannot half-solve.

We should be open to new suggestions and improvements, but I do not think we are ready for the consequences of this one yet. @abadams if you still want to give it a shot, by all means please do; I'd be more than happy to be proven wrong. I'd just caution that it needs to be a full solution that permeates the whole stack, not just "let's read scalars at runtime for inference". That is the tip, of the tip, of the iceberg.

abadams commented 6 years ago

In the following signature:

def avgpool(float(N) input, int K, int(2) shape) -> (output) {

There's very little distinction between the values of N, K, and shape(0) in the programmer's mind, and in the implementation of everything except for the polyhedral parts. If we continue to insist that N is somehow categorically different to K, people are going to be confused. I argue that this restriction is an unnecessary artifact of the current implementation, and this shouldn't inform the language design.

There is a fourth class of value, shape(i), which clearly cannot form part of bounds inference: it has a free variable in it. So given these four classes:

N, K, shape(0), shape(i)

I propose that the wall between the things that are allowed in bounds inference and the things that are not either goes between shape(0) and shape(i), or in between K and shape(0). You guys seem to be proposing it goes between N and K (and that the value of N is part of the signature but the value of K is not). We can make any of them work by treating things as params. I don't think this is a hack - these things are fixed scalar params to each individual run.

Implementation-wise, my two proposed choices aren't currently different (because K is accessed by reading from a zero-D tensor), but I could see that there may be usability arguments for rejecting shape(0) - it's a simpler language rule. Placing the wall between K and shape(0) is also a little simpler for the current unchecked run, because we can blindly generate cache keys by mashing together all the input shapes plus all scalar params for which some flag is set (the values of zero-D tensors that we have statically recognized to affect bounds).
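
The cache-key scheme just described could look roughly like this (the names and key format are illustrative only, not the actual TC implementation):

```python
# Sketch: build a cache key by mashing together all the input shapes plus the
# values of the scalar params that were statically flagged as affecting
# bounds. A new key triggers a new compilation; a known key hits the cache.
def cache_key(input_shapes, scalar_params, affects_bounds):
    parts = ["x".join(map(str, shape)) for shape in input_shapes]
    parts += [str(v) for name, v in sorted(scalar_params.items())
              if name in affects_bounds]
    return "|".join(parts)

key = cache_key(
    input_shapes=[(1, 1, 4, 4)],
    scalar_params={"kH": 2, "kW": 2},
    affects_bounds={"kH", "kW"},
)
```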

I think it's inevitable that we'll want to generate code for the CPU-side logic (generating cache keys, checking input/output sizes). We have the option of doing an unchecked run in Halide, and nobody ever uses it because it's no faster! This is because we compile all of our bounds checks, and the branch predictor just skips them. The feature only really exists to convince people that the asserts are free. Having every run be checked with no measurable overhead is a major usability benefit that we should aim for. We should also be generating our cache keys with compiled code. Doing both of these lets us allow many more things to affect bounds.

So my complaint is that 2.i is already happening in checked run - we're using Halide expression substitution as a shitty interpreter at runtime, where we should be compiling and running code. If we want to do anything more interesting with the cache keys (e.g. partial specialization), we're going to want compiled CPU code in unchecked run too. But really unchecked run shouldn't exist. It's possible to make checked run zero-measurable-overhead with compilation.

abadams commented 6 years ago

Note that it's straightforward to take a set of Halide Exprs in terms of some tensors and scalar parameters and jit-compile some code that evaluates them, either using our own codegen_llvm or Halide's. This would not be a large number of lines of code.
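
For concreteness, the "compile rather than interpret" point can be sketched in plain Python (the expression, names, and shapes are invented for the sketch; the real flow would jit-compile Halide Exprs via codegen_llvm or Halide's own codegen):

```python
# Minimal illustration of compiling the bounds logic once and reusing it on
# every run, rather than re-interpreting it by expression substitution.
check_src = "H - kH + 1 > 0 and W - kW + 1 > 0"
compiled_check = compile(check_src, "<bounds>", "eval")  # compile once

def checked_run(H, W, kH, kW):
    """Run the (hypothetical) kernel only if the compiled bounds check passes."""
    if not eval(compiled_check, {"H": H, "W": W, "kH": kH, "kW": kW}):
        raise ValueError("bounds check failed")
    # Inferred output spatial size for a kH x kW window, stride 1.
    return (H - kH + 1, W - kW + 1)
```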

nicolasvasilache commented 6 years ago

So I think we have both the top down and the bottom up views summarized here, thanks for your input.

If I put those things together:

> Having every run be checked with no measurable overhead is a major usability benefit that we should aim for

> It's possible to make checked run zero-measurable-overhead with compilation.

> Note that it's straightforward to take a set of Halide Exprs in terms of some tensors and scalar parameters and jit-compile some code that evaluates them

I would say then by all means let's go for it :)

ftynse commented 6 years ago

Anything will confuse users :) The mere fact of us having this discussion is evidence that the choices are not obvious for everybody...

def example(float(N) input, int32 K, int32(2) shape, float M) -> (output)

You guys seem to be proposing it goes between N and K (and that the value of N is part of the signature but the value of K is not)

I do, indeed. And there are two reasons for this proposal:

  1. The definition above can be read as "argument input of type float(N)", "argument K of type int32", ... So there is a categorical difference between N and K, the former being part of a type of input and the latter being a name of an argument. Treating K and N similarly raises a bunch of questions. What is the type of N? Can N be assigned inside the code (K clearly can)? What happens if the user passes -42 as K? Do we treat the floating M the same way as K and N? What would be the difference between scalars (int32) and zero-dimensional tensors (int32()) ?
  2. N can be extracted from tensor metadata while K, if treated as tensor input by the caller, from data. As @abadams mentioned above, it can exist only in GPU memory. In ExecutionEngine, we do pass around DLTensors with null data pointer, and that would have to change.
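
The metadata-vs-data distinction in point 2 can be made concrete with a small sketch (MetaTensor is a made-up stand-in for a DLTensor; nothing here is real TC code):

```python
# Sketch: N is read from the tensor's shape (host-side metadata), while K,
# if passed as a tensor, lives in its data, which may only exist in device
# memory. A null data pointer models the GPU-only case.
class MetaTensor:
    def __init__(self, shape, data=None):
        self.shape = shape
        self.data = data  # None models "data lives only on the GPU"

inp = MetaTensor(shape=(128,))       # float(N) input: N is metadata
k = MetaTensor(shape=(), data=None)  # K as a zero-d tensor: value is data

N = inp.shape[0]  # fine: bounds inference can read this on the host
# Reading K's value would require a device-to-host copy before inference.
```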
skimo-openhub commented 6 years ago

On Wed, Feb 28, 2018 at 08:51:54AM +0000, ftynse wrote:

> Anything will confuse users :) The mere fact of us having this discussion is evidence that the choices are not obvious for everybody...
>
> def example(float(N) input, int32 K, int32(2) shape, float M) -> (output)
>
> > You guys seem to be proposing it goes between N and K (and that the value of N is part of the signature but the value of K is not)
>
> I do, indeed. And there are two reasons for this proposal:
>
> 1. The definition above can be read as "argument input of type float(N)", "argument K of type int32", ... So there is a categorical difference between N and K, the former being part of a type of input and the latter being a name of an argument. Treating K and N similarly raises a bunch of questions. What is the type of N? Can N be assigned inside the code (K clearly can)? What happens if the user passes -42 as K? Do we treat the floating M the same way as K and N? What would be the difference between scalars (int32) and zero-dimensional tensors (int32()) ?
> 2. N can be extracted from tensor metadata while K, if treated as tensor input by the caller, from data. As @abadams mentioned above, it can exist only in GPU memory. In ExecutionEngine, we do pass around DLTensors with null data pointer, and that would have to change.

I agree that N and K (when treated as a parameter) should be of the same kind and so I also feel it's confusing to have one untyped and the other one typed "int32".

On Tue, Feb 27, 2018 at 10:23:56AM +0000, ftynse wrote:

> If we consider a templated definition with stupid syntax
>
> def avgpool_<kH>_<kW>(float(B, C, H, W) input) -> (output) {
>     output(b, c, h, w) += input(b, c, h + kh, w + kw) where kh in 0:$<kH>, kw in 0:$<kW>
> }

This proposal also treats kH differently from the size parameters B, C,... That is, why would I have to write "$" instead of just "kH"?

I also think we should separate the issues of how to expose this to the user and how to implement it internally. That is, even if we "pass" something to a TC function, that doesn't have to mean we have to store it on the GPU.

Perhaps we could have something like

def example(float(N) input, parameter K, parameter(2) shape, float M) -> (output)

(although I would focus on separate parameters first, so not shape for now).

"parameter" is just the first thing that comes to mind. It could be any other name, as long as it is explicit about what it is used for and that, in particular, you would not be able to assign a value inside the TC.

skimo