prigoyal opened this issue 6 years ago
Those inputs are 1-D, not scalar. I think you want:
```
def avgpool(float(B, C, H, W) input, float kH, float kW, float sH, float sW) -> (output) {
    output(b, c, h, w) += input(b, c, h * sH + kh, w * sW + kw) where kh in 0:kH, kw in 0:kW
}
```

```
def avgpool(float(B, C, H, W) input, float kH, float kW) -> (output) {
    output(b, c, h, w) += input(b, c, h + kh, w + kw) where kh in 0:kH, kw in 0:kW
}
```
yep, thanks for catching, updating the description :)
For more context, copy-pasting some notes from our internal discussion on this:
Andrew: "The issue is that it requires us to do compute outside of the scope where we allocate outputs. E.g. we need to load from input tensors before we allocate outputs. But input tensors may only exist in device memory, etc..."
Andrew: "It requires a change to our model of computation."
Andrew: "One potential approach is to special-case zero-dimensional tensors. But that probably requires a bunch of pytorch plumbing changes to treat them as actual scalars instead of zero-d tensors and it would be disappointing/surprising if you can use t() in a where clause but not t(3) E.g. you may want to pass the pooling shape as a 1-d tensor with two entries"
Those PyTorch plumbing changes have mostly already happened and 0.4 will have first-class scalars
By pytorch plumbing changes I meant changes within TC's flow from inputs into the kernel more generally. Even with those plumbing changes, I'd consider a type-system difference between t() and t(3) to be a language wart. I think we just need to accept that bounds evaluation will potentially need to load values from tensors as well as inspect their shapes (which is what it currently does).
I think that allowing inference to read the values of the inputs, and all the related changes to the execution model, is an overcomplicated solution to a rather simple problem, and passing things like pooling sizes as values is not the right way to go. In polyhedral terminology, these things are parameters: they are scalars that remain constant throughout the execution. I don't see any other case where we would need to read tensor values (and specialize the TC for each different value) that we can efficiently optimize.
The actual problem is that we want to avoid writing a separate TC for each of avgpool_3x3, avgpool_3x5, avgpool_5x3, etc. This is a well-known problem in programming languages, and there are different solutions: a preprocessor (the pytorch bindings do this), templates, generics...
If we consider a templated definition with stupid syntax

```
def avgpool_<kH>_<kW>(float(B, C, H, W) input) -> (output) {
    output(b, c, h, w) += input(b, c, h + kh, w + kw) where kh in 0:$<kH>, kw in 0:$<kW>
}
```

it could be instantiated by calling `avgpool_3_3` or `avgpool_5_5`, where TC semantic analysis would substitute the actual 3 or 5 values in the code before passing it to range inference.
This approach offers better separation of concerns. Range inference does not need to care about values of inputs. It is already complex enough to make its outputs depend on the values... Also, if these parameters were passed as values, this would mean that the same function produces different-sized (read different-typed for PL folks) tensors depending on the values of inputs. A function should have a clear single signature. By inserting "template parameters" in the name, we can actually differentiate functions that produce tensors of different shape, and this shape is only dependent on input shapes and not on input values, which seems like a much saner choice.
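To make the instantiation step concrete, here is a rough Python sketch of the kind of substitution TC semantic analysis could perform; the helper names and the name-mangling convention are assumptions for illustration, not actual TC code:

```python
# Sketch: instantiate the templated avgpool definition above by parsing the
# mangled call name (e.g. "avgpool_3_3") and substituting the recovered
# values for <kH>/<kW> before the source reaches range inference.
TEMPLATE = (
    "def avgpool_<kH>_<kW>(float(B, C, H, W) input) -> (output) {\n"
    "    output(b, c, h, w) += input(b, c, h + kh, w + kw)"
    " where kh in 0:$<kH>, kw in 0:$<kW>\n"
    "}\n")

def instantiate(call_name):
    # "avgpool_3_3" -> kH = "3", kW = "3"
    values = call_name[len("avgpool_"):].split("_")
    src = TEMPLATE
    for name, value in zip(["kH", "kW"], values):
        src = src.replace("$<%s>" % name, value)  # bound: 0:$<kH> -> 0:3
        src = src.replace("<%s>" % name, value)   # name: avgpool_<kH>_<kW> -> avgpool_3_3
    return src

print(instantiate("avgpool_3_3"))  # range inference now sees constant bounds
```

After substitution, the instantiated source mentions no non-shape values at all, so range inference can stay exactly as it is today.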
I have read the previous comments and I would like to add some insights about the difference between what we could do and what we should do. The following is a discussion from first principles about some of the requirements we have, with hints at the underlying mathematical reasons that are too long to explain fully here. The discussion follows a structured bottom-up approach going through the 3 Intermediate Representations (IRs) that we use. Always remember that we are treading a fine line between runtime behavior and compile-time behavior, which we can decide to shift, but we must do so carefully:

- `uncheckedRun`: we have multiple use cases in the 10-100 microsecond latency regime; we have been very careful to avoid (CPU-side) mallocs and try to guard against any system call. If you think this is premature optimization, realize that 3FCRelu now runs in 19us for our use case of interest, and we are not even on Volta yet... Any decision to put Halide in the critical path of `uncheckedRun` should be preceded by a careful overhead analysis. I am very interested in seeing those numbers and would be glad to be proven wrong if my conservativeness is too strict here.

I have left many arguments out for the sake of readability. Please realize that no language decision happens in a vacuum, and that we should carefully lay out the issues at the polyhedral IR level, because that is where optimizations happen for now. As we move towards more Halide transformations and automation at that level, things may shift. Let's not fool ourselves: these are hard issues that one cannot half-solve.
We should be open to new suggestions and improvements, but I do not think we are ready for the consequences of this one yet. @abadams if you still want to give it a shot, by all means please do; I'd be more than happy to be proven wrong. I'd just caution that it needs to be a full solution that permeates the whole stack, not just "let's read scalars at runtime for inference". That is the tip, of the tip, of the iceberg.
In the following signature:
```
def avgpool(float(N) input, int K, int(2) shape) -> (output) {
```
There's very little distinction between the values of N, K, and shape(0) in the programmer's mind, and in the implementation of everything except for the polyhedral parts. If we continue to insist that N is somehow categorically different to K, people are going to be confused. I argue that this restriction is an unnecessary artifact of the current implementation, and this shouldn't inform the language design.
There is a fourth class of value, shape(i), which clearly cannot form part of bounds inference: it has a free variable in it. So given these four classes:

N, K, shape(0), shape(i)

I propose that the wall between the things that are allowed in bounds inference and the things that are not goes either between shape(0) and shape(i), or between K and shape(0). You guys seem to be proposing it goes between N and K (and that the value of N is part of the signature but the value of K is not). We can make any of them work by treating things as params. I don't think this is a hack: these things are fixed scalar params to each individual run.
Implementation-wise, my two proposed choices aren't currently different (because K is accessed by reading from a zero-D tensor), but I could see there being usability arguments for rejecting shape(0): it's a simpler language rule. Placing the wall between K and shape(0) is also a little simpler for the current unchecked run, because we can blindly generate cache keys by mashing together all the input shapes plus all scalar params for which some flag is set (the values of zero-D tensors that we have statically recognized to affect bounds).
I think it's inevitable that we'll want to generate code for the CPU-side logic (generating cache keys, checking input/output sizes). We have the option of doing an unchecked run in Halide, and nobody ever uses it because it's no faster! This is because we compile all of our bounds checks, and the branch predictor just skips them. The feature only really exists to convince people that the asserts are free. Having every run be checked with no measurable overhead is a major usability benefit that we should aim for. We should also be generating our cache keys with compiled code. Doing both of these lets us allow many more things to affect bounds.
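As a hedged sketch of the "blind" cache-key scheme described above (all names are illustrative; this is not the actual TC implementation):

```python
# Sketch: build a compilation-cache key from all input shapes plus every
# scalar param that static analysis has flagged as affecting bounds.
def make_cache_key(input_shapes, scalar_params, affects_bounds):
    """input_shapes: list of shape tuples, e.g. [(32, 3, 224, 224)]
    scalar_params: dict name -> value, e.g. {"K": 3, "alpha": 0.1}
    affects_bounds: names flagged as feeding bounds inference, e.g. {"K"}"""
    shape_part = tuple(tuple(s) for s in input_shapes)
    scalar_part = tuple(sorted(
        (name, value) for name, value in scalar_params.items()
        if name in affects_bounds))
    return (shape_part, scalar_part)

key = make_cache_key([(32, 3, 224, 224)], {"K": 3, "alpha": 0.1}, {"K"})
# alpha is excluded from the key: a compiled kernel is reused across alpha
# values, but a new K triggers a fresh specialization.
```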
So my complaint is that 2.i is already happening in checked run - we're using Halide expression substitution as a shitty interpreter at runtime, where we should be compiling and running code. If we want to do anything more interesting with the cache keys (e.g. partial specialization), we're going to want compiled CPU code in unchecked run too. But really unchecked run shouldn't exist. It's possible to make checked run zero-measurable-overhead with compilation.
Note that it's straightforward to take a set of Halide Exprs in terms of some tensors and scalar parameters and jit-compile some code that evaluates them, either using our own codegen_llvm or Halide's. This would not be a large number of lines of code.
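A minimal illustration of that claim, using Python's own bytecode compiler as a stand-in for codegen_llvm/Halide codegen (the expression strings and names are assumptions for the sketch):

```python
# Sketch: given symbolic bounds expressions over input extents and scalar
# params, produce a compiled evaluator instead of re-interpreting
# expression trees on every run.
def compile_bounds_evaluator(exprs):
    """exprs: dict mapping output-extent names to expression strings,
    e.g. {"H_out": "H - kH + 1"}."""
    code = {name: compile(expr, "<bounds>", "eval")
            for name, expr in exprs.items()}
    def evaluate(env):
        # env maps extent/param names (H, kH, ...) to concrete values;
        # each expression runs as compiled bytecode, not a tree walk.
        return {name: eval(c, {"__builtins__": {}}, dict(env))
                for name, c in code.items()}
    return evaluate

evaluate = compile_bounds_evaluator(
    {"H_out": "H - kH + 1", "W_out": "W - kW + 1"})
print(evaluate({"H": 224, "W": 224, "kH": 3, "kW": 3}))
# -> {'H_out': 222, 'W_out': 222}
```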
So I think we have both the top down and the bottom up views summarized here, thanks for your input.
If I put those things together:
> Having every run be checked with no measurable overhead is a major usability benefit that we should aim for

> It's possible to make checked run zero-measurable-overhead with compilation.

> Note that it's straightforward to take a set of Halide Exprs in terms of some tensors and scalar parameters and jit-compile some code that evaluates them
I would say then by all means let's go for it :)
Anything will confuse users :) The mere fact of us having this discussion is evidence that the choices are not obvious for everybody...
```
def example(float(N) input, int32 K, int32(2) shape, float M) -> (output)
```
> You guys seem to be proposing it goes between N and K (and that the value of N is part of the signature but the value of K is not)
I do, indeed. And there are two reasons for this proposal:
- The definition above can be read as "argument `input` of type `float(N)`", "argument `K` of type `int32`", ... So there is a categorical difference between `N` and `K`: the former is part of the type of `input`, while the latter is the name of an argument. Treating K and N similarly raises a bunch of questions. What is the type of `N`? Can `N` be assigned inside the code (`K` clearly can)? What happens if the user passes `-42` as `K`? Do we treat the floating-point `M` the same way as `K` and `N`? What would be the difference between scalars (`int32`) and zero-dimensional tensors (`int32()`)?
- `N` can be extracted from tensor metadata, while `K`, if treated as a tensor input by the caller, has to be read from data. As @abadams mentioned above, it can exist only in GPU memory. In `ExecutionEngine`, we do pass around `DLTensor`s with a null data pointer, and that would have to change.
I agree that N and K (when treated as a parameter) should be of the same kind and so I also feel it's confusing to have one untyped and the other one typed "int32".
On Tue, Feb 27, 2018 at 10:23:56AM +0000, ftynse wrote:

> If we consider a templated definition with stupid syntax
>
> ```
> def avgpool_<kH>_<kW>(float(B, C, H, W) input) -> (output) {
>     output(b, c, h, w) += input(b, c, h + kh, w + kw) where kh in 0:$<kH>, kw in 0:$<kW>
> }
> ```
This proposal also treats kH differently from the size parameters B, C,...
That is, why would I have to write "$<kH>" for the former but not "$<B>" for the latter?
I also think we should separate the issues of how to expose this to the user and how to implement it internally. That is, even if we "pass" something to a TC function, that doesn't have to mean we have to store it on the GPU.
Perhaps we could have something like

```
def example(float(N) input, parameter K, parameter(2) shape, float M) -> (output)
```

(although I would focus on separate parameters first, so not `shape` for now).
"parameter" is just the first thing that comes to mind. It could be any other name, as long as it is explicit about what it is used for and makes clear that, in particular, you would not be able to assign a value to it inside the TC.
We can't use scalar inputs in bounds inference right now. So, for example, a definition like the following, which uses the scalars kH and kW in a where clause, will fail:
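```
def avgpool(float(B, C, H, W) input, float kH, float kW) -> (output) {
    output(b, c, h, w) += input(b, c, h + kh, w + kw) where kh in 0:kH, kw in 0:kW
}
```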
The workaround right now is to do proper substitution for those scalars in the TC before passing them to the backend.
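Concretely, the workaround is plain string substitution on the host, in the same spirit as the template-instantiation sketch earlier in the thread; a minimal Python sketch (the `<kH>`/`<kW>` placeholder convention and the helper name are assumptions):

```python
# Sketch of the current workaround: bake concrete scalar values into the TC
# source text before the backend (and range inference) ever sees it.
AVGPOOL = """
def avgpool(float(B, C, H, W) input) -> (output) {
    output(b, c, h, w) += input(b, c, h + kh, w + kw) where kh in 0:<kH>, kw in 0:<kW>
}
"""

def avgpool_with(kH, kW):
    # Each distinct (kH, kW) pair yields a separate TC source string and
    # hence a separately compiled kernel.
    return AVGPOOL.replace("<kH>", str(kH)).replace("<kW>", str(kW))

src = avgpool_with(3, 3)
print(src)  # hand `src` to the TC backend; the bounds are now constants
```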