Open prigoyal opened 6 years ago
also came through #132
Hi everyone (cc @zdevito for language perspective input), I have been thinking about this problem and looking at the code. I wrote down a proposal for it: https://gist.github.com/prigoyal/3c411fcf994c0069a11f9df0733e70f2 and would appreciate any feedback/comments
also cc @salexspb
Why don't you paste the content of that gist in the discussion? It's much better to keep everything in a single place.
WIP: thoughts, comments welcome
Let's say we have two layers, fully connected and ReLU, that we would like to fuse together. We express the fully connected and ReLU layers in TC as below:
def fully_connected(float(B, M) I, float(N, M) W1, float(N) B1) -> (O1) {
O1(b, n) +=! I(b, m) * W1(n, m)
O1(b, n) = O1(b, n) + B1(n)
}
def relu(float(B, N) I) -> (O1) {
O1(b, n) = fmax(I(b, n), 0)
}
Currently, we write the fused fully connected + ReLU layer as below:
def fcrelu(float(B,M) I, float(N,M) W1, float(N) B1) -> (O1) {
O1(b, n) +=! I(b, m) * W1(n, m)
O1(b, n) = O1(b, n) + B1(n)
O1(b, n) = fmax(O1(b, n), 0)
}
This means that TC is a write-only language, i.e. we have to express the operations explicitly by writing them out, whether we fuse or not. It would be nice if we could just write:
def fcrelu(float(B,M) I, float(N,M) W1, float(N) B1) -> (O1) {
O1(b, n) = fully_connected(I(b,m), W1(n,m), B1(n))
O1(b, n) = relu(O1(b, n), 0)
}
OR
def fcrelu(float(B,M) I, float(N,M) W1, float(N) B1) -> (O1) {
O1(b, n) = relu(fully_connected(I(b,m), W1(n,m), B1(n)))
}
The benefits of doing this are:
- Reusing TC definitions instead of copy-pasting code.
- Bridging the gap between TC style and Lush/Torch/PyTorch/TF style of tensor operations.
In order to achieve the above, there are two approaches: 1) string inlining by formatting the strings, or 2) manipulating the ASTs of the TC defs.
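To make approach 1) concrete, here is a hypothetical Python sketch of string inlining; everything here (the stored body templates, the placeholder names) is made up for illustration, and the purely textual substitution already hints at why this approach is fragile:

```python
# Hypothetical sketch of approach 1): TC bodies kept as format strings,
# stitched together by renaming the callee's tensors via plain text
# substitution. All template and placeholder names are illustrative.
FC_BODY = (
    "{out}(b, n) +=! {I}(b, m) * {W}(n, m)\n"
    "{out}(b, n) = {out}(b, n) + {B}(n)"
)
RELU_BODY = "{out}(b, n) = fmax({inp}(b, n), 0)"

def fuse_fc_relu():
    # Paste the two bodies into a single fcrelu definition, routing the
    # fully_connected output straight into relu via the shared name O1.
    body = "\n".join([
        FC_BODY.format(out="O1", I="I", W="W1", B="B1"),
        RELU_BODY.format(out="O1", inp="O1"),
    ])
    return ("def fcrelu(float(B,M) I, float(N,M) W1, float(N) B1) -> (O1) {\n"
            + body + "\n}")

print(fuse_fc_relu())
```

Note that nothing checks that the substituted names are valid identifiers or that shapes line up; the stitching happens on raw text, which is exactly the "prone to errors" problem discussed next.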
While 1) is an easy approach, it is also a hack: it can get ugly and is prone to errors. Approach 2) is the more natural way to handle this properly. Let's see how to solve this with 2). We will consider the TC:
def fcrelu(float(B,M) I, float(N,M) W1, float(N) B1) -> (O1) {
O1(b, n) = fully_connected(I(b,m), W1(n,m), B1(n))
O1(b, n) = relu(O1(b, n), 0)
}
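Before walking through the token types, approach 2) could be pictured with this hypothetical Python toy model (the class and field names are invented, and the identifier rename is done with crude text replacement for brevity; a real implementation would rename identifier tokens in the AST):

```python
# Hypothetical toy model of approach 2): each comprehension is a tiny AST
# node; inlining a call renames the callee's params/outputs to the
# caller's argument names and splices the callee's comprehensions in.
from dataclasses import dataclass

@dataclass
class Comprehension:
    lhs: str   # output tensor name
    rhs: str   # right-hand-side expression, kept as text for brevity

@dataclass
class TcDef:
    name: str
    params: list    # formal parameter names, in order
    outputs: list   # output tensor names
    body: list      # list of Comprehension

def inline_call(callee: TcDef, args: list, result: str) -> list:
    """Return the callee's comprehensions with params/outputs renamed."""
    rename = dict(zip(callee.params, args))
    rename.update({out: result for out in callee.outputs})
    inlined = []
    for c in callee.body:
        lhs = rename.get(c.lhs, c.lhs)
        rhs = c.rhs
        for old, new in rename.items():
            rhs = rhs.replace(old, new)  # crude textual rename (toy only)
        inlined.append(Comprehension(lhs, rhs))
    return inlined

fc = TcDef("fully_connected", ["I", "W1", "B1"], ["O1"], [
    Comprehension("O1", "I(b, m) * W1(n, m)"),   # the +=! reduction
    Comprehension("O1", "O1(b, n) + B1(n)"),
])
# Inline fully_connected(X, W, B) into a caller that writes tensor T.
body = inline_call(fc, ["X", "W", "B"], "T")
for c in body:
    print(c.lhs, "=", c.rhs)
```

The resulting list of comprehensions can then be appended to the caller's comprehension list, which is the substance of the proposal below.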
The proposed mechanism: create a special token type, let's call it TK_CUSTOM_TYPE. A node is marked as a custom type if it makes function calls to other TCs. The node view is similar to the TK_BUILT_IN node type (name, arguments, type), except that the calls are made to some TC operation. Further, we store `returns`, which is the list of return outputs from this comprehension. We then replace the TK_CUSTOM_TYPE node with the list of TK_APPLY nodes; the comprehension containing the TK_CUSTOM_TYPE becomes a list of comprehensions, which should then be appended to the original comprehension list. For each TK_CUSTOM_TYPE, we retrieve the TC def of the custom type from some source (libraries.h, for example). We parse that TC def but replace the names of its params/idents with the argument idents of the TK_CUSTOM_TYPE node, and similarly for the output. We get the list of comprehensions of this parsed function and append that list to the Def node.

On Tue, May 08, 2018 at 08:57:07PM +0000, Priya Goyal wrote:
Design: Support for Nested Function calls in TC
> def fcrelu(float(B,M) I, float(N,M) W1, float(N) B1) -> (O1) {
>     O1(b, n) = fully_connected(I(b,m), W1(n,m), B1(n))
This looks as if fully_connected is performing a scalar operation. I would expect a call like O = fully_connected(I, W1, B1) for the fully_connected defined above.
skimo
Actually, this O1(b, n) = relu(O1(b, n), 0) too looks like a scalar operation. It applies relu to the element of O1 at (b, n).
This is a very useful feature indeed. A couple of comments.
> This means that TC is write-only language
We have a different definition of write-only, so let's not abuse the term :) TC does not support reuse of functions natively (favoring code copy-pasting), but it is designed to be readable, which is the opposite of write-only for me.
> Bridging the gap between TC style and Lush/Torch/PyTorch/TF style of tensor operations.
I'd need to see an example of that style. We don't want TC to end up as yet another framework.
> string inlining by formatting the strings
We claim we are doing better than other tools because we don't do string stitching :)
> Create a special Token Type, let's call it TK_CUSTOM_TYPE.
Yep, let's call it TK_CALL and transform some TK_APPLY instances to it.
> retrieve the TC def of the custom type from some source
This is mostly irrelevant here. Assume a program that has all the relevant definitions lexically before the first calls. We can think about modules and libraries once the base version works.
And a couple of questions:
- what happens in expressions like O(i,j) = A(i,j) + matmul(B(i,k),C(k,j))? You cannot just replace that by a list in the AST; the expected type is an expression rather than a list at the point where matmul is called.
- relu(fully_connected(I(b,m), W1(n,m), B1(n))): how is this different from fabs(fully_connected(I(b,m), W1(n,m), B1(n))), syntactically or semantically? Note that fabs is applied pointwise while relu is not.

@ftynse
> what happens if the function you call has multiple outputs, some of which may be temporaries? you can only assign one tensor in one expression.
We should be fine with just implementing temporary support with a custom allocator and then having a function that parses and queries the temporary size. We can then allocate scratch space (with the delegated allocator that won't call cudaMalloc) and put temporaries in there. I think @zdevito made this point a few months ago. But we will prob. also need to support things like:
(O1, O2, O3) = xxx();
and have extra support for naming "hidden" temporaries and multiple output / multiple input which will get us into fun error reporting situations..
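The delegated-allocator idea above could be sketched as follows; this is a hypothetical toy (the class name, sizes, and alignment are made up), showing how one scratch buffer is reserved once and temporaries receive offsets into it instead of individual cudaMalloc calls:

```python
# Hypothetical sketch: bump-pointer sub-allocation over one preallocated
# scratch buffer, so per-temporary allocation never hits the real
# device allocator. Names, sizes, and alignment are illustrative only.
class ScratchAllocator:
    def __init__(self, total_bytes: int):
        self.total = total_bytes
        self.offset = 0

    def alloc(self, nbytes: int, align: int = 256) -> int:
        # Round the current offset up to the alignment boundary, then
        # hand out that offset and advance the bump pointer.
        start = (self.offset + align - 1) // align * align
        if start + nbytes > self.total:
            raise MemoryError("scratch buffer exhausted")
        self.offset = start + nbytes
        return start

# Suppose parsing the fused TC reported two temporaries of these sizes:
temp_sizes = [4096, 1000]
scratch = ScratchAllocator(total_bytes=8192)
offsets = [scratch.alloc(n) for n in temp_sizes]
print(offsets)  # [0, 4096]
```

The compiler-side query ("what is the total temporary size?") runs first; only then is the single scratch buffer allocated and this sub-allocator seeded with it.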
> how do you apply a function to a slice of a tensor?
We just need strided tensor support, a slice of a tensor is a metadata operation. Coming up with good semantics that are not too framework-y is another story. Reusability vs implicit semantics is a tradeoff (think broadcast and reduce semantics .. gahhhh). For one I would much rather lose reusability than see implicit semantics.
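The "slice is a metadata operation" point can be illustrated with a hypothetical sketch for the 2-D row-major case (the function name and ranges are invented): the storage is untouched, and only the offset and shape change while the strides carry over.

```python
# Hypothetical sketch: slicing a strided 2-D tensor changes only
# metadata (offset, shape); strides and storage are untouched, no copy.
def slice_view(shape, strides, offset, row_range, col_range):
    r0, r1 = row_range
    c0, c1 = col_range
    new_offset = offset + r0 * strides[0] + c0 * strides[1]
    new_shape = (r1 - r0, c1 - c0)
    return new_shape, strides, new_offset

# A 4x6 row-major tensor has strides (6, 1). Take rows 1..3, cols 2..5.
shape, strides, offset = slice_view((4, 6), (6, 1), 0, (1, 3), (2, 5))
print(shape, strides, offset)  # (2, 3) (6, 1) 8
```

With strided tensor support in the backend, a called TC could simply receive this view's metadata, which is why the hard part is the semantics, not the mechanics.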
> why do iteration variables appear in the calls at all, if calls operate on entire tensors?
That's a pretty tricky one .. do we want to support operations on elements, matrices or tensors. Extrapolate, what does it mean to multiply 2 tensors? Then you get in the business of having to name your functions to express their intent in their name (conv2dNCHW) and poof you're back to being a framework and you have lost the niceness of your explicit mathematical language. If we get too extreme in that direction, we might as well do an ONNX -> ISL and save us some trouble ..
> what happens in expressions like O(i,j) = A(i,j) + matmul(B(i,k),C(k,j))? You cannot just replace that by a list in the AST; the expected type is an expression rather than a list at the point where matmul is called.
This one feels like the type of transformation I was mentioning earlier which I would now qualify as expression propagation. It comes with potential tradeoffs between storage and compute complexity. Expressed as is it is also related to function inlining but it has more flavor to it.
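One possible lowering for a call in expression position, sketched hypothetically in Python (the function, the `{call}` placeholder convention, and the temporary name are all made up): introduce a fresh temporary, emit the callee's reduction into it, and substitute the temporary into the surrounding expression. This materializes storage for the temporary, which is exactly the storage-vs-compute tradeoff mentioned above.

```python
# Hypothetical sketch of "expression propagation" as a lowering step:
# a call embedded in an expression is split out into its own reduction
# comprehension writing a fresh temporary. Names are illustrative only.
def lower_call_in_expr(lhs, expr_template, callee_out_expr, tmp="T0"):
    """expr_template contains '{call}' where the call appeared."""
    temp_comprehension = f"{tmp}(i, j) +=! {callee_out_expr}"
    main_comprehension = (
        f"{lhs}(i, j) = " + expr_template.format(call=f"{tmp}(i, j)")
    )
    return [temp_comprehension, main_comprehension]

# O(i,j) = A(i,j) + matmul(B(i,k), C(k,j)) lowers to two comprehensions:
for line in lower_call_in_expr("O", "A(i, j) + {call}", "B(i, k) * C(k, j)"):
    print(line)
```

An optimizer could instead propagate the expression inline where the temporary is consumed only once, trading the scratch storage for possible recomputation when it is consumed several times.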
> relu(fully_connected(I(b,m), W1(n,m), B1(n))): how is this different from fabs(fully_connected(I(b,m), W1(n,m), B1(n))), syntactically or semantically? Note that fabs is applied pointwise while relu is not.
Well, all the interesting problems about temporaries etc. are hidden by the fact that the example chosen is trivial and can be expressed as a simple map combinator. IMO something like fc(relu(fc(I(xxx), W1(xxx), B1(xxx), xxx, xxx)), where I don't know yet what xxx is (if anything), is more interesting.
> how do you handle recursive calls?
Don't? Please, no stack on the GPU :p
There is some interest in being able to do nested function calls i.e. define one TC and call it from inside another one.
This makes a lot of sense as well for cases where you want to use a defined database of TCs and just make calls into it.
[EDIT]: pasting here @leonbottou 's remarks: