Could you provide an example implementation of the primal gradients?
Hi @MarisaKirisame, can you suggest some fast-path background readings to understand AD implementation in Relay? Besides the "Demystifying" paper?
@masahi it is the fastest path.
It uses delimited continuations, so it is confusing, but it can be explained without them.
Essentially, for every Double expression d, we transform it to an expression of type (Double, Ref Double). It is a tuple which holds two values: the original value, and the gradient of that value.
There is also a global Ref (() -> ()) (a reference to a function that takes no arguments and produces an empty tuple) called backward, which is responsible for reading the gradient and writing it upstream.
Take the expression x + y, where x and y are subexpressions, for example. We will:
0: transform x and y into the pairs (x, xref) and (y, yref)
1: generate
let zref = new 0;
let old_backward = !backward;
let new_backward = fun() { xref <- !xref + !zref; yref <- !yref + !zref; old_backward() };
backward <- new_backward;
(x + y, zref)
There will be wrapper code which initializes backward to a function that does nothing, and converts between Double and (Double, Ref Double).
It is essentially the same for tensors.
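To make this concrete, here is a minimal, self-contained Python sketch of the same technique (illustrative only, not Relay code; all names here are made up): every value is paired with a mutable gradient cell, and a global backward closure is extended at each operation so that calling it propagates gradients from outputs back to inputs.

```python
class Cell:
    """A mutable reference holding a gradient value."""
    def __init__(self, value=0.0):
        self.value = value

backward = lambda: None  # initially does nothing

def lift(x):
    """Wrap a plain number as (value, gradient cell)."""
    return (x, Cell())

def add(a, b):
    """Transformed x + y: compute the value and extend `backward`."""
    global backward
    (x, xref), (y, yref) = a, b
    zref = Cell()
    old_backward = backward
    def new_backward():
        xref.value += zref.value  # d(x+y)/dx = 1
        yref.value += zref.value  # d(x+y)/dy = 1
        old_backward()
    backward = new_backward
    return (x + y, zref)

# Usage: differentiate z = x + y at x = 2, y = 3.
a, b = lift(2.0), lift(3.0)
z, zref = add(a, b)
zref.value = 1.0  # seed the output gradient
backward()
print(a[1].value, b[1].value)  # 1.0 1.0
```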
@tqchen As per the request, the signature on the C++ side is runtime::TypedPackedFunc<Expr()>. I have updated it above as well.
@MarisaKirisame can you give a specific comment on how this FGradient is expected to be used?
TypedPackedFunc<Expr()> does not really make sense because it does not take in input expressions.
I think the signature should be like the following:
TypedPackedFunc<Array<Expr> (Expr orig_call, Expr out_grad)>
@tqchen It is a function expression taking orig_call, and it will return a tuple; the right-hand side will take out_grad. (It is a function in Relay but not a function in C++.)
I see what you mean; it is possible to express a simple function's gradient purely in Relay. So conceptually, it is more like:
TypedPackedFunc<RelayFunc<Tuple<Expr>(Expr out_grad)> (Expr orig_call)>
Is there any specific motivation to actually construct the gradient in this way? Can we discuss the pros/cons of the alternatives? Note that we will need to have a C++ function that takes in orig_call, because we need to handle the attributes of primitive calls, like axis in sum(x, axis=1).
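For illustration, here is a hedged Python sketch of how per-operator gradients could look under the proposed Array<Expr>(Expr orig_call, Expr out_grad) signature. Call, register_gradient, and GRAD_REGISTRY are toy stand-ins, not the actual Relay API; the sum case shows why the gradient function needs access to the original call's attributes (e.g. axis).

```python
from collections import namedtuple

# Toy stand-in for a Relay call node: operator name, argument expressions, attributes.
Call = namedtuple("Call", ["op", "args", "attrs"])

GRAD_REGISTRY = {}  # operator name -> gradient function

def register_gradient(op_name):
    def wrap(fn):
        GRAD_REGISTRY[op_name] = fn
        return fn
    return wrap

@register_gradient("multiply")
def multiply_grad(orig_call, out_grad):
    # One gradient expression per input:
    # d(x*y)/dx = out_grad * y, d(x*y)/dy = out_grad * x.
    x, y = orig_call.args
    return [Call("multiply", [out_grad, y], None),
            Call("multiply", [out_grad, x], None)]

@register_gradient("sum")
def sum_grad(orig_call, out_grad):
    # The gradient function sees the whole original call, so it can read
    # attributes such as `axis` -- the point about sum(x, axis=1) above.
    (x,) = orig_call.args
    axis = orig_call.attrs["axis"]
    return [Call("broadcast_like", [out_grad, x], {"axis": axis})]

# Usage, with string placeholders standing in for expressions:
grads = GRAD_REGISTRY["multiply"](Call("multiply", ["x", "y"], None), "dz")
print(grads)  # one new gradient expression per input of the original call
```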
Out of curiosity, have you ever considered, rather than having gradient functions (which usually define reverse-mode AD), having lambdas which construct the whole Jacobian operator? This allows automatically getting both forward- and reverse-mode AD. Of course one would not explicitly store the whole Jacobian, but rather construct the linear operator. E.g. for matmul:
z = matmul(x, y)
jacobian_dz_dx = implicit_kron(y, identity)
jacobian_dz_dy = implicit_kron(identity, transpose(x))
Here implicit_kron should not represent an actual operator that constructs the Kronecker product, but rather define an object that can efficiently compute the action of the linear operator.
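A small numpy sketch of this idea (not Relay code, and the class name is made up): the Jacobian of z = x @ y with respect to x is exposed as an implicit linear operator whose matvec gives forward-mode (JVP) products and whose rmatvec gives reverse-mode (VJP) products, so the Kronecker-structured Jacobian is never materialized.

```python
import numpy as np

class MatmulJacobianWrtX:
    """Implicit linear operator for d(x @ y)/dx, for x of shape (3, 4), y of shape (4, 5)."""
    def __init__(self, y):
        self.y = y

    def matvec(self, dx):
        # Forward mode: directional derivative of z = x @ y along dx.
        return dx @ self.y

    def rmatvec(self, dz):
        # Reverse mode: pull an output cotangent dz back to a gradient w.r.t. x.
        return dz @ self.y.T

y = np.random.randn(4, 5)
J = MatmulJacobianWrtX(y)
print(J.matvec(np.ones((3, 4))).shape)   # (3, 5): forward-mode product
print(J.rmatvec(np.ones((3, 5))).shape)  # (3, 4): reverse-mode gradient w.r.t. x
```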
First-order AD can be easily optimized and does not require any further language features beyond gradients for operators. Higher-order AD in the manner of demyst would require us to add OCaml-style references to Relay, which we plan to submit soon as a PR.
@MarisaKirisame could you clarify this point? It seems very counterintuitive that somehow first-order derivatives are "more special" than any other order. Technically, differentiation should be applicable to any graph, regardless of whether it comes from differentiation or not.
@botev Higher order here means AD on a higher-order language. It is indeed true that higher-order AD of a first-order program needs no further features. We value the closure property (the program before/after AD is in the same language), so whatever AD algorithm you choose, we can hopefully apply it multiple times.
Sorry for the confusion, I have updated the RFC.
We have not thought about supporting forward mode as of right now, as our main priority is to get backpropagation working.
Maybe this is not in the scope of this RFC, or extra work needs to be added, but can this feature be extended to support AD in imperative programming? NNVM is able to achieve that purpose, and that's how it works in the MXNet/Gluon APIs. Just want to make sure this design has taken imperative programming into consideration.
Eventually, can we achieve the UX as in https://github.com/HIPS/autograd?
@reminisce For now we only support closed terms, but we are moving toward that. What do you mean by imperative programming? Can you explain more? I am confused.
@MarisaKirisame Sorry for the confusion. Here is an introduction to imperative and symbolic programming in MXNet.
Basically, imperative programming means the operators are executed one by one, so users can still get the intermediate layer outputs of a model. In the AD PR, it seems that the current interface (Evaluator) optimizes the forward and backward graphs together, so users won't be able to get intermediate outputs of the forward graph.
The ideal use case would be: as the program invokes operators one by one, a forward graph is constructed on the fly. When backpropagation starts, it first constructs a backward graph and then runs that backward graph independently. Any idea how to use Relay to realize this?
@reminisce This is an orthogonal problem. We are working on converting a define-by-run (imperative) model to a define-then-run (symbolic) model.
My understanding is that the current signature supports imperative AD as well.
@MarisaKirisame Thanks. That's what I thought. No pressure to put it in this PR. Has an RFC been submitted?
@tqchen Could you elaborate on your thoughts? My understanding is that we will need to attach node info to NDArray in the forward pass to construct a Relay AST at runtime, and then create a backward AST for running AD. Thanks.
The current AD signature, Array<Expr>(Expr orig_call, Expr out_grad), constructs the gradient AST for a given input.
Imperative AD can make use of this signature along with a high-level taping structure (possibly attached to the framework's NDArray) to get the gradient, so you don't need another separate signature for imperative AD.
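A small self-contained Python sketch of this taping idea (Box, mul, and the tape are illustrative names, not framework API): the forward pass records each op call on a tape, and backpropagation replays the tape in reverse using the same per-op (original inputs, out_grad) -> input-gradients functions.

```python
tape = []  # records (grad_fn, inputs, output) in execution order

class Box:
    """A value with a gradient slot, standing in for a framework NDArray."""
    def __init__(self, value):
        self.value, self.grad = value, 0.0

def mul(a, b):
    out = Box(a.value * b.value)
    # Per-op gradient in the (original inputs, out_grad) -> input grads shape.
    def grad_fn(out_grad):
        return [out_grad * b.value, out_grad * a.value]
    tape.append((grad_fn, [a, b], out))
    return out

def backward(out):
    out.grad = 1.0  # seed the output gradient
    for grad_fn, inputs, output in reversed(tape):
        for inp, g in zip(inputs, grad_fn(output.grad)):
            inp.grad += g

# Usage: z = x * y; intermediate values stay inspectable between op calls.
x, y = Box(2.0), Box(3.0)
z = mul(x, y)
backward(z)
print(x.grad, y.grad)  # 3.0 2.0
```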
@tqchen Thanks for the answer. I will take a look.
@reminisce Not yet. It is still WIP.
AD is finished in #2496.
This RFC aims to pave the road to add automatic differentiation (AD) to Relay, both a first-order and a higher-order algorithm. An AD algorithm calculates the gradient of a program, which is needed for training (back propagation). Additionally, higher-order gradients are sometimes needed for other optimization algorithms such as Newton's method.
Because Relay supports closures and control flow, and there are further plans to add features like algebraic data types (#2175), our implementation of AD must support backpropagation over these constructs. For the implementation, we plan to closely follow "Demystifying Differentiable Programming." At a high level, our approach generates a graph at runtime and runs reverse-mode automatic differentiation on it as if it were a static graph. This algorithm easily supports closures, control flow, and ADTs, and it can be extended to account for further language features.
Differentiating programs that include operators requires us to register implementations of gradients for the different operators, which we will include as attributes. More specifically, for an operator
f : <x0, x1, x2, ...> -> y
, the gradient of the operator has the form <x0, x1, x2, ...> -> <y, y -> <x0, x1, x2, ...>>.
Its signature in C++ is runtime::TypedPackedFunc<Expr()>. The signature for this attribute is open for discussion: we can scale other forms to the higher-order case as well. First-order AD can be easily optimized and does not require any further language features beyond gradients for operators. AD on higher-order programs in the manner of the Demystifying paper would require us to add OCaml-style references to Relay, which we plan to submit soon as a PR.
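To make the <y, y -> <x0, x1, x2, ...>> form above concrete, here is a plain-Python sketch for multiplication (illustrative only, not Relay code): the gradient of an operator returns the original output together with a closure mapping an output gradient back to the input gradients.

```python
def multiply_with_grad(x, y):
    out = x * y
    def backprop(out_grad):
        # d(x*y)/dx = y, d(x*y)/dy = x
        return (out_grad * y, out_grad * x)
    return out, backprop

z, backprop = multiply_with_grad(2.0, 3.0)
print(z, backprop(1.0))  # 6.0 (3.0, 2.0)
```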
We would appreciate the community's feedback on this outline for implementing automatic differentiation in Relay. We welcome and would be glad to respond to any comments regarding further details about the implementation of AD and the necessary steps to incorporate it into Relay. Thanks @slyubomirsky for helping me write this RFC.