JuliaDiff / ForwardDiff.jl

Forward Mode Automatic Differentiation for Julia

source code transformation #1

Closed mlubin closed 10 years ago

mlubin commented 11 years ago

I'd like to propose using source code transformation instead of operator overloading. Operator overloading is slow in general and not really optimized in Julia; the compiler doesn't seem to optimize away the temporary objects, and the GC isn't great at handling them.

On the other hand, Julia makes implementing source code transformation much easier than in other languages. The AST of a function is (mostly) user accessible:

julia> x(t) = 2t

julia> methods(x).defs.func.code
AST(:($(Expr(:lambda, {:t}, {{}, {{:t, :Any, 0}}, {}}, quote  # none, line 1:
        return *(2,t)
    end))))

The transformed functions can be compiled with the JIT, so that AD incurs only a one-time cost. With this approach, AD in Julia will be very competitive with the state of the art.

Of course, development time is limited, and I'm not sure how much I'll be able to work on this in the short term. It is certainly reasonable to go forward with an operator overloading approach just to get something working and to have a basis for comparison.

papamarkou commented 11 years ago

This is helpful feedback, Miles. It is a matter of time and knowledge. We could start with operator overloading, which seems faster to set up, and once we have the main functionality available, we can then look into optimization via source code transformation. For instance, I need access to a book, and at the moment my uni's library has none on automatic differentiation (I ordered Griewank's book via the library). If in the meantime someone more experienced with AD has the time to implement source code transformation, they are more than welcome to contribute :)

tshort commented 11 years ago

Miles, why should operator overloading be slow in this case? I understand why it was when we were talking about manipulating expressions, but here, everything should be compiled by the JIT.

Based on my reading, I think that source transformation can allow more optimizations. I'm a newbie at this, too.

mlubin commented 11 years ago

It will be compiled, but AFAIK Julia doesn't (yet) elide temporary objects: y + x*cos(x^2) will create ~3 temporary objects, one for x^2, one for cos(x^2), and so on. If these were plain floats, the compiler would be smarter and combine the operations.
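
For concreteness, the operator-overloading approach we're discussing boils down to something like this minimal sketch (illustrative only, not the actual ADForward code):

# Minimal forward-mode dual type via operator overloading (sketch only).
struct Dual
    val::Float64   # function value
    der::Float64   # derivative with respect to the seeded variable
end

Base.:+(a::Dual, b::Dual) = Dual(a.val + b.val, a.der + b.der)
Base.:*(a::Dual, b::Dual) = Dual(a.val * b.val, a.der * b.val + a.val * b.der)
Base.:^(a::Dual, n::Integer) = Dual(a.val^n, n * a.val^(n - 1) * a.der)
Base.cos(a::Dual) = Dual(cos(a.val), -sin(a.val) * a.der)

x = Dual(1.5, 1.0)   # seed dx/dx = 1
y = Dual(2.0, 0.0)   # y treated as a constant
# Every subexpression (x^2, cos(x^2), x*cos(x^2)) builds a fresh Dual object.
z = y + x * cos(x^2)
z.val, z.der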

kmsquire commented 11 years ago

I agree that source code transformation is a good direction to go, but for me also, this is treading into new territory.

mlubin commented 11 years ago

Seems like someone will have to step forward and decide to become an expert in AD :)

papamarkou commented 11 years ago

It seems an interesting topic, happy to do this once I get a grip of Griewank's book ;)

tshort commented 11 years ago

I just pushed some code that does a simple source code transformation. It's throw-away code, but it does suggest that this approach might be a viable option in Julia. It uses Calculus.differentiate to symbolically find the derivative of individual code lines. A good option might be to try source transformation on functions/sections/lines, and if it fails, fall back to operator overloading or numerical approximation.
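
The gist of it is roughly the following sketch (assuming Calculus.jl's symbolic differentiate(expr, wrt); the actual throw-away code is messier):

using Calculus

ex  = :(x^3 + exp(x))
dex = Calculus.differentiate(ex, :x)   # symbolic derivative expression

# Compile the transformed expression into an ordinary Julia function,
# so the derivative is JIT-compiled like any hand-written code.
@eval df(x) = $dex
df(2.0)   # ≈ 3 * 2.0^2 + exp(2.0)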

tshort commented 11 years ago

I added some timings in the test directory. Here are results for a basic scalar case with a polynomial function:

# base function           elapsed time: 0.110321383 seconds
# complex numbers         elapsed time: 0.255644874 seconds
# Dual numbers            elapsed time: 0.397590266 seconds
# ADForward               elapsed time: 12.348061681 seconds
# source transformation   elapsed time: 0.148293506 seconds

It'll get more interesting when we get to multivariable cases.

mlubin commented 11 years ago

Cool results, thanks for sharing!

bsxfan commented 11 years ago

Hi All

I am very new to Julia, but I have done quite a lot of coding in MATLAB to handle complicated differentiation challenges. I have done forward-mode and reverse-mode AD stuff, implemented via operator overloading and source code generation. Let me share my experiences and conclusions:

My interest has been mainly to do numerical optimization of scalar objective functions of many variables (R^n->R). For optimization, it is really nice to be able to compute the gradient (and also sometimes some limited 2nd-order derivatives).

Derivatives of a simple function are usually easy to derive and code by hand, but when functions are composed, the complexity quickly explodes so that hand derivation is no fun anymore and coding errors are virtually guaranteed. So we need a nice way to handle derivatives for function composition. Consider the composition f(x) = a(b(c(x))). Let x, c(x), and b(c) be vectors, and a(b) be scalar. Then we have the chain rule:

grad_f = J'_c * J'_b * grad_a

where J denotes a Jacobian matrix and ' denotes transpose. Forward-mode AD effectively does this as:

(J'_c * J'_b) * grad_a

i.e. as a matrix-matrix multiplication (slow), followed by a matrix-vector multiplication. Reverse-mode AD does:

J'_c * (J'_b * grad_a)

which is just two matrix-vector products. If the matrices are large and you have longer chains of compositions, you really want to do reverse-mode, rather than forward mode. Unfortunately it is also a lot more difficult to code.
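
To make the cost difference concrete, here is a small Julia illustration (sizes are arbitrary and just for the point; this is not code from any package):

n, p, q = 1000, 1000, 1000
Jc = randn(p, n)   # Jacobian of c at x,       c : R^n -> R^p
Jb = randn(q, p)   # Jacobian of b at c(x),    b : R^p -> R^q
ga = randn(q)      # gradient of a at b(c(x)), a : R^q -> R

gf_fwd = (Jc' * Jb') * ga   # forward-style association: a matrix-matrix product, O(n*p*q)
gf_rev = Jc' * (Jb' * ga)   # reverse-style association: two matrix-vector products, O(p*q + n*p)
gf_fwd ≈ gf_rev             # same gradient, very different cost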

So I tried implementing my own reverse-mode solutions in MATLAB. I used operator overloading, done in such a way as to build up a structure of function handles. When the returned handle is called, it backpropagates the desired gradient. Although this worked, I gave it up because it was ridiculously slow. The problem is with the way MATLAB cleans up objects (including function handles) when they go out of scope. I found some discussion forums about this problem and came to the conclusion that you just don't want to be using MATLAB when you have many objects referencing each other. My solution was to adapt my code so that the function handles---rather than actually doing the calculations---just generated code (an actual m-file on disk). The code generation remained slow, but once the code is generated, it runs as fast as any other MATLAB code, and it can then be used in the numerical optimization algorithm, which of course has to compute gradients over and over as it iterates.

I was very pleased with this solution for a while, but then found myself drifting away from it. It just ended up having too many limitations---for example, it could not really handle loops in a nice way. My current way to handle derivatives (in MATLAB) is the following:

  1. Structure your whole calculation into manageable function blocks. Use a structured way to hook blocks and their derivatives up, respecting the chain rule. Hooking up blocks is not too tricky as long as you remember to add gradients when multiple backpropagating gradients converge on the same node.
  2. Hand-code derivatives of simple functional blocks. For this it is really good to get to grips with 'matrix calculus'. This may sometimes be a little tricky, because you need to be able to backpropagate gradients, i.e. to multiply by the transpose of the Jacobian of your function block. But, with practice and a few useful tricks, one can get used to it.
  3. Test the hand-coded derivatives of every block by comparing the outputs against automatic forward-mode differentiation. For the full-scale problem, forward-mode AD would be too slow. But you can usually manage to scale the problem down to allow fast testing. Also test that the derivatives still work properly when you have hooked up all blocks.
  4. Build up a library of useful function blocks once they have been tested.

For my forward-mode AD testing, I use both dual numbers and the complex-step trick. I compare them both against my hand-coded backpropagation result and also to each other. The complex and dual methods are very similar (and in my MATLAB implementation have very similar execution speed and memory requirements). But they have different pitfalls.
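
For example, the complex-step check looks like this in Julia (the standard trick, not my MATLAB code):

f(x)  = x^3 + exp(x)
df(x) = 3x^2 + exp(x)            # hand-coded derivative to test against

x0, h = 1.3, 1e-20
imag(f(x0 + im*h)) / h ≈ df(x0)  # true; no subtractive cancellation, so h can be tiny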

Depending on your dual number implementation, you can apply them recursively. You can use complex values inside the dual number, or even nest dual numbers. In this way you can do 2nd-order derivatives.
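
The nesting idea, sketched in Julia with a throwaway parametric dual type (illustrative only):

struct D{T}
    v::T   # value part
    d::T   # derivative part
end
Base.:+(a::D, b::D) = D(a.v + b.v, a.d + b.d)
Base.:*(a::D, b::D) = D(a.v * b.v, a.d * b.v + a.v * b.d)

f(x) = x * x * x                  # f''(x) = 6x
x0 = 2.0
xx = D(D(x0, 1.0), D(1.0, 0.0))   # a dual number whose components are themselves duals
y  = f(xx)
y.d.d                             # == 12.0, the second derivative at x0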

In MATLAB I ended up using an operator-overloaded Dual Matrix type rather than a Dual number. The dual matrix has two matrix-valued components, in contrast to having matrices of dual numbers. This was done for practical reasons: I overloaded the matrix operators in such a way that the matrix operations are still done efficiently by BLAS and LAPACK. If one has matrices of dual numbers, then one loses fast linear algebra!
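
In Julia terms, the idea is roughly a type like this (a sketch of the layout, not my MATLAB implementation):

struct DualMatrix
    val::Matrix{Float64}   # value part
    dot::Matrix{Float64}   # derivative part (same shape as val)
end

# Product rule d(AB) = dA*B + A*dB: three ordinary dense matrix products,
# so BLAS still does all the heavy lifting.
Base.:*(A::DualMatrix, B::DualMatrix) =
    DualMatrix(A.val * B.val, A.dot * B.val + A.val * B.dot)
Base.:+(A::DualMatrix, B::DualMatrix) =
    DualMatrix(A.val + B.val, A.dot + B.dot)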

papamarkou commented 11 years ago

Hi bsxfan, this is extremely useful information. I would like to ask you two questions in relation to the code I am currently trying to write. I am trying to replace the existing ADForward type, which has a number and a vector element and which is meant basically to handle functions f:R^n->R. The new type I am writing is meant to handle functions f:R^n->R^m. This means that it has an m-length vector element, which holds the value of the function, and an m*n matrix, which is the Jacobian (the gradient if m=1). Since I am now trying to overload the operators for this new type, I wanted to ask you how you chose to do the overloading so that it can still be done efficiently by BLAS and LAPACK (for example is it efficient to use nested for loops?). Any advice or coding contribution is greatly appreciated.
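
For concreteness, the layout I have in mind is roughly the following (an illustrative sketch with placeholder names, not the code I have written so far):

using LinearAlgebra: Diagonal

struct VecJac
    val::Vector{Float64}   # f(x) in R^m
    jac::Matrix{Float64}   # Jacobian df/dx, m-by-n
end

Base.:+(f::VecJac, g::VecJac) = VecJac(f.val + g.val, f.jac + g.jac)

# Elementwise product rule: d(f.*g) = Diagonal(g)*Jf + Diagonal(f)*Jg.
# Keeping the update as matrix products lets BLAS do the work instead of
# hand-written nested loops.
elmul(f::VecJac, g::VecJac) =
    VecJac(f.val .* g.val, Diagonal(g.val) * f.jac + Diagonal(f.val) * g.jac)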

The second question is whether you have ever tried to implement hyperdual numbers for the computation of higher-order derivatives; see the link below re hyperduals:

http://www.stanford.edu/~jfike/

papamarkou commented 11 years ago

I updated Jonas' code for forward AD:

  1. The API has been made more compact. To see how it looks, try running the test/gradual_ad_test.jl script.
  2. Conversion, promotion, and dual arithmetic have been improved in places and some relevant bugs have been removed.
  3. A small number of dual operators (such as multiplication) have been sped up a little via devectorization.
  4. Naming conventions have been changed in Jonas' script. The major change you will notice is that the ADForward type has been renamed to GraDual.

I thought carefully before making this change. ADForward is too specific, referring only to the forward mode. The type under consideration is a generalization of dual numbers obtained by simply considering n-length gradients instead of 1-dimensional ones (aka derivatives). The GraDual type may be used in other places, either in AD or elsewhere, so I thought a more generic name was needed. The name should convey that the dual arithmetic is retained and extended to "n derivative dimensions". Names such as Tuple would not qualify in this respect, and names such as Tual or Nual would be way too cryptic. GraDual carries two pieces of information in its name: the dual arithmetic, and the fact that this arithmetic is extended from 1 to n (thanks to the "gra" prefix). GradDual looks ugly, so I chose GraDual, which is of course a word game, but it is not more ambiguous than... complex, for instance. I hope you will like it.

P.S. The rest of the tests may need to be changed to account for these generic changes - test/gradual_ad_test.jl can be used as a guide.

P.S. 2: This forward AD approach is hopefully solid enough. It aims to be broad enough to cover various differentiation problems. It makes no claims of being fast, but it can be used as a standard approach in case everything else fails (and is also a way of ensuring that future approaches give correct results).

P.S. 3: I will write two more routines in src/gradual.jl for calculating the Hessian and tensors (probably by tomorrow).

tshort commented 11 years ago

The current source transformation code only works for simple cases. It uses Calculus.differentiate, which can differentiate :( x^3 + exp(x) ), but it cannot handle something more involved like:

y = x^3 
z = y + exp(x)

To make this work, we need a couple of changes:

  1. Change Calculus.differentiate to support differentiating with respect to an array index, so we can differentiate :( x[1] ^ 3 + 2 * x[1] ) with respect to x[1]. This shouldn't be too hard.
  2. We could pull in the code from Calculus.differentiate or add methods that allow differentiating Dual objects where the list of active variables is known. This might require differentiate methods like the following:
differentiate(::SymbolParameter{:sin}, args, wrt, isactive::Dict{Symbol, Bool})

where isactive has the list of active variables.
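
For sin, such a method might look roughly like the following (purely a hypothetical sketch; neither this method nor the isactive-aware recursive differentiate exists in Calculus.jl yet):

# Hypothetical sketch only -- not part of Calculus.jl.
function differentiate(::SymbolParameter{:sin}, args, wrt,
                       isactive::Dict{Symbol,Bool})
    u = args[1]
    # If the argument is an inactive symbol, its derivative is zero.
    if isa(u, Symbol) && !get(isactive, u, false)
        return 0
    end
    du = differentiate(u, wrt, isactive)   # hypothetical isactive-aware recursion
    return :(cos($u) * $du)
end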

papamarkou commented 11 years ago

Cheers for the tips, Tom. I will look into this and will make the changes (unless someone else does it faster than me, which is always a welcome contribution).

papamarkou commented 11 years ago

Tom, I am trying to extend your code in source_transformation.jl according to your suggestions above. I want to ask you a few questions so as to understand some parts of it, before making any changes:

f(x, y) = 3*x*y^2+x*exp(y);
m = methods(f, (Float64, Float64))[1][3];
ast = Base.uncompressed_ast(m);

ast.args[1]
2-element Any Array:
:x
:y

tshort commented 11 years ago

You're right that it would be better to use ast.args[1] rather than args[1][1]. I had forgotten that I had taken that shortcut.

ast.args[2][1] is not empty if you have local variables, like the following:

function f2(x, y)
     z = x + y
     3*x*z^2+x*exp(y)
end
m = methods(f2, (Float64, Float64))[1][3]
ast = Base.uncompressed_ast(m)
ast.args[2]

I think you still need something like isactive and find_dependents if you use ast.args[1]. One reason is that you don't know what depends on what: is y a function of x, or just some parameter? Another reason is finding dependencies among local variables.

I agree that we could extend Calculus.differentiate as long as it can handle dependencies. It does need to be extended to handle more than Dual numbers, so it doesn't recalculate things every time. The derivative may turn into a loop, though.

papamarkou commented 10 years ago

I will close this issue following the recent discussion in METADATA and the joint decision to split packages - source transformation will be hosted in a separate package written by @fredo-dedup.