arrayfire / arrayfire-ml

ArrayFire's Machine Learning Library.
BSD 3-Clause "New" or "Revised" License

Automatic Differentiation #2

Closed · jramapuram closed this issue 7 years ago

jramapuram commented 9 years ago

This would be cool: http://en.wikipedia.org/wiki/Automatic_differentiation Theano uses this.

This should be a secondary objective. Initially (and for performance reasons) it will make sense to provide the derivatives of functions by hand; however, this feature will provide extensibility for later, as-yet-unknown model types.
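As a purely illustrative sketch of the "provide the derivatives of functions" route, each forward op would ship with a hand-written backward function; the names and structure below are hypothetical, not an existing arrayfire-ml API.

```cpp
#include <arrayfire.h>

// Forward: y = sigmoid(x)
af::array sigmoid_fwd(const af::array &x) {
    return af::sigmoid(x);
}

// Hand-derived backward: dL/dx = dL/dy * y * (1 - y)
af::array sigmoid_bwd(const af::array &y, const af::array &grad_y) {
    return grad_y * y * (1.0 - y);
}

int main() {
    af::array x  = af::randu(4, 4);
    af::array y  = sigmoid_fwd(x);
    // Seed the backward pass with dL/dy = 1 to get dy/dx element-wise
    af::array gx = sigmoid_bwd(y, af::constant(1.0, y.dims()));
    af_print(gx);
    return 0;
}
```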

futurely commented 9 years ago

The cost is longer compile time, which is very counter-productive, especially for quick research and prototyping iterations.

jramapuram commented 9 years ago

Yup, mentioned above already

sherjilozair commented 9 years ago

There is nothing fundamental about automatic differentiation that associates it with longer compile times. Some folks from Berkeley are implementing an alternative to Theano (rll.berkeley.edu/cgt/), which essentially solves the long compilation problem.

botev commented 8 years ago

So I have actually already implemented a baby step of this, although an older version of it in Rust. As mentioned in #28, I find ArrayFire the perfect backend for this. I haven't checked CGT myself, but I can see it is still Python based. I would prefer a C++ version of this rather than Python; later it could easily be ported to other languages. However, it also requires some knowledge of, and expectations about, the backend. I'm quite happy to work on it, but I wanted to hear what you guys think. Another nice part is that you can apply methods such as CRISP (https://cs.nyu.edu/web/Research/TechReports/TR1995-698/TR1995-698.pdf), which can benefit both parallelization and memory usage on the device.

There isn't much specific literature on the topic, since it is quite simple. However, the optimization of the graph is not so trivial to do in a constant number of passes. This usually includes mathematical identity simplification, groupings, possible register (or memory blob) allocation, and deciding which operations can be inlined. Anyway, I think for ML this is the standard, since it gives you an order-of-magnitude benefit in prototyping and research as well as in run-time speed. If the graph optimization is done well, compilation times are only a few seconds longer, which does not really matter if you get a boost at run time.
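For illustration, a single bottom-up pass of the identity simplification mentioned above could look like this on a toy expression graph (a sketch only, not ArrayFire's internal representation):

```cpp
#include <memory>
#include <string>

// Toy expression node: just enough to demonstrate x*1 -> x and x+0 -> x rewrites.
struct Expr {
    std::string op;                 // "var", "const", "mul", "add", ...
    double value = 0.0;             // used when op == "const"
    std::shared_ptr<Expr> lhs, rhs; // children for binary ops
};
using ExprPtr = std::shared_ptr<Expr>;

static bool is_const(const ExprPtr &e, double v) {
    return e && e->op == "const" && e->value == v;
}

// One bottom-up pass of algebraic identity simplification.
ExprPtr simplify(const ExprPtr &e) {
    if (!e || (!e->lhs && !e->rhs)) return e;   // leaves are left untouched
    ExprPtr l = simplify(e->lhs), r = simplify(e->rhs);
    if (e->op == "mul") {
        if (is_const(l, 1.0)) return r;         // 1 * x -> x
        if (is_const(r, 1.0)) return l;         // x * 1 -> x
    }
    if (e->op == "add") {
        if (is_const(l, 0.0)) return r;         // 0 + x -> x
        if (is_const(r, 0.0)) return l;         // x + 0 -> x
    }
    auto out = std::make_shared<Expr>(*e);      // rebuild node with simplified children
    out->lhs = l;
    out->rhs = r;
    return out;
}
```

Real graph optimizers of course also handle grouping, inlining decisions, and memory planning, which need more than a single pass; this only shows the flavor of the identity-rewrite step.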

jramapuram commented 8 years ago

I also suggest this post: http://colah.github.io/posts/2015-08-Backprop/index.html
The automatic differentiation doesn't need to be extremely fast. Most of the time all you need is symbolic derivatives, after which you can substitute values back into the result (at a rapid rate).

botev commented 8 years ago

Mm, in general you would want to do source generation; otherwise plain substitution means you have to do manual GC on the graph, rather than letting the compiler do what it does best.

I was wondering: if I implement this on a purely graph-based structure, what are the important parts of each node that ArrayFire would need to know about? Usually each node has some metadata.
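For illustration only, roughly the sort of per-node fields I have in mind (the names below are hypothetical, not anything ArrayFire defines):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical metadata for one node of a computation graph (illustrative only).
struct GraphNode {
    std::string              op;       // operator applied at this node ("add", "matmul", ...)
    std::vector<std::size_t> parents;  // indices of the input nodes
    std::vector<long>        shape;    // symbolic or concrete dimensions
    std::string              dtype;    // element type, e.g. "f32" or "f64"
    int                      grad_of;  // index of the node this is a gradient of, or -1
};
```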

Would it be required to explicitly know the device on which the memory will live, or anything else that I'm missing? (Please excuse me here; I have really never used ArrayFire before and have only just started trying it.)

pavanky commented 8 years ago

@Botev We do most of what you suggest internally inside arrayfire's JIT engine. @jramapuram opened a ticket on the upstream arrayfire library to perform automatic differentiation.

I am fairly new to the subject. All I've seen / read so far is what other DNN libraries are doing to a certain extent. However I haven't yet figured out the scope of operations to support.

botev commented 8 years ago

Hmm, could you point me to a document so that I can better understand what happens under the hood of the ArrayFire JIT? Are you building a symbolic graph of the computation? Or are you referring to storing the dims, shape, broadcasts, etc. of each variable? I must say that I'm really quite far from low-level HPC, so I have no idea how you do the computation inside. To a large extent, what the other DNN libraries do is de facto the same; they just have different ways of expressing it. The most general currently is Theano, which represents gradients as operations on the graph (adding nodes), which imho is the best approach, since it later allows you to derive things like Hessian-vector products in a very natural way. In terms of operations to be supported, in general the more you have the better. Ah, I could write a lot on this topic, but as I said I'm a noob at HPC and have no idea how you guys manage things under the hood.
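As an aside on why the graph-of-gradients representation is so convenient: if g(x) = ∇f(x) is itself just more nodes in the graph, a Hessian-vector product needs no new machinery, since

    Hv = ∇x( g(x) · v )

so one more reverse-mode pass through the gradient graph yields Hv without ever materializing the Hessian itself.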

jramapuram commented 8 years ago

I think the tensorflow implementation is far superior to theano. Here is their whitepaper: http://download.tensorflow.org/paper/whitepaper2015.pdf

pavanky commented 8 years ago

@Botev It is internal to arrayfire, so it is not very well documented.
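The user-visible behaviour, roughly, is that element-wise operations are recorded lazily and the JIT fuses them into a single kernel when evaluation is forced (by eval() or by copying data back to the host). A minimal sketch:

```cpp
#include <arrayfire.h>

int main() {
    af::array a = af::randu(1024);
    af::array b = af::randu(1024);

    // These element-wise ops only build up a JIT expression; no kernel runs yet.
    af::array c = a * b + 0.5f * af::exp(a) - b;

    // Evaluation (and fusion into a single kernel) happens here.
    c.eval();
    af::sync();
    return 0;
}
```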

pavanky commented 8 years ago

If I can understand the theory well enough, I think jit.cpp and the header files inside the JIT directory can be extended to implement automatic differentiation.

botev commented 8 years ago

@jramapuram The automatic differentiation part is the same. If you look at section 4.1 and observe a Theano graph with gradients, you will see that the way gradients are symbolically computed is exactly the same (this is also true for Torch, CGT, and almost any other good DNN library out there). The difference is in how they manage the exact computation, code generation, compilation, etc. later.

jramapuram commented 8 years ago

@Botev: sure, in the end you are just running reverse-mode differentiation, using the sum and product rules over the graph to get the gradient of the destination node w.r.t. the head node. But as you mention, there are a lot of features that they support that optimize things (e.g. partial graph utilization, etc.).

Edit: another is when to use forward mode vs. reverse mode diff
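A bare-bones sketch of that reverse-mode accumulation (sum rule over fan-out, chain/product rule per edge) on a toy scalar graph; the names and structure are illustrative, not any library's API:

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Toy reverse-mode sweep over a graph stored in topological order
// (parents appear before children). Each node records, for every parent,
// the local chain-rule factor as a vector-Jacobian-style callback.
struct Node {
    std::vector<int> parents;
    // vjp[i](grad_out) is this node's contribution to parents[i]'s gradient.
    std::vector<std::function<double(double)>> vjp;
};

// grads[k] accumulates d(output)/d(node k); fan-out is handled by "+=" (the sum rule).
std::vector<double> backward(const std::vector<Node> &topo, int output) {
    std::vector<double> grads(topo.size(), 0.0);
    grads[output] = 1.0;  // seed: d(output)/d(output) = 1
    for (int k = static_cast<int>(topo.size()) - 1; k >= 0; --k) {
        for (std::size_t i = 0; i < topo[k].parents.size(); ++i) {
            grads[topo[k].parents[i]] += topo[k].vjp[i](grads[k]);
        }
    }
    return grads;
}
```

For y = a*b, for instance, y's two vjp entries would be g -> g*b and g -> g*a, which is just the product rule; an ArrayFire version would operate on af::array values instead of scalars.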

botev commented 8 years ago

In ML, especially deep learning, which is usually the main target for such libraries, these are more or less sugar that helps you do things or investigate after training. Whether I construct a brand new graph or reuse the old one is, imho, not so important. The device management and execution, however, can have a significant impact, which is why I came here in the first place. You are right that they have a few more features, but they also have fewer in places: there are a number of operators they do not support on the GPU or do not have at all. Additionally, I did not find any resources on exactly what optimizations they perform on the graph, so it is hard to compare them with other frameworks.

jramapuram commented 8 years ago

Yeah, you are right that some operators are not supported on the GPU (try using the ADAM optimizer; there is a mean() call somewhere that seems to need the CPU). However, it's really interesting how they allow for 'loose' computations which dynamically toggle the backend. Of course, performance suffers from this. In this regard I think ArrayFire has a huge advantage, as there are implementations of basically all the reduction algorithms you might need, etc.

Anyhow, it would be a very useful feature to have in the main library in general. Graph derivatives are useful in plenty of places besides just ML. Here is the other ticket I submitted upstream: https://github.com/arrayfire/arrayfire/issues/1098
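For example, the reductions referred to here are already part of ArrayFire's core API and run on whichever backend is active; a small sketch:

```cpp
#include <arrayfire.h>
#include <cstdio>

int main() {
    af::array x = af::randu(1000, 10);

    float     total = af::sum<float>(x);   // full reduction to a host scalar
    af::array cols  = af::mean(x, 0);      // per-column mean, stays on the device
    double    peak  = af::max<double>(x);  // scalar max

    std::printf("sum=%f max=%f\n", total, peak);
    af_print(cols);
    return 0;
}
```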

pavanky commented 8 years ago

Hmm, one problem I can think of is that matrix multiply is not part of the JIT engine in arrayfire; only element-wise operations are part of it.
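To make that concrete, in a chain like the sketch below the element-wise part can be fused by the JIT, while af::matmul falls outside it and is dispatched to the active backend's BLAS routine:

```cpp
#include <arrayfire.h>

int main() {
    af::array W = af::randu(256, 512);
    af::array x = af::randu(512, 1);
    af::array b = af::randu(256, 1);

    af::array z = af::matmul(W, x);               // not JIT-fused: handed off to BLAS
    af::array y = af::sigmoid(z + b) * 2.0 - 1.0; // element-wise chain: JIT can fuse this
    y.eval();                                     // force evaluation

    return 0;
}
```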

botev commented 8 years ago

Aren't those at least referenced there as calls to BLAS or some other routines?

pavanky commented 8 years ago

@Botev any chance you are on gitter? I'd like to discuss some of this via chat if you can drop by