schlichtanders commented 8 years ago

This is an awesome project and I hope it will mature even more together with julia's growth! Thanks for having realized the state it is now.

Some Introductory Words

I like to contribute in the long run! probably I won't have much time soon, as I will start my first job as a Data Scientist. But I truly love this general direction. In fact, I was motivated to implement something almost exactly like it myself. And where motivation is, there is time.

My motivation

The last half year I extensively worked with Python's Theano and loved its idea from the very beginning. Soon I got to know many of its drawbacks, where its awesome basic idea of graph-optimization and automatic differentiation are not put straightforward but wrapped in an enormously big code base.

The major drawback was that I had to rely on Theano and its C-code, completely. Combining theano with native Python always means that your code is not optimal in terms of performance... I don't like that as it takes away much of the control I would like to have as a programmer. Also the implementation of theano is rather horrible as they have custom Tree-Type with hidden references to another, which makes it really hard to cut, merge, ... (sub)trees. (With tree I just mean Graph).

I wanted to have clear interaction focus on the computation graphs themselves, without giving up control.

Julia

Julia is one of the ideal languages to implement such ideas, as there is no need to compile the graphs into C-code or something (like is done in Python's theano) and thereby taking away much control of the programmer. Instead in Julia everything can be compiled to Julia because of its meta system.

Furthermore, Julia's meta system already gives a default graph structure, which even allows for annotations as far as I understood it (for instance quote println("hello world") end or dump(quote println("hello world") end) already show annotations). As far as I read, julia's llvm even performs certain graph optimizations. Unfortunately I still haven't understood the details here.

As I see from your code https://github.com/JuliaDiff/ReverseDiffSource.jl/blob/master/src/frdiff.jl there is already a way to extract the source graph from function objects programmatically, which enables one to indeed work on the julia expression graphs. As you do it.

Some Suggestions

Suggesting use of common Graph structures

From my experience, the interface of theano and tensor-flow is useful until a certain, especially for fast scripting some high level stuff. But for any serious gradient-based machine learning stuff the programmer wants to have control over the computation graph directly (at least this is my experience).

I thought a lot about which graph type would be preferable for representing all this. You made your own type, because of the above reason, I would recommend using julias expression types directly I think. Some additional reasons:

almost everyone using Julia, at least those who write macros (which will be a lot users) are used to the expression tree of julia and hence automatically able to interact appropriately with the structure
any helper functionalities written for expression can be used here too (and also the other way around) - even more, these helpers might become commonly known within the community. As I got it, Julia itself is build following this aim. It collects a bunch of helper functionalities of which many are easily implemented by yourself. The aim is clearly to favour a common interface.

However, there seem to be certain graph optimizations which need a double-linked graph. At least in Python theano there is also both, a single linked graph (like Julia's expression graphs) and a double-linked one which is used for graph optimizations. I also thought a lot about which double-linke graph would be preferable. My conclusion was that I would follow Julia's Parametric Types suggestion for containers. I.e. that a generic double-linked tree is used with its value being parametrized. (Next to generalizability and style, a direct advantage of this direction is that the trees can be easily copied and combined, something which seems to be an internal mess within Theano.

This tree structure would of course made its own DoubleLinkedTree.js git repository. Such a collection type doesn't exist yet as far as I can see. (Or maybe better to take the time here and build a Tree.js or Graph.js with DoubleLinkedTree and SingleLinkedTree. Everything for Julia.)

As I see in your code, this is not yet build this way, however you seemed to have extracted the tree structure from the values and the derivation/optimization functionalities (please correct me here as I was not able to read into your code into depth yet). Furthermore, In your code you seemd to use even TripleLinkedTree by referring to precedences https://github.com/JuliaDiff/ReverseDiffSource.jl/blob/master/src/node.jl Would be great if you can explain why you need this, as I haven't seen such in Theano's tree structures.

Suggesting re-implementing theano's graph-optimizations in a separate respository

Theano (and probably also tensor-flow) already collected a lot of optimization methods. So there is no need to be creative, but the work is almost done already, but needs to be rewritten into julia. See https://github.com/Theano/Theano/blob/master/theano/tensor/opt.py for the main script of theano's graph-optimizer (if I got it right).

There are other suggested optimizations which I found while reading Julia's documentation. See http://docs.julialang.org/en/release-0.5/manual/performance-tips/#tweaks and http://docs.julialang.org/en/release-0.5/stdlib/math/#Base.muladd for instance.

It would be great, to expose the graph-optimization to the programmer, in that there are

helper functions to apply optimizers to graphs, returning optimized graphs
different default lists of optimizers (like it is done in Python with FAST_RUN and FAST_COMPILE for instance)
possibility to easily add new custom optimizers (which is kind of already realized by the previous two points)

In this vision graph-optimization is something complementary to automatic differentiation and hence the suggestion to keep those two functionalities separate. Probably it is even apt to create a new repository for this (This would be inline with the intention of JuliaDiff I think). Graph-Optimization would be useful to many Julia-users, even if they don't need AD.

Suggestion for replicating theano's / tensor-flows functionality, once the graph interface is there

There is an easy way to replicate machine learning functionality implemented in theano/tensor-flow by simply bookkeeping certain references to the complete computation graph, i.e. sub-expressions. I implemented such an approach on top of theano, called theano_models (note that the documentation still needs improvement). The central element is just dictionary like object. For example let y = :($a * $x + $b) i.e. x, a, b refer to sub-expressions. Then

linear_model = Dict(
  :inputs=>[x],
  :outputs=>y,
  :parameters=>[a, b]
)

would nicely represent the structure needed for machine learning. What is done in tensor-flow by labeling expressions as variables is much more generally implemented this way, by simply merging references of different sub-models. One advantage I already used is that you simply can add new keys, e.g. :parameters_positive, which might collect all references to parameters which need to stay positive.

Having all the references to expressions and the ability to automatically derive the respective expressions for the gradients is all what is to implement an elegant and maximal generic machine learning interface.

Indeed, I would like to implement this on top of your package. I already thought a lot about what would be a nice implementation and think I found one in my mentioned python version theano_models. Neverthless, It will still take time to re-implement it in Julia.

Suggestion for current improvement

From my perspective, the implementation of rdiff as returning a single expression or function which in turn returns tuples seems impressively unintuitive. I would suggest to return a tuple of expressions / tuple of functions instead. For instance in machine learning it is often not needed and not wanted to evaluate the function value directly during learning, but the gradient is enough.

Questions

What does precedences? Why have you called it *Reverse*DiffSource.js? I am just curious

Looking forward to your comments and to help you in the Future

fredo-dedup commented 8 years ago

Thanks a lot for your interest @schlichtanders ! I look forward to seeing where your ideas could lead to. I'll be glad to answer your questions (if I can) about the code of this package and have you contribute to it. But note that the current state of it is the result of several rewrites and many hacks, I lack the vision and background in programming languages to proceed from clean principles and reach a finished product ! So be warned that it may turn out to be very complicated to build anything from it.

Now about your questions :

precedence was added in during the latest rewrite of the package when I realized half-way that some dependencies of calculations were not caught in the primary graph (I think it was for setindex! functions). It resulted in a lot of complications. By the way, the rewrite I am slowly going through (branch devl3) is doing without it which indicates that it is not necessary to differentiate code, it is rather the result of the implementation.
"Reverse.." in "ReverseDiffSource" is for the differentiation method chosen which proceeds from the result and goes up to the initial variables, also called "reverse accumulation". An alternate way is to go from the variables to the end result, which is called logically "forward accumulation" (see https://en.wikipedia.org/wiki/Automatic_differentiation#The_chain_rule.2C_forward_and_reverse_accumulation).
Choice of return values (tuples) : tuple is the way for functions to return several values in julia, and it is nothing more here. When only the gradient is needed (or generally the last order), you can set the option allorders=false and the function/expression produced will provide it and nothing else.

schlichtanders commented 8 years ago

already some time is gone, thank you very much for the comments.

I fear it is already time for me to say see you some day. As it is a rather big project for me to read into the code base and start contributing I cannot promise when my context will allow for this. I keep it in the back of my mind and wish you success!

It is so awesome that julia is able to include automatic differentiation at its meta level!

JuliaAttic / ReverseDiffSource.jl

possible future directions #43