dfdx / XGrad.jl

eXpression gradients in Julia

Indexing and other woes #16

Closed cortner closed 6 years ago

cortner commented 6 years ago

I am opening this to ask for help with some issues I'm having before I can go back to the Jacobians (#15).

Here is a very simple example of the kind of functions I am trying to differentiate:

f(x) = ((x[1]*x[2])*x[3] + (x[1]*x[4])*x[5]) + ((x[2]*x[4])*x[6] + (x[3]*x[5])*x[6])

(I put in the brackets so I don't have to declare new differentiation rules here; normally I'd do that.)

My first problem is:

using StaticArrays
import XGrad
f(x) = ((x[1]*x[2])*x[3] + (x[1]*x[4])*x[5]) + ((x[2]*x[4])*x[6] + (x[3]*x[5])*x[6])
r = @SVector rand(6)
XGrad.xdiff(f, x=r)  # or x = rand(6)

which gives

ERROR: AssertionError: Currently only indexing of tuples is supported, but got Array{Float64,1}

Is there a work-around?

dfdx commented 6 years ago

Yeah, indexing is a relatively new part of XGrad, so some woes are expected.

For normal arrays (e.g. x = rand(6)), indexing already works on the jacobian branch:

f(x) = ((x[1]*x[2])*x[3] + (x[1]*x[4])*x[5]) + ((x[2]*x[4])*x[6] + (x[3]*x[5])*x[6])
df = xdiff(f, x=rand(6))
df(rand(6))   # ==> (1.6760141328514941, [0.893547, 1.17716, 1.3425, 1.32344, 1.17759, 0.896482]) 

Static arrays aren't that easy though - the current implementation of getindex's gradient involves setindex!, which is not supported for static arrays, so we need to make it a special case. Technically it's easy, but it also brings a number of design questions (e.g. should we depend on StaticArrays?).
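
To illustrate (a minimal sketch, assuming StaticArrays is installed; this is the kind of pattern the generated gradient code relies on, not XGrad code itself):

using StaticArrays

g = zeros(3)
g[1] = 1.0           # fine: a regular Vector is mutable

s = @SVector zeros(3)
# s[1] = 1.0         # errors: an SVector is immutable, so setindex! is not defined for it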

dfdx commented 6 years ago

Just pushed another commit adding experimental support for StaticArrays. Now the following works as well (again, on the jacobian branch).

using StaticArrays
using XGrad

f(x) = ((x[1]*x[2])*x[3] + (x[1]*x[4])*x[5]) + ((x[2]*x[4])*x[6] + (x[3]*x[5])*x[6])

r = @SVector rand(6)
df = XGrad.xdiff(f, ctx=Dict(:codegen => VectorCodeGen()), x=r)    

df(r)

codegen=VectorCodeGen() is optional in this specific case, but other code may fail on static arrays without it. The difference between VectorCodeGen() and BufCodeGen() (the default) is that the latter uses a number of optimizations, including memory pre-allocation and in-place operations, which are not available for StaticArrays. Later we may add a separate codegen specifically for static arrays so you don't have to specify it explicitly, but let's first make sure that this type of array is amenable to full differentiation at all.
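
For reference, the contrast looks like this (sketch only; the default path simply omits the :codegen entry):

df_default = XGrad.xdiff(f, x=rand(6))                                   # BufCodeGen: pre-allocated buffers, in-place ops
df_vector  = XGrad.xdiff(f, ctx=Dict(:codegen => VectorCodeGen()), x=r)  # out-of-place code, which the SVector case needs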

cortner commented 6 years ago

I see. Since I have few variables, would it then be better to just write them out explicitly, maybe?

dfdx commented 6 years ago

What do you mean by writing variables explicitly?

UPD. I just realized you use static arrays to group several variables. In this case you may want to use structs instead, e.g.:

julia> using XGrad

julia> struct S
           a::Float64
           b::Float64
           c::Float64
       end

julia> f(s::S) = s.a + s.b * s.c
f (generic function with 1 method)

julia> df = xdiff(f, s=S(1, 2, 3))
f_deriv_666 (generic function with 2 methods)

julia> df(S(3, 2, 1))
(5.0, S(1.0, 1.0, 2.0))

The result is another struct, with each field representing the derivative w.r.t. the corresponding field of the input struct. I use this feature to pass ML models, which normally have 5-20 parameters, so passing all of them around individually is quite tedious.

cortner commented 6 years ago

~EDIT: this looks like a problem at my end, maybe ignore for the moment.~ EDIT: I get the error message below consistently.

Hm - if I run the example,

using StaticArrays
using XGrad
f(x) = ((x[1]*x[2])*x[3] + (x[1]*x[4])*x[5]) + ((x[2]*x[4])*x[6] + (x[3]*x[5])*x[6])
r = @SVector rand(6)
df = XGrad.xdiff(f, ctx=Dict(:codegen => VectorCodeGen()), x=r)

I get

julia> df = XGrad.xdiff(f, ctx=Dict(:codegen => VectorCodeGen()), x=r)
ERROR: UndefVarError: tmp677 not defined
Stacktrace:
 [1] eval(::Module, ::Any) at ./boot.jl:235
 [2] (::XGrad.##21#22{Espresso.ExGraph})(::Symbol) at ./<missing>:0
 [3] collect(::Base.Generator{Array{Symbol,1},XGrad.##21#22{Espresso.ExGraph}}) at ./array.jl:475
 [4] rev_step!(::Espresso.ExGraph, ::Espresso.ExGraph, ::Espresso.ExNode{:call}) at /Users/ortner/.julia/v0.6/XGrad/src/xdiff.jl:166
 [5] reverse_pass!(::Espresso.ExGraph) at /Users/ortner/.julia/v0.6/XGrad/src/xdiff.jl:230
 [6] _xdiff(::Espresso.ExGraph) at /Users/ortner/.julia/v0.6/XGrad/src/xdiff.jl:239
 [7] #xdiff#29(::Dict{Any,Any}, ::Array{Any,1}, ::Function, ::Expr) at /Users/ortner/.julia/v0.6/XGrad/src/xdiff.jl:277
 [8] (::XGrad.#kw##xdiff)(::Array{Any,1}, ::XGrad.#xdiff, ::Expr) at ./<missing>:0
 [9] #xdiff#32(::Dict{Symbol,Espresso.VectorCodeGen}, ::Array{Any,1}, ::Function, ::Function) at /Users/ortner/.julia/v0.6/XGrad/src/xdiff.jl:307
 [10] (::XGrad.#kw##xdiff)(::Array{Any,1}, ::XGrad.#xdiff, ::Function) at ./<missing>:0

This is with Julia v0.6.2,

julia> Pkg.status("StaticArrays")
 - StaticArrays                  0.7.0
julia> Pkg.status("XGrad")
 - XGrad                         0.1.0+             jacobian
cortner commented 6 years ago

tbh, I'd rather avoid the struct trick; it is actually much more natural to work with SVectors. But I may come back to it if nothing else works.

dfdx commented 6 years ago

Oops, I forgot to push the changes in Espresso.jl. Please check out its latest master.

Also, I recommend adding at least one empty line before a function to be differentiated, as described in the Code Discovery section of the docs. Normally it's not needed, but extracting function sources in Julia is quite fragile, so it's better to be on the safe side.
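
For example, a layout like this (just a sketch of the recommendation; the blank line before the definition is the point):

# ... whatever comes before ...

f(x) = ((x[1]*x[2])*x[3] + (x[1]*x[4])*x[5]) + ((x[2]*x[4])*x[6] + (x[3]*x[5])*x[6])

df = XGrad.xdiff(f, x=rand(6))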

cortner commented 6 years ago

This all works now - thank you! But it is strangely slow due to many allocations?

@btime f($r)
@btime df($r)
  2.265 ns (0 allocations: 0 bytes)
  21.801 μs (283 allocations: 11.61 KiB)
dfdx commented 6 years ago

Yes, as well as due to lots of unnecessary work. Let me explain it in a bit more detail.

In most ML applications (which I originally designed XGrad for) you only call getindex once, e.g.:

A = ...
loss = A[1]

The derivative of the scalar loss w.r.t. the array A in this case is (roughly):

dloss!dA = zeros(size(A))
dloss!dA[1] = loss

So all but the first element of A are zeros.

However, if the loss (i.e. the output variable) is constructed from several elements of A, the generated code becomes quite suboptimal:

A = ...
loss = A[1] + A[2] * A[3]

dloss!dA__1 = zeros(size(A))
dloss!dA__1[1] = ...

dloss!dA__2 = zeros(size(A))
dloss!dA__2[2] = ...

dloss!dA__3 = zeros(size(A))
dloss!dA__3[3] = ...

# sum individual derivatives to obtain the whole dloss!dA
dloss!dA = dloss!dA__1 .+ dloss!dA__2 .+ dloss!dA__3

So for 3 components of A we first allocate 3 zero arrays, fill in only one element of each, and then combine these arrays together. Needless to say, this is highly suboptimal, especially for static arrays.

One way to overcome this is to use plain tuples instead, which generate code similar to this:

A = ...
loss = A[1] + A[2] * A[3]

dloss!dA__1 = ...
dloss!dA__2 = ...
dloss!dA__3 = ...

dloss!dA = (dloss!dA__1, dloss!dA__2, dloss!dA__3)

and runs significantly faster than for static arrays:

using XGrad
using BenchmarkTools

f(x) = ((x[1]*x[2])*x[3] + (x[1]*x[4])*x[5]) + ((x[2]*x[4])*x[6] + (x[3]*x[5])*x[6])

r = tuple(rand(6)...)
df = XGrad.xdiff(f; x=r)
@btime f($r)
@btime df($r)

# 2.026 ns (0 allocations: 0 bytes)
# 131.121 ns (8 allocations: 864 bytes)

So it's ~170x faster than the version with static arrays, although still ~65x slower than the original function.

With a bit of semi-manual optimization I can get the following expression:

function df_manual(x)
    tmp917 = 3
    dtmp946!dtmp927 = 1.0
    tmp920 = 1
    tmp931 = 4
    tmp932 = x[tmp931]
    tmp918 = x[tmp917]
    tmp925 = 5
    tmp926 = x[tmp925]
    tmp929 = 2
    tmp930 = x[tmp929]
    tmp933 = tmp930 * tmp932
    tmp913 = x[tmp920]
    tmp934 = 6
    tmp935 = x[tmp934]
    tmp936 = tmp933 * tmp935
    dtmp946!dtmp916 = tmp918 * dtmp946!dtmp927
    dtmp946!dtmp915 = tmp913 * dtmp946!dtmp916
    dtmp946!dtmp935 = tmp933 * dtmp946!dtmp927
    dtmp946!dtmp913 = tmp930 * dtmp946!dtmp916
    tmp916 = tmp913 * tmp930
    dtmp946!dtmp918 = tmp916 * dtmp946!dtmp927
    tmp919 = tmp916 * tmp918
    tmp924 = tmp913 * tmp932
    tmp927 = tmp924 * tmp926
    tmp928 = tmp919 + tmp927
    dtmp946!dtmp926 = tmp924 * dtmp946!dtmp927
    dtmp946!dtmp924 = tmp926 * dtmp946!dtmp927
    dtmp946!dtmp923 = tmp913 * dtmp946!dtmp924
    dtmp946!dx = (dtmp946!dtmp913, dtmp946!dtmp915, dtmp946!dtmp918, dtmp946!dtmp923, dtmp946!dtmp926, dtmp946!dtmp935)
    tmp941 = tmp918 * tmp926
    tmp944 = tmp941 * tmp935
    tmp945 = tmp936 + tmp944
    tmp946 = tmp928 + tmp945
    tmp964 = (tmp946, dtmp946!dx)
end

which runs in:

@btime df_manual($r)
#  9.744 ns (0 allocations: 0 bytes)

I think I can make it work that fast automatically for tuples. However, I'm really not sure about static arrays - they are somewhere in between normal arrays and tuples, and I don't yet see the best way to handle them.

Does it make sense to you to use tuples instead of static arrays?
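
Something like this hypothetical adapter is what I have in mind, in case the rest of your code keeps using SVectors (the names are just for illustration):

using StaticArrays
using XGrad

r_sv  = @SVector rand(6)
r_tup = Tuple(r_sv)              # an SVector is backed by a plain tuple

df = XGrad.xdiff(f, x=r_tup)     # differentiate on the tuple representation

val, g_tup = df(Tuple(r_sv))     # (value, tuple of partial derivatives)
g_sv = SVector(g_tup)            # wrap the gradient back into an SVector if needed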

cortner commented 6 years ago

Hi, for my application I am perfectly happy with tuples. I'll give this a try.

You probably know this, but static arrays are basically just tuples: x.data is a tuple.
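
For example (a quick check, assuming StaticArrays is loaded):

using StaticArrays

x = @SVector rand(3)
x.data isa NTuple{3, Float64}   # true - the SVector just wraps a plain tuple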

Thanks for putting so much effort into this. I will try this for a while and report back if I run into serious problems.

dfdx commented 6 years ago

You probably know this, but static arrays are basically just tuples: x.data is a tuple.

Yes - from the implementation point of view, but not from the point of view of code generation in XGrad, as you can see :)

BTW, please update the jacobian branch - I've just fixed some code for tuples there.

cortner commented 6 years ago

It seems to be working quite well so far. And if this last performance problem is really fixable, that would be great. Anything I can help with?

cortner commented 6 years ago

I notice that the exponents in expressions like $x[n]^p$ are all turned into variables. Does this not mean that you will call pow instead of doing a fast integer exponentiation?

It isn't really a problem for me since I resolve the exponentiation manually; I was just wondering.

cortner commented 6 years ago

Here is an observation I thought was interesting:

function ff(x)
   x1 = x[1]
   x2 = x[2]
   x3 = x[3]
   x4 = x[4]
   x5 = x[5]
   x6 = x[6]
   ((x1*x2)*x3 + (x1*x4)*x5) + ((x2*x4)*x6 + (x3*x5)*x6)
end
dff = XGrad.xdiff(ff, x=tuple(r...))
@btime dff($(tuple(r...)))

this times to ca 200ms instead of 380ms (original).

cortner commented 6 years ago

I don't mind at all converting my functions like that if it would help. So it's now a matter for me of finding out where the allocation is coming from?

dfdx commented 6 years ago

I notice that the exponents in expressions like $x[n]^p$ are all turned into variables. Does this not mean that you will call pow instead of doing a fast integer exponentiation?

The Julia compiler seems to be smart enough to handle both cases efficiently; at least I haven't noticed any difference in my experiments. If your results differ, please report it and I'll incorporate a fix.
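
If you want to check on your machine, a micro-benchmark along these lines should show it (hypothetical names, just a sketch; the second function mimics the exponent-in-a-variable shape of the generated code):

using BenchmarkTools

pow_lit(x) = x^3        # literal exponent, lowered via Base.literal_pow

function pow_var(x)     # exponent stored in a local variable first,
    p = 3               # roughly like the generated code
    x^p
end

@btime pow_lit($(1.7))
@btime pow_var($(1.7))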

this times to ca 200ms instead of 380ms (original).

Hm, for me the difference is negligible - 131 and 129 ns on both Julia 0.6.0 and 0.6.2. Are your benchmarks reproducible on your machine? Also, did you mean 200ns instead of 200ms (ns vs. ms)?

I don't mind at all converting my functions like that if it would help. So it's now a matter for me of finding out where the allocation is coming from?

It turns out most of this time comes from an unnecessary optional parameter in the generated function (recall that XGrad was designed for ML tasks, where constant factors like optional parameters don't really matter). To overcome it, I've added a (perhaps temporary) flag for skipping this optional argument. Please check out the jacobian branch and run:

ctx = Dict{Any,Any}(:nomem => true)
df = XGrad.xdiff(f, ctx=ctx, x=tuple(r...))
@btime df($r)
#  22.954 ns (2 allocations: 128 bytes)
cortner commented 6 years ago

I get consistent timing: 32.384 ns (2 allocations: 128 bytes), just a slower machine I guess. Fantastic! ~I run into trouble with a bigger system though. Let me just double-check this and then post it here.~

cortner commented 6 years ago

So for scalar outputs, I think this is already very nice. Do you want to keep it open to discuss the last allocation that still messes up the performance a bit?

dfdx commented 6 years ago

Yeah, let's keep it open for now. Also I don't want to merge it into master until a more future-proof way to specify code generation options (e.g. :nomem) is found.

cortner commented 6 years ago

Thank you for your efforts.

I'll report back here if I find anything else.

dfdx commented 6 years ago

I've just got back to this issue and found out that the remaining 2x difference comes from incorrect measurements. When I changed this:

@btime df($r)

to this (interpolating the function itself):

@btime $df($r)

I got the same 9.7 ns as in the completely manual version.
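
Side by side (the comments are my reading of why it matters: df is a non-const global, so the uninterpolated call also pays for dynamic dispatch on that binding):

using BenchmarkTools

@btime df($r)      # also measures looking up / dispatching on the non-const global df
@btime $df($r)     # interpolates the function value, so only the call itself is timed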

Closing this issue and moving to jacobians :)