Closed: cortner closed this issue 6 years ago.
Yeah, indexing is a relatively new part of XGrad, so some woes are expected. For normal arrays (e.g. `x = rand(6)`), indexing already works on the `jacobian` branch:

```julia
f(x) = ((x[1]*x[2])*x[3] + (x[1]*x[4])*x[5]) + ((x[2]*x[4])*x[6] + (x[3]*x[5])*x[6])
df = xdiff(f, x=rand(6))
df(rand(6)) # ==> (1.6760141328514941, [0.893547, 1.17716, 1.3425, 1.32344, 1.17759, 0.896482])
```
Static arrays aren't that easy though: the current implementation of `getindex`'s gradient involves `setindex!`, which is not supported for static arrays, so we need to make it a special case. Technically it's easy, but it also brings a number of design questions (e.g. should we depend on StaticArrays?).
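For what it's worth, one possible shape of such a special case is sketched below. This is not XGrad's actual rule, and `getindex_grad` is a hypothetical name; the point is that StaticArrays provides a non-mutating `Base.setindex` that could stand in for `setindex!`:

```julia
using StaticArrays

# Hypothetical sketch of a non-mutating gradient rule for getindex on an SVector:
# instead of allocating zeros(...) and mutating with setindex!, build a fresh
# static vector with the non-mutating Base.setindex from StaticArrays.
function getindex_grad(dy, x::SVector, i::Int)
    Base.setindex(zero(x), dy, i)  # all-zero SVector with dy at position i
end

getindex_grad(1.0, SVector(1.0, 2.0, 3.0), 2)  # yields SVector(0.0, 1.0, 0.0)
```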
Just pushed another commit adding experimental support for StaticArrays. Now the following works as well (again, on the `jacobian` branch):

```julia
using StaticArrays
using XGrad

f(x) = ((x[1]*x[2])*x[3] + (x[1]*x[4])*x[5]) + ((x[2]*x[4])*x[6] + (x[3]*x[5])*x[6])
r = @SVector rand(6)
df = XGrad.xdiff(f, ctx=Dict(:codegen => VectorCodeGen()), x=r)
df(r)
```
`codegen=VectorCodeGen()` is optional in this specific case, but other code may fail on static arrays without it. The difference between `VectorCodeGen()` and `BufCodeGen()` (the default) is that the latter uses a number of optimizations, including memory pre-allocation and in-place operations, which are not available for StaticArrays. Later we may add a separate codegen specifically for static arrays so you don't have to specify it explicitly, but let's first make sure that this type of array is amenable to full differentiation at all.
I see. Since I have few variables, would it then be better to just write them out explicitly, maybe?
What do you mean by writing variables explicitly?
UPD. I just realized you use static arrays to group several variables. In this case you may want to use structs instead, e.g.:

```julia
julia> using XGrad

julia> struct S
           a::Float64
           b::Float64
           c::Float64
       end

julia> f(s::S) = s.a + s.b * s.c
f (generic function with 1 method)

julia> df = xdiff(f, s=S(1, 2, 3))
f_deriv_666 (generic function with 2 methods)

julia> df(S(3, 2, 1))
(5.0, S(1.0, 1.0, 2.0))
```
The result is another struct, with each field holding the derivative w.r.t. the corresponding field of the input struct. I use this feature to pass ML models, which normally have 5-20 parameters, since passing all of them around individually is quite tedious.
~~EDIT: this looks like a problem at my end, maybe ignore for the moment.~~ EDIT: I get the error message below consistently.
Hm, if I run the example

```julia
using StaticArrays
using XGrad

f(x) = ((x[1]*x[2])*x[3] + (x[1]*x[4])*x[5]) + ((x[2]*x[4])*x[6] + (x[3]*x[5])*x[6])
r = @SVector rand(6)
df = XGrad.xdiff(f, ctx=Dict(:codegen => VectorCodeGen()), x=r)
```

I get
```julia
julia> df = XGrad.xdiff(f, ctx=Dict(:codegen => VectorCodeGen()), x=r)
ERROR: UndefVarError: tmp677 not defined
Stacktrace:
 [1] eval(::Module, ::Any) at ./boot.jl:235
 [2] (::XGrad.##21#22{Espresso.ExGraph})(::Symbol) at ./<missing>:0
 [3] collect(::Base.Generator{Array{Symbol,1},XGrad.##21#22{Espresso.ExGraph}}) at ./array.jl:475
 [4] rev_step!(::Espresso.ExGraph, ::Espresso.ExGraph, ::Espresso.ExNode{:call}) at /Users/ortner/.julia/v0.6/XGrad/src/xdiff.jl:166
 [5] reverse_pass!(::Espresso.ExGraph) at /Users/ortner/.julia/v0.6/XGrad/src/xdiff.jl:230
 [6] _xdiff(::Espresso.ExGraph) at /Users/ortner/.julia/v0.6/XGrad/src/xdiff.jl:239
 [7] #xdiff#29(::Dict{Any,Any}, ::Array{Any,1}, ::Function, ::Expr) at /Users/ortner/.julia/v0.6/XGrad/src/xdiff.jl:277
 [8] (::XGrad.#kw##xdiff)(::Array{Any,1}, ::XGrad.#xdiff, ::Expr) at ./<missing>:0
 [9] #xdiff#32(::Dict{Symbol,Espresso.VectorCodeGen}, ::Array{Any,1}, ::Function, ::Function) at /Users/ortner/.julia/v0.6/XGrad/src/xdiff.jl:307
 [10] (::XGrad.#kw##xdiff)(::Array{Any,1}, ::XGrad.#xdiff, ::Function) at ./<missing>:0
```
This is with Julia v0.6.2,

```julia
julia> Pkg.status("StaticArrays")
 - StaticArrays 0.7.0

julia> Pkg.status("XGrad")
 - XGrad 0.1.0+ jacobian
```
Tbh, I'd rather avoid the struct trick; it is actually much more natural to work with SVectors. But I may get back to it if nothing else works.
Oops, I forgot to push the changes in Espresso.jl. Please check out its latest master.
I also recommend adding at least one empty line before a function to be differentiated, as described in the Code Discovery section of the docs. Normally it's not needed, but extracting function sources in Julia is quite fragile, so it's better to be on the safe side.
This all works now, thank you! But it is strangely slow due to many allocations?

```julia
@btime f($r)
@btime df($r)
# 2.265 ns (0 allocations: 0 bytes)
# 21.801 μs (283 allocations: 11.61 KiB)
```
Yes, as well as due to lots of unnecessary work. Let me explain it in a bit more detail.
In most ML applications (which I originally designed XGrad for) you only call `getindex` once, e.g.:

```julia
A = ...
loss = A[1]
```

The derivative of the scalar `loss` w.r.t. the array `A` in this case is (roughly):

```julia
dloss!dA = zeros(size(A))
dloss!dA[1] = 1.0
```

So all but the first element of `dloss!dA` are zeros.
However, if the loss (i.e. the output variable) is constructed from several elements of `A`, the generated code becomes quite suboptimal:

```julia
A = ...
loss = A[1] + A[2] * A[3]

dloss!dA__1 = zeros(size(A))
dloss!dA__1[1] = ...
dloss!dA__2 = zeros(size(A))
dloss!dA__2[2] = ...
dloss!dA__3 = zeros(size(A))
dloss!dA__3[3] = ...
# sum individual derivatives to obtain the whole dloss!dA
dloss!dA = dloss!dA__1 .+ dloss!dA__2 .+ dloss!dA__3
```
So for 3 components of `A` we first allocate 3 zero arrays, fill in only one element of each, and then combine the arrays together. Needless to say, this is highly suboptimal, especially for static arrays.
One way to overcome this is to use plain tuples instead, which generate code similar to this:

```julia
A = ...
loss = A[1] + A[2] * A[3]

dloss!dA__1 = ...
dloss!dA__2 = ...
dloss!dA__3 = ...
dloss!dA = (dloss!dA__1, dloss!dA__2, dloss!dA__3)
```
This runs significantly faster than the static-array version:

```julia
using XGrad
using BenchmarkTools

f(x) = ((x[1]*x[2])*x[3] + (x[1]*x[4])*x[5]) + ((x[2]*x[4])*x[6] + (x[3]*x[5])*x[6])
r = tuple(rand(6)...)
df = XGrad.xdiff(f; x=r)

@btime f($r)
@btime df($r)
# 2.026 ns (0 allocations: 0 bytes)
# 131.121 ns (8 allocations: 864 bytes)
```
So it's ~170x faster than the version with static arrays, although still ~65x slower than the original function.
With a bit of semi-manual optimization I can get the following expression:

```julia
function df_manual(x)
    tmp917 = 3
    dtmp946!dtmp927 = 1.0
    tmp920 = 1
    tmp931 = 4
    tmp932 = x[tmp931]
    tmp918 = x[tmp917]
    tmp925 = 5
    tmp926 = x[tmp925]
    tmp929 = 2
    tmp930 = x[tmp929]
    tmp933 = tmp930 * tmp932
    tmp913 = x[tmp920]
    tmp934 = 6
    tmp935 = x[tmp934]
    tmp936 = tmp933 * tmp935
    dtmp946!dtmp916 = tmp918 * dtmp946!dtmp927
    dtmp946!dtmp915 = tmp913 * dtmp946!dtmp916
    dtmp946!dtmp935 = tmp933 * dtmp946!dtmp927
    dtmp946!dtmp913 = tmp930 * dtmp946!dtmp916
    tmp916 = tmp913 * tmp930
    dtmp946!dtmp918 = tmp916 * dtmp946!dtmp927
    tmp919 = tmp916 * tmp918
    tmp924 = tmp913 * tmp932
    tmp927 = tmp924 * tmp926
    tmp928 = tmp919 + tmp927
    dtmp946!dtmp926 = tmp924 * dtmp946!dtmp927
    dtmp946!dtmp924 = tmp926 * dtmp946!dtmp927
    dtmp946!dtmp923 = tmp913 * dtmp946!dtmp924
    dtmp946!dx = (dtmp946!dtmp913, dtmp946!dtmp915, dtmp946!dtmp918, dtmp946!dtmp923, dtmp946!dtmp926, dtmp946!dtmp935)
    tmp941 = tmp918 * tmp926
    tmp944 = tmp941 * tmp935
    tmp945 = tmp936 + tmp944
    tmp946 = tmp928 + tmp945
    tmp964 = (tmp946, dtmp946!dx)
end
```

which runs in:

```julia
@btime df_manual($r)
# 9.744 ns (0 allocations: 0 bytes)
```
I think I can make it work that fast automatically for tuples. However, I'm really not sure about static arrays: they are somewhere in between normal arrays and tuples, and I don't yet see the best way to handle them.
Does it make sense to you to use tuples instead of static arrays?
Hi, for my application I am perfectly happy with tuples. I'll give this a try.
You probably know this, but static arrays are basically just tuples: `x.data` is a tuple.
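A minimal demonstration of that point (plain Julia, no XGrad involved):

```julia
using StaticArrays

v = @SVector [1.0, 2.0, 3.0]
v.data              # the underlying NTuple{3,Float64}: (1.0, 2.0, 3.0)
Tuple(v) == v.data  # true
```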
Thanks for putting so much effort into this. I will try this for a while and report back if I run into serious problems.
> You probably know this, but static arrays are basically just tuples: x.data is a tuple.
Yes, from the implementation point of view, but not from the point of view of code generation in XGrad, as you can see :)
BTW, please update the `jacobian` branch - I've just fixed some code for tuples there.
Seems to be working quite well so far. And if this last performance problem is really fixable, that would be great. Anything I can help with?
I notice that the exponents in expressions like `x[n]^p` are all turned into variables. Does this not mean that you will call `pow` instead of doing fast integer exponentiation?
It isn't really a problem for me, since I resolve the exponentiation manually; I was just wondering.
Here is an observation I thought was interesting:

```julia
function ff(x)
    x1 = x[1]
    x2 = x[2]
    x3 = x[3]
    x4 = x[4]
    x5 = x[5]
    x6 = x[6]
    ((x1*x2)*x3 + (x1*x4)*x5) + ((x2*x4)*x6 + (x3*x5)*x6)
end

dff = XGrad.xdiff(ff, x=tuple(r...))
@btime dff($(tuple(r...)))
```
This times at ca. 200ms instead of 380ms (original).
I don't mind at all converting my functions like that if it would help. So for me it's now a matter of finding out where the allocation is coming from?
> I notice that the exponents in expressions like x[n]^p are all turned into variables. Does this not mean that you will call pow instead of doing a fast integer exponentiation?
The Julia compiler seems to be smart enough to handle both cases efficiently; at least, I haven't noticed any difference in my experiments. If your results differ, please report and I'll incorporate a fix.
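One way to probe this claim yourself is sketched below. The relevant background (not stated in the thread) is that Julia lowers a power with a *literal* integer exponent to `Base.literal_pow`, which can specialize to repeated multiplication, while a variable exponent goes through the generic `^`:

```julia
using BenchmarkTools

f_lit(x) = x^3      # literal exponent: lowered via Base.literal_pow
f_var(x, p) = x^p   # variable exponent: generic ^(x, p)

x = 1.2345
@btime f_lit($x)
@btime f_var($x, 3)  # compare the two timings on your machine
```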
> this times at ca. 200ms instead of 380ms (original).
Hm, for me the difference is negligible: 131 vs. 129 ns, on both Julia 0.6.0 and 0.6.2. Are your benchmarks reproducible on your machine? Also, did you mean 200ns instead of 200ms (ns vs. ms)?
> I don't mind at all converting my functions like that if it would help. So it's now a matter for me of finding out where the allocation is coming from?
It turns out most of this time comes from an unnecessary optional parameter in the generated function (recall that XGrad was designed for ML tasks, where constant factors like optional parameters don't really matter). To overcome it, I've added a (perhaps temporary) flag for skipping this optional argument. Please check out the `jacobian` branch and run:

```julia
ctx = Dict{Any,Any}(:nomem => true)
df = XGrad.xdiff(f, ctx=ctx, x=tuple(r...))
@btime df($r)
# 22.954 ns (2 allocations: 128 bytes)
```
I get a consistent timing of 32.384 ns (2 allocations: 128 bytes), just a slower machine I guess. Fantastic! ~~I ran into trouble with a bigger system though. Let me just double-check this and then post it here.~~
So for scalar outputs, I think this is already very nice. Do you want to keep it open to discuss the last allocation that still messes up the performance a bit?
Yeah, let's keep it open for now. Also, I don't want to merge it into master until a more future-proof way to specify code generation options (e.g. `:nomem`) is found.
Thank you for your efforts.
I'll report back here if I find anything else.
I've just got back to this issue and found out that the remaining 2x difference came from incorrect measurements. When I changed this:

```julia
@btime df($r)
```

to this (interpolating the function itself):

```julia
@btime $df($r)
```

I got the same 9.7 ns as in the completely manual version.
Closing this issue and moving to jacobians :)
I am opening this to ask for help with some issues I have before I can go back to the Jacobians (#15).
Here is a very simple example of the kind of functions I am trying to differentiate:
(I put brackets so I don't have to declare new differentiation rules here, but normally I'd do that)
My first problem is:
which gives
Is there a work-around?