Closed GiggleLiu closed 5 years ago
So this is actually working entirely by accident; from Zygote's perspective a complex number is just a pair of reals (re, im), and the gradient value reflects that.
I'd be happy to help you start to add real support for this though. The first step would be to add gradient definitions for real and imag that return complex numbers rather than structs like this. Then we just need gradient definitions for things like log. Many definitions – like abs2 – might work automatically if they are fairly simple.
The other thing apart from making this work is developing a real API for it; but see here for some good thinking on that.
I just added those definitions in https://github.com/FluxML/Zygote.jl/commit/f2a85600981b672d150f01dffa8b0163052c6353, so a couple of simple things work now.
julia> gradient(x -> abs2(log(x)), 1+2im)
(0.21478062759933578 - 0.29549955871733735im,)
Let me know if I can help with any more support for this stuff.
Zygote is like magic! I sincerely hope it can be stabilized and documented as soon as possible, so that we can try cool stuff with it.
Since I am not familiar with this brand-new framework, could you please explain a bit why it is able to treat a complex number as a tuple by accident? Is it decoding the underlying structure of an instance?
Glad you like it :)
That's the gist of it. As a source-to-source AD, Zygote sees (and has to differentiate through) everything in your code, including things like the creation of structs. So if you write something like
struct Foo; x; end
foo.x
This effectively has a gradient definition like
@grad getfield(f::Foo, :x) = f.x, dx -> ((x = dx,),)
where we've by default decided that the "gradient" of the Foo struct should be a named tuple. (You can of course change this, which is what I did in the complex numbers commit.)
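As a small self-contained sketch of that default (the Foo struct is the one above; the squaring closure and the value 3.0 are just made up for illustration):

```julia
using Zygote

struct Foo
    x
end

# Differentiating through field access: the getfield adjoint packs the
# gradient into a named tuple mirroring Foo's fields.
g = gradient(foo -> foo.x^2, Foo(3.0))[1]
# g is the named tuple (x = 6.0,)
```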
Even with the fix on master I think this is incorrect?
using Zygote
myeps = 1e-8
mydiff(f,z) = (f(z+myeps)-f(z))/myeps, (f(z+im*myeps)-f(z))/myeps
f(z) = abs2(log(z))
@show gradient(f,1.0+2im)
@show mydiff(f,1.0+2im)
gradient(f, 1.0 + 2im) = (1.2076065567220926 - 0.20091567785600395im,)
mydiff(f, 1.0 + 2im) = (-0.5638313371747472, 1.0866346933369186)
I'm not 100% sure how that @grad thing works, but in general the derivative of a complex-to-complex map such as log cannot be expressed using a single complex number (it's a 2x2 matrix). For holomorphic functions it can, but C->R functions are not holomorphic, so reverse-mode AD should not take holomorphy into account (except possibly as an optimization); see more details in https://github.com/JuliaDiff/Capstan.jl/issues/1. Differentiating it as a tuple of reals is the way to go (at least at the beginning).
To clarify what's going on, a simple example: f(z) = imag(log(z)) at 1+im, giving
gradient(f, 1.0 + 1im) = (0.5 + 0.5im,)
mydiff(f, 1.0 + 1im) = (-0.49999998585903427, 0.4999999969612645im)
The gradient computation is (with h an infinitesimal)
f(z+h) = imag(log(z+h)) = f(z) + imag(h/z) = f(z) + real(h)*imag(1/z) + imag(h)*real(1/z)
hence the gradient is (imag(1/z), real(1/z)) = -0.5, 0.5. I'm guessing the AD code currently does something like assuming imag is holomorphic, hence the error?
@antoine-levitt Thanks for the explanation, but I think Zygote already treats a complex number as a tuple of real numbers. The problem here is probably in log?
One thing we should notice is that gradients are different from derivatives for complex numbers (the first equation in https://giggleliu.github.io/2018/02/01/complex_bp.html). This means grad(log(x)) is 1/x* rather than 1/x, quite non-intuitive, right? This is exactly what question 3 wants to emphasize.
An easy fix would be to wrap a conj around every holomorphic function's gradient in the backward pass. Hopefully, for real numbers, this wrapper can always be optimized away by the compiler.
Indeed, that was the problem; the correct definition is
Zygote.@grad log(z::Complex) = (log(z), y -> (y*conj(1/z),))
I got confused by the mentions of holomorphy; this has nothing to do with holomorphy (it's just that for holomorphic functions the gradient operation happens to be a complex multiplication).
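A quick way to sanity-check that corrected rule without Zygote: under the convention used above, the abs2 step feeds the seed 2*log(z) into the log pullback, so the full gradient of f(z) = abs2(log(z)) should be 2*log(z)*conj(1/z). Comparing against central finite differences (the values and tolerances here are my own, in the style of mydiff above):

```julia
# Sanity check for the conj-wrapped log rule; no Zygote required.
f(z) = abs2(log(z))
z = 1.0 + 2.0im

# Backward pass by hand: abs2 contributes the seed 2*log(z),
# and the corrected log rule multiplies by conj(1/z).
g = 2 * log(z) * conj(1 / z)   # gradient, read as df/dx + im*df/dy

h = 1e-6
dfdx = (f(z + h) - f(z - h)) / (2h)           # ∂f/∂x
dfdy = (f(z + im * h) - f(z - im * h)) / (2h) # ∂f/∂y

println(g)   # ≈ -0.5638 + 1.0866im, matching the mydiff numbers above
```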
There's the possibility of nasty bugs here, with gradients defined with real numbers in mind getting used with complex numbers. As long as the real functions are typed appropriately this should be fine.
You are right, and it seems the function in https://github.com/FluxML/Zygote.jl/commit/f2a85600981b672d150f01dffa8b0163052c6353 should be fixed.
Yeah this is probably just a bug in the gradient definition. I've pushed a fix but I haven't been that careful about implementing this (we really need gradient checking tests and so on); I'm happy to help anyone who wants to hack on that or improve things generally.
Would this be expected to work in the latest release?
using Zygote
f(z) = z^2
gradient(f, 1+5im)
Currently it gives the error "Function output is not scalar" because of this check. Now, I know the complex derivative isn't defined for non-holomorphic functions, so I'm not sure if gradient is expected to work, but in the case of holomorphic functions like this, which do have a well-defined complex derivative, it would be slick if the expected answer popped out.
edit: I no longer think it's a good idea to have gradient work this way for holomorphic functions, see below
Seems fishy to use gradient for holomorphic derivatives, since gradient strongly suggests a different input and output space. For holomorphic derivatives you probably want forward mode anyway?
I think it'd be fine to just change that restriction to Number and have this work. I'm a little wary of that being misleading, given that people sometimes effectively want a Jacobian, but it's an obvious enough generalisation that it might be ok, and you can always build anything else on top of that.
Bump. I'd really like this to happen. From past discussion I seem to recall that for holomorphic functions the current infrastructure is mostly capable of handling complex numbers.
The following errors on Zygote#master
using Zygote #(master)
f(z)=z^2
y, back = Zygote.forward(x->f(x), 1.0+5.0im)
back(1)
julia> back(1)
ERROR: MethodError: no method matching (::Zygote.Jnew{Complex{Float64},Nothing,false})(::Int64)
Closest candidates are:
Jnew(::Union{Nothing, NamedTuple}) where {T, G} at /Users/filippovicentini/.julia/packages/Zygote/VpeBE/src/lib/lib.jl:209
Stacktrace:
[1] (::getfield(Zygote, Symbol("##319#back#187")){Zygote.Jnew{Complex{Float64},Nothing,false}})(::Int64) at /Users/filippovicentini/.julia/packages/Zygote/VpeBE/src/lib/grad.jl:45
[2] Type at ./complex.jl:12 [inlined]
[3] (::typeof(∂(Complex{Float64})))(::Int64) at /Users/filippovicentini/.julia/packages/Zygote/VpeBE/src/compiler/interface2.jl:0
[4] Type at ./complex.jl:12 [inlined]
[5] (::typeof(∂(Complex)))(::Int64) at /Users/filippovicentini/.julia/packages/Zygote/VpeBE/src/compiler/interface2.jl:0
[6] * at ./complex.jl:268 [inlined]
[7] (::typeof(∂(*)))(::Int64) at /Users/filippovicentini/.julia/packages/Zygote/VpeBE/src/compiler/interface2.jl:0
[8] literal_pow at ./intfuncs.jl:243 [inlined]
[9] f at ./none:1 [inlined]
[10] (::typeof(∂(f)))(::Int64) at /Users/filippovicentini/.julia/packages/Zygote/VpeBE/src/compiler/interface2.jl:0
[11] #11 at ./none:1 [inlined]
[12] (::typeof(∂(getfield(Main, Symbol("##11#12"))())))(::Int64) at /Users/filippovicentini/.julia/packages/Zygote/VpeBE/src/compiler/interface2.jl:0
[13] (::getfield(Zygote, Symbol("##71#72")){typeof(∂(getfield(Main, Symbol("##11#12"))()))})(::Int64) at /Users/filippovicentini/.julia/packages/Zygote/VpeBE/src/compiler/interface.jl:38
[14] top-level scope at none:0
If some changes to master are needed and they aren't extraordinarily complicated, I'm more than happy to make them myself if you guide me.
Fixed that example in https://github.com/FluxML/Zygote.jl/commit/2c30968365c08a893b4dcf022a15887800e1b979; I just forgot the gradient for the Complex constructor earlier. I also made the ::Number change there.
julia> gradient(x -> x^2, 1+5im)
(2 - 10im,)
I'm closing this issue for now, I think the core is there now (broadcast works too thanks to #100) so we can open new ones for any specific things that break.
I think this is a really bad idea. It will silently fail if the function is not holomorphic:
julia> gradient(x -> x^2, 1+5im)
(2 - 10im,)
julia> gradient(x -> conj(x)^2, 1+5im)
(2 - 10im,)
or even if the function is holomorphic but has non-holomorphic parts inside:
julia> gradient(x -> (real(x)+im*imag(x))^2, 1+5im)
(4 - 20im,)
The only target this could plausibly aim at is the optimized computation of holomorphic C^n to C functions (https://en.wikipedia.org/wiki/Several_complex_variables). This seems to me a marginal use case compared to computing derivatives of holomorphic C->C functions (better tackled by ForwardDiff) or gradients of C^n->R functions (which already work), and it can be done efficiently by computing the gradient of the real part and using the Cauchy-Riemann equations.
The second example is a bug that would have come up anyway, I pushed a fix.
julia> gradient(x -> (real(x)+imag(x)*im)^2, 1+5im)
(2 - 10im,)
In general any issues that come up with this are going to come up if you take the gradient of real(f(x)); the question really comes down to whether this is the only reasonable interpretation when the output is complex (as opposed to whether that use case is in itself common or not, since if you wanted those other things you'd write them down differently).
For the conj(x) example, what would be your expected output?
I just don't think gradients of complex-valued functions should be defined at all. Assuming that you still want to do it, there are three cases, for scalar-valued functions:
Given a C^n to C function, how do you choose what to return between the last two? This change chose version 2. Given this, there are two options: either blacklist non-holomorphic operations (real, imag, conj, abs) on complex numbers, or accept that a wrong result will be silently returned for functions that are non-holomorphic in any part of their computational graph, even if they are themselves holomorphic (the (re(z)+im(z))^2 above). Both these options seem unreasonable to me, especially when users can easily get holomorphic derivatives from the gradient of z -> real(f(z)) and the Cauchy-Riemann equations.
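To spell out the Cauchy-Riemann route mentioned here (standard complex analysis, writing f = u + iv for holomorphic f):

```latex
f'(z) = \frac{\partial u}{\partial x} + i\,\frac{\partial v}{\partial x}
      = \frac{\partial u}{\partial x} - i\,\frac{\partial u}{\partial y}
```

so the holomorphic derivative is just the conjugate of the gradient of z -> real(f(z)) (reading that gradient as du/dx + i*du/dy), which is the recipe suggested above.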
This is why I think the gradient of complex-valued functions should just be left undefined. Users can choose if they want to compute jacobians or holomorphic derivatives themselves.
In general any issues that come up with this are going to come up if you take the gradient of real(f(x));
Why? The gradient of a C^n to R function is well-defined, and follows by just seeing C as R^2.
Hmm, I just tried and:
julia> gradient(x -> abs2(x), 1+5im)
(2 - 10im,)
This is wrong, according to the only "reasonable" definition of gradient for a C^n to R function: nabla_f[i] = df/d(re x_i) + df/d(im x_i)*i. This is the definition that follows from seeing ComplexF64 as a struct of two reals, and the definition that is useful for e.g. gradient descent optimization.
Ok, so we need to be careful about any non-holomorphic functions in DiffRules. Fixed in https://github.com/FluxML/Zygote.jl/commit/e1b00f78665486343d397a82590c27aa736e8bce.
I agree that neither of those options are good, but I don't yet see that they are inevitable. Let's zero in on:
a wrong result will be silently returned on functions that are non-holomorphic in any part of their computational graph, even if they are themselves holomorphic (the (re(z)+im(z))^2 above).
I'd like to see an example, work through any issues with our current gradient definitions, and see whether there's a fundamental conflict. I don't currently see how this can be the case: there's only one gradient for a holomorphic function, and if Zygote doesn't get that right then it's a bug.
Related to the just-submitted fix in DiffRules: is it really possible to approach this with the suggested blacklist? Isn't a whitelist a better solution, as most functions will not be holomorphic?
To the more general question: I would be eager to provide examples, but as @antoine-levitt alluded to, Zygote currently does not seem to have an official stance on the definition of the "derivative of a C^n -> C function". One definition is "a 2n-by-2 Jacobian of real values" (the only general one I can think of). Other definitions have to deal with whether the function is holomorphic or not.
In particular, here are some confusing results on current master:
All of these should be 2x2 real matrices in my opinion:
julia> gradient(z -> real(z), 10+20im) #
(1 + 0im,)
julia> gradient(z -> imag(z), 10+20im) # I do not understand this result
(0 + 1im,)
julia> gradient(z -> real(z)+imag(z), 10+20im) # this seems definitely wrong
(1 + 1im,)
Compare these two, which are supposedly the same function, but one of them is explicitly holomorphic
julia> gradient(z -> z, 10+20im) # makes sense if you decide "Zygote derivative" means "complex derivative for holomorphic functions"
(1,)
julia> gradient(z -> real(z)+imag(z), 10+20im) # should this be considered a holomorphic function?
(1 + 1im,)
I'd like to see an example, work through any issues with our current gradient definitions and see that there's a fundamental conflict. I don't currently see how this can be the case: there's only one gradient for a holomorphic function, and if Zygote doesn't get that right then it's a bug.
I think I disagree with this statement. There are derivative-like operations that make sense in mathematical sets with nice structure (e.g. Lie derivatives or directional derivatives in differential geometry; complex derivatives of holomorphic functions in complex analysis). These derivative-like operations share the important property that they are basis-independent. When implemented on a computer (unless we are using symbolic algebra systems), we usually choose a basis (e.g. x+iy or r*exp(iφ) for complex analysis). The moment we have chosen a basis, all the nice basis-independent entities can be expressed numerically, but we also lose all promises that the derivative-like operations continue to make sense.
The real numbers are an exception, because they are 1D, where these issues simply do not arise. For everything else we have to explicitly choose to interpret expressions in basis-independent form (e.g. holomorphic functions act like that) or in basis-dependent form (e.g. using real and imag and conj). There is no general way to know whether a particular basis-dependent expression is actually equivalent to a basis-independent one. It is in general impossible to know whether a mixture of imag/real/conj/abs/angle ends up being a holomorphic function.
Edit: The use of the word "basis" above was mathematically imprecise / abusive. The correct term would be a "coordinate patch" or something similar.
The cases you've listed are actually the easy ones: if the function output l is real, then the gradient with respect to x+yi is perfectly well defined as the pair dl/dx, dl/dy (which we can also express as a complex number). The Jacobian would be trivial here (second column is zero). This is also identical to treating Complex as if it were a struct of two reals; being that way is a design constraint, since Zygote mechanically composes adjoint operations on machine numbers and isn't designed to handle arbitrarily abstract notions of a derivative. I don't think the real-valued-function case is at issue here, only the complex-valued one.
Zygote's current position on complex gradients is that gradient is actually just sugar for
y, pullback = forward(f, x)
dx = pullback(1)
This naturally generalises to complex numbers; it's actually disabling them that involves a special case. You can also think of this as the gradient of real(f(x)), or the first column of the Jacobian; to get the other column we'd use pullback(im). Antoine's concerns notwithstanding, I don't think the issue here is whether this is a valid generalisation, but whether it's potentially confusing given that users may expect a different generalisation.
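A sketch of that reading on a non-holomorphic example (assuming the current Zygote.pullback spelling of what this thread calls forward; conj(z)^2 is the example from earlier in the thread, the expected values are worked out by hand in the comments):

```julia
using Zygote

f(z) = conj(z)^2     # non-holomorphic: f = (x - iy)^2 = x^2 - y^2 - 2xy*i

y, back = Zygote.pullback(f, 1.0 + 5.0im)
col1 = back(1)[1]    # gradient of real(f); for u = x^2 - y^2 at (1, 5): 2 - 10im
col2 = back(im)[1]   # gradient of imag(f); for v = -2xy at (1, 5): -10 - 2im
```

Together the two seeds recover the full 2x2 real Jacobian that a single complex number cannot express.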
The only other option we have, AFAICT, is to return a Jacobian. But I think this has to be an explicit, separate function. Reasoning: writing gradient(x -> f(x) + 0im, x) shouldn't totally change the structure that gradient returns; it should either be essentially equivalent to the f(x) case or error out. So I think those are our two options. I'm not that opposed to making people write real(f(x)) if they want the current behaviour, but I'd also like to properly understand the concerns around doing that by default.
Thanks, that clarified some things.
The gradient as defined for a C -> R function is confusing to me, but it is indeed consistent and well defined. I guess it is useful for gradient descent, which is probably the main use case.
The derivative of explicitly holomorphic C -> C functions currently works as expected in complex analysis. It is not an obvious generalization of the C -> R case and it is useless for gradient descent (as you cannot order complex numbers), but as long as it does not conflict with other things, I guess it is a nice thing to have.
The intermediate regime between the two (functions that are holomorphic, but written in a way that does not make that explicit) is very confusing to me... Could you please help me reason through this example (I think it is a bug):
# this is the z->z function written in polar coordinates
julia> gradient(z -> sqrt(abs2(z))*exp(1im*angle(z)), 10+20im)
(0.19999999999999996 + 0.4000000000000001im,)
EDIT: there were a couple more examples, that were not actually as problematic. I erased them in order to not distract.
To rephrase: for C -> C functions Zygote currently does two different things, and it is not obvious how it chooses which one to do.
It either returns pullback(1), which does not seem useful (as you already mentioned, pullback(im) is also necessary). Maybe gradient raising an error would make more sense for those cases.
Or it returns the holomorphic derivative, which is useful for holomorphic functions. This also happens to be the first Wirtinger derivative, and it happens to be a combination of pullback(1) and pullback(im).
In the language of this post, currently Zygote switches between the "realistic" and the "native" view of the problem, and it is not clear how it makes that choice: http://www.ekinakyurek.com/complex-derivatives-wirtinger/
Maybe gradient of C->C should always raise an error, and the error message should say "use complex_jacobian or wirtinger_derivatives":
complex_jacobian = pullback(1), pullback(im)
wirtinger_derivatives = (pullback(1) - im*pullback(im))/2, (pullback(1) + im*pullback(im))/2
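For reference, the standard Wirtinger operators behind those formulas (with z = x + iy) are:

```latex
\frac{\partial}{\partial z}      = \frac{1}{2}\left(\frac{\partial}{\partial x} - i\,\frac{\partial}{\partial y}\right),
\qquad
\frac{\partial}{\partial \bar z} = \frac{1}{2}\left(\frac{\partial}{\partial x} + i\,\frac{\partial}{\partial y}\right)
```

A function f is holomorphic exactly when ∂f/∂z̄ = 0 (the Cauchy-Riemann equations), in which case ∂f/∂z coincides with the usual complex derivative f'.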
I'd like to see an example, work through any issues with our current gradient definitions and see that there's a fundamental conflict. I don't currently see how this can be the case: there's only one gradient for a holomorphic function, and if Zygote doesn't get that right then it's a bug.
Hm, thinking a bit more, I see I was straw-manning against something that was proposed for ForwardDiff, and so I was confused: if done correctly, Zygote will actually do the right thing: (implicitly) propagate the 2x2 Jacobian inside the computational graph, and only assume holomorphy for the last step. This is actually quite clever. So OK, this thing should actually work: it's just that for non-holomorphic functions it will not return the full derivative info (that would be the 2x2 Jacobian). This has the potential to be very confusing and should be documented appropriately at some point, but at least it's consistent. Now it's just a matter of making sure it works correctly.
Ok, great, I'm glad we're all on roughly the same page – this has certainly been an enlightening discussion for me. @Krastanov yes, that was indeed a bug which is fixed now. I'll also cross reference that ForwardDiff issue: https://github.com/JuliaDiff/ForwardDiff.jl/issues/157.
To summarise: we have two derivatives we might be interested in, the Wirtinger derivatives and the "Jacobi" ones (i.e. columns of the equivalent R^2 -> R^2 Jacobian; what Zygote natively operates on).
Hopefully it's clear that Zygote isn't switching behaviour at any point here; it just happens that the current Jacobi derivative lines up with more standard things where they exist. Ideally, where it doesn't (complex non-holomorphic), we'd just throw an error, but of course we can't tell automatically.
One thing that might be confusing is that the gradient (as defined here, i.e. the gradient of the real part of f) of a holomorphic function is the conjugate of the usual derivative. As long as this is well documented somewhere, it's fine.
Yes, @ssfrr just raised that point on slack as well. It's an extra complication. I think this is probably in some sense consistent with the fact that Zygote produces adjoints, but we should think through it carefully.
Hope he won't mind me reposting his thoughts:
- C^n->R functions have a gradient but no derivative
- C^n->C^m holomorphic functions have a derivative but no gradient (even for m==1)
- C->C nonholomorphic functions have neither a gradient nor a complex derivative, but you can capture their local linearization with a pair of complex derivatives (the Wirtinger approach) or a 2x2 real jacobian
- I need to think more about the C^n->C^m nonholomorphic case
I’ve been thinking of proposing separate derivative and gradient functions that would capture these cases and throw the appropriate errors (I think you can catch an attempted derivative on nonholomorphic functions by checking the Cauchy-Riemann equations at the end). The nice thing is that derivative generalizes nicely between real/complex/scalar/vector (see my post from yesterday with the 3 cases).
I think it makes some sense for gradient to expose what Zygote is doing in a fairly raw way, and have things like jacobian and derivative return more mathematically interesting things on top of that. Including taking adjoints where necessary, and so on.
non-holomorphic C^n -> C. There the derivative is a 2n x 2 jacobian.
Complex non-holomorphic. Trickier: the holomorphic derivative is not allowed, and Jacobi and Wirtinger diverge.
Normally, what we care about is the backpropagation of the adjoint, in which case the loss must be real. Then there is no intrinsic hardness.
Consider a function z = f(x), where x ~ C^n and z ~ C^m. Given adjoint(z) := 2(dL/dz*), we know dL/dz = adjoint(z)*/2 immediately. With both dL/dz and dL/dz*, it is easy to obtain adjoint(x) = 2(dL/dx*). I wonder why it is necessary to consider an ambitious C^n -> C autodiff?
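The chain rule behind this argument, written with Wirtinger derivatives for a real-valued loss L and an intermediate z = f(x):

```latex
\frac{\partial L}{\partial x^*}
  = \frac{\partial L}{\partial z}\,\frac{\partial z}{\partial x^*}
  + \frac{\partial L}{\partial z^*}\,\frac{\partial z^*}{\partial x^*},
\qquad
\frac{\partial L}{\partial z^*} = \left(\frac{\partial L}{\partial z}\right)^{*}
```

The second identity holds because L is real, which is why propagating only adjoint(z) = 2(dL/dz*) is enough: the other Wirtinger derivative is always recoverable from it at every step.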
BTW: Is conj(x) supported now? (I cannot precompile the master branch right now...)
Can't agree more on separating derivative and gradient.
@antoine-levitt @MikeInnes
Yes, to be clear: the core technical problem is solved, and so gradients of Real functions are done. I don't think anything is going to change there. conj works now, as well as a bunch of other things. It's more of an interface issue about how/if we expose other kinds of derivatives.
From @MikeInnes
I think it makes some sense for gradient to expose what Zygote is doing in a fairly raw way
I'd think gradient would throw an error for functions that don't return Real. I'm not sure what a gradient means for functions returning Complex, in the same way that it's not meaningful for functions with nonscalar output.
Yes, I've disabled that in https://github.com/FluxML/Zygote.jl/commit/d82581f21823daff5afc81b43d24ad91b6c75d51 (that statement wasn't meant as an argument for or against that choice). Although I've specifically disabled Complex rather than all non-Real, as things like Measurements or TaylorSeries can still be useful.
Great, I quickly checked several cases (master branch), and the gradients turn out to be correct. Can't wait to try out new features. :+1:
@MikeInnes, thank you, this is pretty amazing. And yes, your last post made it very clear that "it's clear that Zygote isn't switching behaviour at any point here".
Ok, we now have docs up that clarify these things: http://fluxml.ai/Zygote.jl/dev/complex/
Thanks again for the discussion, and please do report any bugs!
I find that Zygote supports complex number AD, and I like it even more now! But I have several questions.
Question 1: Why does it return a NamedTuple rather than a complex number?
Question 2: Why does the following expression complain with a not-implemented error? log is holomorphic and a complex gradient for abs is already defined in Zygote, so there is no reason to break here. I suggest creating some traits for functions (is that possible?!) to help the program know that log is holomorphic.
Question 3: About conventions: is the way Zygote defines gradients for complex numbers proper? We have two conventions here; I think they are conjugate to each other.
I have an implementation of the latter one here: https://github.com/GiggleLiu/poorman_nn/blob/44b54d4a86f689f528a301e3a4db6b05210bb16a/poornn/functions.py#L1034
I chose the latter convention because we want gradient(log, x) = log'(x) = 1/x rather than gradient(log, x) = log'(x)' = 1/x^*; it is more convenient for holomorphic functions, right?
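The conjugacy claimed in question 3 can be made precise with Wirtinger derivatives (my reading of the two conventions, since the equations themselves are not reproduced here): for a real-valued loss L,

```latex
2\,\frac{\partial L}{\partial z^*}
  = \frac{\partial L}{\partial x} + i\,\frac{\partial L}{\partial y}
  = \left(2\,\frac{\partial L}{\partial z}\right)^{*}
```

so the two candidates are indeed conjugates of each other: the left-hand form is the one that matches gradient descent in (x, y), while the 2∂L/∂z form reduces to the familiar f'(z)-style rules (like 1/x for log) on holomorphic functions.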