FluxML / Zygote.jl

https://fluxml.ai/Zygote.jl/

Automatic differentiation for complex numbers #29

Closed GiggleLiu closed 5 years ago

GiggleLiu commented 6 years ago

I find that Zygote supports complex-number AD; I like it even more now! e.g.

julia> gradient(imag, x)
((re = nothing, im = 1),)

But I have several questions,

Question 1:

Why does it return a NamedTuple rather than a complex number?

Question 2:

Why does the following expression complain with a not-implemented error?

gradient(x-> log(x) |> abs, x)

log is holomorphic, and the complex gradient for abs is already defined in Zygote, so there is no reason to break here. I suggest creating some traits for functions (is that possible?!) to help the program know that log is holomorphic.

Question 3:

About conventions: is the way Zygote defines gradients for complex numbers proper? We have two conventions here.

I think they are conjugate to each other.

I have an implementation of the latter one here: https://github.com/GiggleLiu/poorman_nn/blob/44b54d4a86f689f528a301e3a4db6b05210bb16a/poornn/functions.py#L1034

I chose the latter convention because we want gradient(log, x) = log'(x) = 1/x rather than gradient(log, x) = log'(x)' = 1/x^*; it is more convenient for holomorphic functions, right?
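The conjugate relationship between the two conventions can be checked numerically. A minimal sketch in plain Python (finite differences, not Zygote code), comparing the holomorphic derivative of log against the gradient of its real part:

```python
import cmath

# Finite-difference check (my own sketch, not Zygote) that the two
# conventions for gradient(log, z) are conjugates of each other.
z, h = 1.0 + 2.0j, 1e-7

# Convention A: the plain holomorphic derivative, log'(z) = 1/z.
deriv = (cmath.log(z + h) - cmath.log(z)) / h

# Convention B: the gradient of Re(log z), packed as d/dx + i*d/dy.
du_dx = (cmath.log(z + h) - cmath.log(z)).real / h
du_dy = (cmath.log(z + 1j * h) - cmath.log(z)).real / h
grad = du_dx + 1j * du_dy

# deriv ≈ 1/z, while grad ≈ conj(1/z): the conventions differ by a conjugate.
print(abs(deriv - 1 / z) < 1e-5, abs(grad - (1 / z).conjugate()) < 1e-5)
```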

MikeInnes commented 6 years ago

So this is actually working entirely by accident; from Zygote's perspective a complex number is just a pair of reals (re, im), and the gradient value reflects that.

I'd be happy to help you start to add real support for this though. The first step would be to add gradient definitions for real and imag that return complex numbers rather than structs like this. Then we just need gradient definitions for things like log. Many definitions – like abs2 – might work automatically if they are fairly simple.

The other thing apart from making this work is developing a real API for it; but see here for some good thinking on that.

MikeInnes commented 6 years ago

I just added those definitions in https://github.com/FluxML/Zygote.jl/commit/f2a85600981b672d150f01dffa8b0163052c6353, so a couple of simple things work now.

julia> gradient(x -> abs2(log(x)), 1+2im)
(0.21478062759933578 - 0.29549955871733735im,)

Let me know if I can help with any more support for this stuff.

GiggleLiu commented 6 years ago

Zygote is like magic! I sincerely hope it can be stabilized and documented as soon as possible, so that we can try cool stuff with it.

Since I am not familiar with this brand-new framework, could you please explain a bit why it is able to treat a complex number as a tuple by accident? Is it decoding the underlying structure of an instance?

MikeInnes commented 6 years ago

Glad you like it :)

That's the gist of it. As a source-to-source AD, Zygote sees (and has to differentiate through) everything in your code, including things like the creation of structs. So if you write something like

struct Foo; x; end
foo.x

This effectively has a gradient definition like

@grad getfield(f::Foo, :x) = f.x, dx -> ((x = dx,),)

where we've decided by default that the "gradient" of the foo struct should be a named tuple. (You can of course change this, which is what I did in the complex-numbers commit.)
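A rough analogue in Python may help (hypothetical names, my own sketch; Zygote's real machinery is generated from the IR): the pullback of a field access scatters the incoming cotangent into a container keyed by field, which is why a field-wise NamedTuple falls out as the struct's "gradient".

```python
# Hypothetical sketch of reverse-mode through a struct field access: the
# pullback routes the cotangent back into a per-field container (a dict here,
# a NamedTuple in Zygote).
class Foo:
    def __init__(self, x):
        self.x = x

def getfield_with_pullback(obj, name):
    value = getattr(obj, name)
    def pullback(dv):
        # Scatter the cotangent dv into a container shaped like obj's fields.
        return {name: dv}
    return value, pullback

foo = Foo(3.0)
v, back = getfield_with_pullback(foo, "x")
dv = 2 * v                 # cotangent of x^2 at x = 3
print(back(dv))            # {'x': 6.0}: the struct's "gradient" is field-wise
```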

antoine-levitt commented 6 years ago

Even with the fix on master I think this is incorrect?

using Zygote

myeps = 1e-8
mydiff(f,z) = (f(z+myeps)-f(z))/myeps, (f(z+im*myeps)-f(z))/myeps

f(z) = abs2(log(z))
@show gradient(f,1.0+2im)
@show mydiff(f,1.0+2im)
gradient(f, 1.0 + 2im) = (1.2076065567220926 - 0.20091567785600395im,)
mydiff(f, 1.0 + 2im) = (-0.5638313371747472, 1.0866346933369186)

I'm not 100% sure how that @grad thing works, but in general, the derivative of a complex to complex map such as log cannot be expressed using a single complex number (it's a 2x2 matrix). In the case of holomorphic functions it can, but C->R functions are not holomorphic and so reverse-mode AD should not take holomorphism into account (except possibly as an optimization); see more details in https://github.com/JuliaDiff/Capstan.jl/issues/1. Differentiating it as a tuple of reals is the way to go (at least at the beginning).
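The 2x2-matrix point can be made concrete with finite differences. A plain-Python sketch (my own check of the math, not Zygote code): the Jacobian of the holomorphic log satisfies the Cauchy-Riemann equations, while that of conj cannot be expressed as multiplication by any single complex number:

```python
import cmath

def real_jacobian(f, z, h=1e-7):
    """2x2 Jacobian of f viewed as a map R^2 -> R^2, by finite differences."""
    fx = (f(z + h) - f(z)) / h         # response to a real-axis perturbation
    fy = (f(z + 1j * h) - f(z)) / h    # response to an imaginary-axis one
    return [[fx.real, fy.real],
            [fx.imag, fy.imag]]

z = 1.0 + 2.0j

# log is holomorphic: Cauchy-Riemann (J11 == J22, J12 == -J21) holds, so the
# Jacobian acts like multiplication by the single complex number 1/z.
J = real_jacobian(cmath.log, z)
print(abs(J[0][0] - J[1][1]) < 1e-5 and abs(J[0][1] + J[1][0]) < 1e-5)

# conj is not holomorphic: its Jacobian is [[1, 0], [0, -1]], which no single
# complex multiplication can reproduce.
K = real_jacobian(lambda w: w.conjugate(), z)
print(abs(K[1][1] + 1.0) < 1e-6)
```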

antoine-levitt commented 6 years ago

To clarify what's going on, a simple example: f(z) = imag(log(z)) at 1+im, giving

gradient(f, 1.0 + 1im) = (0.5 + 0.5im,)
mydiff(f, 1.0 + 1im) = (-0.49999998585903427, 0.4999999969612645im)

The gradient computation is (with h an infinitesimal)

f(z+h) = imag(log(z+h)) = f(z) + imag(h/z) = f(z) + real(h)*imag(1/z) + imag(h)*real(1/z)

hence the gradient is (imag(1/z), real(1/z)) = -0.5, 0.5. I'm guessing currently the AD code does something like assuming imag is holomorphic, hence the error?
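That expansion checks out numerically; a quick plain-Python verification (finite differences, not Zygote) at z = 1+i:

```python
import cmath

# Finite-difference check of the expansion above: for f(z) = imag(log(z)),
# the real-axis and imaginary-axis derivatives should be imag(1/z) and
# real(1/z), i.e. (-0.5, 0.5) at z = 1+i.
z, h = 1.0 + 1.0j, 1e-7
f = lambda w: cmath.log(w).imag

df_dx = (f(z + h) - f(z)) / h
df_dy = (f(z + 1j * h) - f(z)) / h

print(abs(df_dx - (1 / z).imag) < 1e-5, abs(df_dy - (1 / z).real) < 1e-5)
```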

GiggleLiu commented 6 years ago

@antoine-levitt Thanks for the explanation, but I think Zygote already treats a complex number as a tuple of real numbers. The problem here is probably in log?

One thing we should notice is that gradients are different from derivatives for complex numbers (see the first equation in https://giggleliu.github.io/2018/02/01/complex_bp.html). This means grad(log(x)) is 1/x* rather than 1/x; quite non-intuitive, right? This is exactly what Question 3 wants to emphasize.

GiggleLiu commented 6 years ago

An easy fix would be to wrap a conj around every holomorphic function's gradient in the backward pass. Hopefully, for real numbers, this wrapper can always be optimized away by the compiler.

antoine-levitt commented 6 years ago

Indeed, that was the problem; the correct definition is

Zygote.@grad log(z::Complex) = (log(z), y -> (y*conj(1/z),))

I got confused by the mentions of holomorphism; this has nothing to do with holomorphism (it's just that for holomorphic functions the gradient operation happens to be a complex multiplication).
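With the corrected rule, the whole chain for abs2(log(z)) can be replayed by hand. A plain-Python sketch (my own, assuming the convention that the gradient of a real loss is dL/dx + i*dL/dy, under which the gradient seed from abs2 at w is 2w and the corrected log rule multiplies the incoming adjoint by conj(1/z)):

```python
import cmath

# Backward pass by hand for L(z) = abs2(log(z)) at z = 1+2i, under the
# convention grad = dL/dx + i*dL/dy (assumed; see the thread's discussion).
z = 1.0 + 2.0j
w = cmath.log(z)
grad = (2 * w) * (1 / z).conjugate()   # abs2 seeds 2w; log contributes conj(1/z)

# Compare against the same finite-difference probe used earlier in the thread.
h = 1e-7
L = lambda u: abs(cmath.log(u)) ** 2
fd = (L(z + h) - L(z)) / h + 1j * (L(z + 1j * h) - L(z)) / h

print(abs(grad - fd) < 1e-4)   # matches mydiff's (-0.5638..., 1.0866...)
```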

There's the possibility of nasty bugs here, with gradients defined with real numbers in mind getting used with complex numbers. As long as the real functions are typed appropriately this should be fine.

GiggleLiu commented 6 years ago

You are right, and it seems the function in https://github.com/FluxML/Zygote.jl/commit/f2a85600981b672d150f01dffa8b0163052c6353 should be fixed.

MikeInnes commented 6 years ago

Yeah this is probably just a bug in the gradient definition. I've pushed a fix but I haven't been that careful about implementing this (we really need gradient checking tests and so on); I'm happy to help anyone who wants to hack on that or improve things generally.

ssfrr commented 5 years ago

Would this be expected to work in the latest release?

using Zygote

f(z) = z^2
gradient(f, 1+5im)

Currently it gives an error "Function output is not scalar" because of this check. Now, I know the complex derivative isn't defined for non-holomorphic functions, so I'm not sure whether gradient is expected to work; but in the case of holomorphic functions like this one, which do have a well-defined complex derivative, it would be slick if the expected answer popped out.

edit: I no longer think it's a good idea to have gradient work this way for holomorphic functions, see below

antoine-levitt commented 5 years ago

Seems fishy to use gradient for holomorphic derivatives, since gradient strongly suggests a different input and output space. For holomorphic derivatives you probably want forward-mode anyway?

MikeInnes commented 5 years ago

I think it'd be fine to just change that restriction to Number and have this work. I'm a little wary of that being misleading, given that people sometimes effectively want a Jacobian, but it's an obvious enough generalisation that it might be ok, and you can always build anything else on top of that.

PhilipVinc commented 5 years ago

Bump. I really would like this to happen. From past discussion I seem to recall that, for holomorphic functions, the current infrastructure is mostly capable of handling complex numbers.

The following errors on Zygote#master

using Zygote #(master)
f(z)=z^2
y, back = Zygote.forward(x->f(x), 1.0+5.0im)
back(1)
julia> back(1)
ERROR: MethodError: no method matching (::Zygote.Jnew{Complex{Float64},Nothing,false})(::Int64)
Closest candidates are:
  Jnew(::Union{Nothing, NamedTuple}) where {T, G} at /Users/filippovicentini/.julia/packages/Zygote/VpeBE/src/lib/lib.jl:209
Stacktrace:
 [1] (::getfield(Zygote, Symbol("##319#back#187")){Zygote.Jnew{Complex{Float64},Nothing,false}})(::Int64) at /Users/filippovicentini/.julia/packages/Zygote/VpeBE/src/lib/grad.jl:45
 [2] Type at ./complex.jl:12 [inlined]
 [3] (::typeof(∂(Complex{Float64})))(::Int64) at /Users/filippovicentini/.julia/packages/Zygote/VpeBE/src/compiler/interface2.jl:0
 [4] Type at ./complex.jl:12 [inlined]
 [5] (::typeof(∂(Complex)))(::Int64) at /Users/filippovicentini/.julia/packages/Zygote/VpeBE/src/compiler/interface2.jl:0
 [6] * at ./complex.jl:268 [inlined]
 [7] (::typeof(∂(*)))(::Int64) at /Users/filippovicentini/.julia/packages/Zygote/VpeBE/src/compiler/interface2.jl:0
 [8] literal_pow at ./intfuncs.jl:243 [inlined]
 [9] f at ./none:1 [inlined]
 [10] (::typeof(∂(f)))(::Int64) at /Users/filippovicentini/.julia/packages/Zygote/VpeBE/src/compiler/interface2.jl:0
 [11] #11 at ./none:1 [inlined]
 [12] (::typeof(∂(getfield(Main, Symbol("##11#12"))())))(::Int64) at /Users/filippovicentini/.julia/packages/Zygote/VpeBE/src/compiler/interface2.jl:0
 [13] (::getfield(Zygote, Symbol("##71#72")){typeof(∂(getfield(Main, Symbol("##11#12"))()))})(::Int64) at /Users/filippovicentini/.julia/packages/Zygote/VpeBE/src/compiler/interface.jl:38
 [14] top-level scope at none:0

If some changes to master are needed and they are not extraordinarily complicated, I'm more than happy to make them myself if you guide me.

MikeInnes commented 5 years ago

Fixed that example on https://github.com/FluxML/Zygote.jl/commit/2c30968365c08a893b4dcf022a15887800e1b979; I just forgot the gradient for the Complex constructor earlier. I also did the ::Number change there.

julia> gradient(x -> x^2, 1+5im)
(2 - 10im,)

I'm closing this issue for now, I think the core is there now (broadcast works too thanks to #100) so we can open new ones for any specific things that break.

antoine-levitt commented 5 years ago

I think this is a really bad idea. It will silently fail if the function is not holomorphic:

julia> gradient(x -> x^2, 1+5im)
(2 - 10im,)

julia> gradient(x -> conj(x)^2, 1+5im)
(2 - 10im,)

or even if the function is holomorphic but has non-holomorphic parts inside:

julia> gradient(x -> (real(x)+im*imag(x))^2, 1+5im)
(4 - 20im,)

The only target this could plausibly aim at is the optimized computation of holomorphic C^n to C functions (https://en.wikipedia.org/wiki/Several_complex_variables). This seems to me to be a marginal use case compared to computation of derivatives of holomorphic C->C functions (better tackled by forwarddiff) or gradients of C^n->R functions (which already works), and can be done efficiently by computing the gradient of the real part and using the Cauchy-Riemann equations.
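The Cauchy-Riemann route mentioned at the end can be sketched numerically (plain Python; finite differences stand in for the gradient of the real part):

```python
import cmath

# For holomorphic f, Cauchy-Riemann gives f'(z) = d Re(f)/dx - i * d Re(f)/dy,
# so the holomorphic derivative is recoverable from the gradient of Re(f) alone.
def derivative_from_real_gradient(f, z, h=1e-7):
    u = lambda w: f(w).real
    du_dx = (u(z + h) - u(z)) / h
    du_dy = (u(z + 1j * h) - u(z)) / h
    return du_dx - 1j * du_dy

z = 1.0 + 5.0j
d = derivative_from_real_gradient(lambda w: w ** 2, z)
print(abs(d - 2 * z) < 1e-4)   # f'(z) = 2z for f(z) = z^2
```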

MikeInnes commented 5 years ago

The second example is a bug that would have come up anyway; I pushed a fix.

julia> gradient(x -> (real(x)+imag(x)*im)^2, 1+5im)
(2 - 10im,)

In general any issues that come up with this are going to come up if you take the gradient of real(f(x)); the question really comes down to whether this is the only reasonable interpretation when the output is complex (as opposed to whether that use case is in itself common or not, since if you wanted those other things you'd write them down differently).

For the conj(x) example, what would be your expected output?

antoine-levitt commented 5 years ago

I just don't think gradients of complex-valued functions should be defined at all. Assuming that you still want to do it, there are three cases, for scalar-valued functions:

Given a C^n to C function, how do you choose what to return between the last two? This change chose version 2. Given this, there are two options: either blacklist non-holomorphic operations (real, imag, conj, abs) on complex numbers, or accept the fact that a wrong result will be silently returned on functions that are non-holomorphic in any part of their computational graph, even if they are themselves holomorphic (the (re(z)+im(z))^2 above). Both these options seem to me unreasonable, especially when users can get holomorphic derivatives easily from the gradient of z -> real(f(z)) and the Cauchy-Riemann equations.

This is why I think the gradient of complex-valued functions should just be left undefined. Users can choose if they want to compute jacobians or holomorphic derivatives themselves.

In general any issues that come up with this are going to come up if you take the gradient of real(f(x));

Why? The gradient of a C^n to R function is well-defined, and follows by just seeing C as R^2.

antoine-levitt commented 5 years ago

Hmm, I just tried and:

julia> gradient(x -> abs2(x), 1+5im)
(2 - 10im,)

This is wrong, according to the only "reasonable" definition of gradient for a C^n to R function (nabla_f[i] = df/dRe(x_i) + i*df/dIm(x_i)). This is the definition that follows from seeing ComplexF64 as a struct of two reals, and the definition that is useful for e.g. gradient-descent optimization.
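A finite-difference check of that definition for abs2 (plain Python, not Zygote):

```python
# Under the definition grad = dL/dx + i*dL/dy, the gradient of abs2(z) = x^2+y^2
# at z = x+iy is 2x + 2iy = 2z, i.e. 2+10im at 1+5im (not its conjugate).
z, h = 1.0 + 5.0j, 1e-7
L = lambda w: abs(w) ** 2

grad = (L(z + h) - L(z)) / h + 1j * (L(z + 1j * h) - L(z)) / h
print(abs(grad - 2 * z) < 1e-4)
```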

MikeInnes commented 5 years ago

Ok, so we need to be careful about any non-holomorphic functions in DiffRules. Fixed in https://github.com/FluxML/Zygote.jl/commit/e1b00f78665486343d397a82590c27aa736e8bce.

I agree that neither of those options is good, but I don't yet see that they are inevitable. Let's zero in on:

a wrong result will be silently returned on functions that are non-holomorphic in any part of their computational graph, even if they are themselves holomorphic (the (re(z)+im(z))^2 above).

I'd like to see an example, work through any issues with our current gradient definitions and see that there's a fundamental conflict. I don't currently see how this can be the case: there's only one gradient for a holomorphic function, and if Zygote doesn't get that right then it's a bug.

Krastanov commented 5 years ago

Related to the just-submitted fix in DiffRules: is it really possible to approach this with the suggested blacklist? Isn't a whitelist a better solution, since most functions will not be holomorphic?

To the more general question: I would be eager to provide examples, but as @antoine-levitt alluded to, currently Zygote does not seem to have an official stance on the definition of the derivative of a C^n -> C function. One definition is "a 2n by 2 Jacobian of real values" (the only fully general one I can think of). Other definitions have to deal with whether the function is holomorphic or not.

In particular, here are some confusing results on current master:

All of these should be 2x2 real matrices in my opinion:

julia> gradient(z -> real(z), 10+20im)
(1 + 0im,)

julia> gradient(z -> imag(z), 10+20im) # I do not understand this result
(0 + 1im,)

julia> gradient(z -> real(z)+imag(z), 10+20im) # this seems definitely wrong
(1 + 1im,)

Compare these two, which are supposedly the same function, but one of them is explicitly holomorphic

julia> gradient(z -> z, 10+20im) # makes sense if you decide "Zygote derivative" means "complex derivative for holomorphic functions"
(1,)

julia> gradient(z -> real(z)+imag(z), 10+20im) # should this be considered a holomorphic function?
(1 + 1im,)

Krastanov commented 5 years ago

I'd like to see an example, work through any issues with our current gradient definitions and see that there's a fundamental conflict. I don't currently see how this can be the case: there's only one gradient for a holomorphic function, and if Zygote doesn't get that right then it's a bug.

I think I disagree with this statement. There are derivative-like operations that make sense on mathematical sets with nice structure (e.g. Lie derivatives or directional derivatives in differential geometry; complex derivatives of holomorphic functions in complex analysis). These derivative-like operations share the important property that they are basis-independent. When implemented on a computer (unless we are using symbolic algebra systems), we usually choose a basis (e.g. x+iy or r*exp(iφ) for complex analysis). The moment we have chosen a basis, all the nice basis-independent entities can be expressed numerically, but we also lose all promises that the derivative-like operations continue to make sense.

The real numbers are an exception, because they are 1D, where these issues simply do not arise. For everything else we have to explicitly choose to interpret expressions in basis-independent form (e.g. holomorphic functions act like that) or in basis-dependent form (e.g. using real and imag and conj). There is no general way to know whether a particular basis-dependent expression is actually equivalent to a basis-independent one. It is in general undecidable whether a mixture of imag/real/conj/abs/angle ends up being a holomorphic function.

Edit: The use of the word "basis" above was mathematically imprecise / abusive. The correct term would be a "coordinate patch" or something similar.

MikeInnes commented 5 years ago

The cases you've listed are actually the easy ones: if the function output l is real, then the gradient of x+yi is perfectly well defined as the pair dl/dx, dl/dy (which we can also express as a complex number). The Jacobian would be trivial here (the second column is zero). This is also identical to treating Complex as if it were a struct of two reals; being that way is a design constraint, since Zygote mechanically composes adjoint operations for machine numbers and isn't designed to handle arbitrarily abstract notions of a derivative. I don't think the real-valued-function case is at issue here, only the complex-valued one.

Zygote's current position on complex gradients is that gradient is actually just sugar for

y, pullback = forward(f, x)
dx = pullback(1)

This naturally generalises to complex numbers; it's actually disabling them that involves a special case. You can also think of this as being the gradient for real(f(x)), or the first column of the Jacobian; to get the other column we'd use pullback(im). Antoine's concerns notwithstanding, I don't think the issue here is whether this is a valid generalisation, but whether it's potentially confusing given that users may expect a different generalisation.

The only other option we have, AFAICT, is to return a Jacobian. But I think this has to be an explicit, separate function. Reasoning: writing gradient(x -> f(x) + 0im, x) shouldn't totally change the structure that gradient returns; it should either be essentially equivalent to the f(x) case or error out. So I think those are our two options. I'm not that opposed to making people write real(f(x)) if they want the current behaviour, but I'd also like to properly understand the concerns around doing that by default.
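The two columns can be made concrete with finite differences (my own plain-Python sketch; whether pullback(im) corresponds to the gradient of Im(f) or to its negative depends on the adjoint convention, so the labels below are only suggestive):

```python
import cmath

def grad_of(g, z, h=1e-7):
    """Gradient dg/dx + i*dg/dy of a real-valued g, by finite differences."""
    return (g(z + h) - g(z)) / h + 1j * (g(z + 1j * h) - g(z)) / h

f = lambda w: cmath.exp(w)
z = 0.3 + 0.7j

col1 = grad_of(lambda w: f(w).real, z)   # the role played by pullback(1)
col2 = grad_of(lambda w: f(w).imag, z)   # the role played by pullback(im)

# For the holomorphic exp, both columns carry the same information:
# col1 = conj(f'(z)) and col2 = i*conj(f'(z)).
print(abs(col1 - cmath.exp(z).conjugate()) < 1e-4,
      abs(col2 - 1j * cmath.exp(z).conjugate()) < 1e-4)
```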

Krastanov commented 5 years ago

Thanks, that clarified some things.

The gradient as defined for a C -> R function is confusing to me, but it is indeed consistent and well defined. I guess it is useful for gradient descent, which is probably the main use case.

The derivative of explicitly holomorphic C -> C functions currently works as expected in complex analysis. It is not an obvious generalization of the C -> R case and it is useless for gradient descent (as you can not order complex numbers), but as long as it does not conflict with other things, I guess it is a nice thing to have.

The intermediate regime between the two (functions that are holomorphic, but the way they are written does not make that explicit) is very confusing for me... Could you please help me reason through this example (I think it is a bug):

# this is the z->z function written in polar coordinates
julia> gradient(z -> sqrt(abs2(z))*exp(1im*angle(z)), 10+20im)
(0.19999999999999996 + 0.4000000000000001im,)

EDIT: there were a couple more examples, that were not actually as problematic. I erased them in order to not distract.
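A finite-difference check (plain Python) confirms that the polar expression is just the identity function, with Wirtinger derivatives df/dz = 1 and df/dzbar = 0, so any answer other than 1 is indeed a bug:

```python
import cmath

# The polar form of z -> z from the example above.
f = lambda w: cmath.sqrt(abs(w) ** 2) * cmath.exp(1j * cmath.phase(w))

z, h = 10.0 + 20.0j, 1e-6
fx = (f(z + h) - f(z)) / h          # real-axis directional derivative
fy = (f(z + 1j * h) - f(z)) / h     # imaginary-axis directional derivative

# Wirtinger derivatives from the two directional derivatives.
df_dz = (fx - 1j * fy) / 2
df_dzbar = (fx + 1j * fy) / 2

print(abs(df_dz - 1) < 1e-4, abs(df_dzbar) < 1e-4)   # holomorphic, f' = 1
```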

Krastanov commented 5 years ago

To rephrase: for C -> C functions, Zygote currently does two different things, and it is not obvious how it chooses which one to do.

Either it returns pullback(1), which does not seem useful on its own (as you already mentioned, pullback(im) is also necessary); maybe having gradient raise an error would make more sense in those cases.

Or it returns the holomorphic derivative, which is useful for holomorphic functions. This also happens to be the first Wirtinger derivative, and it happens to be a linear combination of pullback(1) and pullback(im).

In the language of this post, currently Zygote switches between the "realistic" and the "native" view of the problem, and it is not clear how it makes that choice: http://www.ekinakyurek.com/complex-derivatives-wirtinger/

Krastanov commented 5 years ago

Maybe gradient of C->C should always raise an error, and the error message should say "use complex_jacobian or wirtinger_derivatives".

complex_jacobian = pullback(1), pullback(im)

wirtinger_derivatives = (pullback(1) - im*pullback(im))/2, (pullback(1) + im*pullback(im))/2

antoine-levitt commented 5 years ago

I'd like to see an example, work through any issues with our current gradient definitions and see that there's a fundamental conflict. I don't currently see how this can be the case: there's only one gradient for a holomorphic function, and if Zygote doesn't get that right then it's a bug.

Hm, thinking a bit more, I see I was straw-manning against something that was proposed for ForwardDiff, and so I was confused: if done correctly, Zygote will actually do the right thing: (implicitly) propagate the 2x2 Jacobian inside the computational graph, and only assume holomorphy for the last step. This is actually quite clever. OK then, this should actually work: it's just that for non-holomorphic functions it will not return the full derivative information (that would be the 2x2 Jacobian). This has the potential to be very confusing and should be documented appropriately at some point, but at least it's consistent. Now it's just a matter of making sure it works correctly.

MikeInnes commented 5 years ago

Ok, great, I'm glad we're all on roughly the same page – this has certainly been an enlightening discussion for me. @Krastanov yes, that was indeed a bug which is fixed now. I'll also cross reference that ForwardDiff issue: https://github.com/JuliaDiff/ForwardDiff.jl/issues/157.

To summarise: we have two derivatives we might be interested in, the Wirtinger derivatives and the "Jacobi" ones (i.e. columns of the equivalent R^2 -> R^2 Jacobian; what Zygote natively operates on). Function output can be:

Hopefully it's clear that Zygote isn't switching behaviour at any point here; it just happens that the current Jacobi derivative lines up with more standard things where they exist. Ideally, where it doesn't (complex non-holomorphic), we'd just throw an error, but of course we can't tell that automatically.

antoine-levitt commented 5 years ago

One thing that might be confusing is that the gradient (as defined here, ie the gradient of the real part of f) of a holomorphic function is the conjugate of the usual derivative. As long as this is well-documented somewhere it's fine.

MikeInnes commented 5 years ago

Yes, @ssfrr just raised that point on slack as well. It's an extra complication. I think this is probably in some sense consistent with the fact that Zygote produces adjoints, but we should think through it carefully.

Hope he won't mind me reposting his thoughts:

  • C^n->R functions have a gradient but no derivative
  • C^n->C^m holomorphic functions have a derivative but no gradient (even for m==1)
  • C->C nonholomorphic functions have neither a gradient nor a complex derivative, but you can capture their local linearization with a pair of complex derivatives (the Wirtinger approach) or a 2x2 real jacobian
  • I need to think more about the C^n->C^m nonholomorphic case

I’ve been thinking of proposing separate derivative and gradient functions that would capture these cases and throw the appropriate errors (I think you can catch an attempted derivative on nonholomorphic functions by checking the Cauchy-Riemann equations at the end). The nice thing is that derivative generalizes nicely between real/complex/scalar/vector (see my post from yesterday with the 3 cases)

I think it makes some sense for gradient to expose what Zygote is doing in a fairly raw way, and have things like jacobian and derivative return more mathematically interesting things on top of that. Including taking adjoints where necessary, and so on.

GiggleLiu commented 5 years ago

non-holomorphic C^n -> C. There the derivative is a 2n x 2 jacobian.

Complex non-holomorphic. Trickier: the holomorphic derivative is not allowed, and Jacobi and Wirtinger diverge.

Normally, what we care about is the backpropagation of adjoints, in which case the loss must be real. Then there is no intrinsic difficulty.

Consider a function z = f(x), where x ~ C^n and z ~ C^m. Given adjoint(z) := 2(dL/dz*), we know dL/dz = adjoint(z)*/2 immediately. With both dL/dz and dL/dz*, it is easy to obtain adjoint(x) = 2(dL/dx*). I wonder why it is necessary to consider an ambitious C^n -> C autodiff?

BTW: Is conj(x) supported now? (I cannot precompile the master branch right now...)

Can't agree more on separating derivative and gradient.

@antoine-levitt @MikeInnes

MikeInnes commented 5 years ago

Yes, to be clear the core technical problem is solved, and so gradients of Real functions are done. I don't think anything is going to change there. conj works now as well as a bunch of other things. It's more of an interface issue about how/if we expose other kinds of derivatives.

ssfrr commented 5 years ago

From @MikeInnes

I think it makes some sense for gradient to expose what Zygote is doing in a fairly raw way

I'd think gradient would throw an error for functions that don't return Real. I'm not sure what a gradient means for functions returning Complex, in the same way that it's not meaningful for functions with nonscalar output.

MikeInnes commented 5 years ago

Yes, I've disabled that in https://github.com/FluxML/Zygote.jl/commit/d82581f21823daff5afc81b43d24ad91b6c75d51 (that statement wasn't meant as an argument for or against that choice). Although I've specifically disabled Complex, rather than all non-Real, since things like measurements or TaylorSeries can still be useful.

GiggleLiu commented 5 years ago

Great, I quickly checked several cases (master branch), and the gradients turn out to be correct. Can't wait to try out the new features. :+1:

Krastanov commented 5 years ago

@MikeInnes, thank you, this is pretty amazing. And yes, your last post made it very clear that "it's clear that Zygote isn't switching behaviour at any point here".

MikeInnes commented 5 years ago

Ok, we now have docs up that clarify these things: http://fluxml.ai/Zygote.jl/dev/complex/

Thanks again for the discussion, and please do report any bugs!