FluxML / Zygote.jl

https://fluxml.ai/Zygote.jl/

Inconsistent Behaviour in Broadcasting #1036

Open willtebbutt opened 3 years ago

willtebbutt commented 3 years ago

As of version 0.6.15, the following:

julia> Zygote.gradient(^, 0.0, 0.9)
(0.0, 0.0)

julia> Zygote.gradient((x, y) -> sum(x .^ y), zeros(3), fill(0.9, 3))
([0.0, 0.0, 0.0], [0.0, 0.0, 0.0])

julia> f(x, y) = x ^ y;

julia> Zygote.gradient((x, y) -> sum(f.(x, y)), zeros(3), fill(0.9, 3))
([Inf, Inf, Inf], [NaN, NaN, NaN])

The same code in version 0.6.14:

julia> Zygote.gradient(^, 0.0, 0.9)
(0.0, 0.0)

julia> Zygote.gradient((x, y) -> sum(x .^ y), zeros(3), fill(0.9, 3))
([0.0, 0.0, 0.0], [0.0, 0.0, 0.0])

julia> f(x, y) = x ^ y;

julia> Zygote.gradient((x, y) -> sum(f.(x, y)), zeros(3), fill(0.9, 3))
([0.0, 0.0, 0.0], [0.0, 0.0, 0.0])

The crucial difference is in the last example of each block: on 0.6.15, broadcasting f returns Inf/NaN gradients instead of zeros.

If you do this with ForwardDiff, you get

julia> ForwardDiff.gradient(x -> x[1]^x[2], [0.0, 0.9])
2-element Vector{Float64}:
  Inf
 NaN

so this smells to me like ForwardDiff is being invoked for f, whereas individual chain rules are used for sum(x .^ y) because Zygote un-fuses broadcasting where it can (it can't un-fuse f).

@oxinabox @mcabbott @DhairyaLGandhi I'm assuming that this is related to our recent efficiency upgrades?

This is breaking the build in KernelFunctions.jl. We can of course allow the tests to fail for the time being, but it essentially means that we can't use Zygote with one of our kernels.

willtebbutt commented 3 years ago

I guess the culprit is the discrepancy between the version of this rule in DiffRules and the one in ChainRules.

In the latter we adopt a different convention from the former.
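
For reference, one way to compare the two rule sources directly at the problematic point (a sketch; the exact numbers depend on the installed DiffRules / ChainRules versions):

using DiffRules, ChainRules

# DiffRules stores the partials of ^ as expressions in the symbols :x and :y;
# bind globals of the same name and evaluate them at the problematic point.
dx_expr, dy_expr = DiffRules.diffrule(:Base, :^, :x, :y)
x, y = 0.0, 0.9
eval(dx_expr), eval(dy_expr)  # the DiffRules convention (Inf / NaN style values)

# ChainRules' rrule for ^ applies its own convention at the same point.
_, pb = ChainRules.rrule(^, x, y)
pb(1.0)  # (NoTangent(), x̄, ȳ) under the ChainRules convention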

DhairyaLGandhi commented 3 years ago

Hmm, using DiffRules seems to be more "correct". Thoughts?

willtebbutt commented 3 years ago

Unclear to me -- the "subgradient" convention adopted in ChainRules works perfectly in the KernelFunctions use-case. Also, it's what we do for lots of other things (ReLUs, abs, etc). My guess would be it's a useful thing to do?

willtebbutt commented 3 years ago

Hmm, using DiffRules seems to be more "correct". Thoughts?

Hmm, I interpreted this as meaning that you believe the convention adopted in DiffRules is more appropriate than the one in ChainRules. Was I correct in this interpretation?

oxinabox commented 3 years ago

Returning NaN or Inf for this is just causing pain with little gain, for no real reason. 0.0 is fine.

Consider x^y for x=0 and y=0.9. The derivative wrt x from below is not defined, since the function errors if we make a small negative change, so we don't care about that case. The derivative wrt x from above is zero. Thus we might as well say the derivative is zero.

The derivative wrt y from below is Inf; the derivative wrt y from above is zero. I would argue that around if a <= b etc. we are largely free to choose which branch we want for a == b (there need to be some provisos on that statement), and here choosing zero is the choice that leads to more useful behaviour. No one wants to work with Inf.

This also has the attractive property that if you enumerate all points where the derivative is zero, i.e. all possible non-boundary local minima, this point gets included. So if you then check those for the smallest, you will find the global minimum. And if you hit this point during gradient descent, you do in fact want to stop, so a derivative of 0 is useful here also.

Which is where we get the subgradient convention (which I will write up one day; it comes from a discussion I had at a conference). We say: "If the derivative is not defined, but the subgradient contains zero, then just say the derivative is 0." Or, more directly: if the derivatives taken from all directions of approach form an interval that encloses 0, say the derivative is 0.
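
For a concrete instance of this convention, abs already works this way (a quick check; exact behaviour depends on the ChainRules version in use):

using Zygote

# abs has one-sided derivatives -1 and +1 at 0; the subgradient [-1, 1]
# contains 0, so the rule reports 0 at the kink rather than Inf or NaN.
Zygote.gradient(abs, 0.0)  # (0.0,)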

mcabbott commented 3 years ago

I think this thread urgently needs a graph:

[figure: x^0.9 and its gradient near x = 0]

As x -> 0 from above, the gradient does go to positive real infinity. Whether to return Inf is another question; you could argue that a large finite value, from say x = eps(), might be nicer.
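
For reference, a quick numeric check of the analytic partial (nothing package-specific here):

# d/dx x^0.9 = 0.9 * x^(-0.1), which grows without bound as x -> 0 from above.
[0.9 * x^(-0.1) for x in (1e-2, 1e-8, 1e-16, 1e-32)]  # ≈ [1.43, 5.68, 35.8, 1426.0]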

A few more cases to check, while I was at it. We should be sure to get all of these right.

[figure: panels of x^y and its gradients for several values of y, including y = 1 and y = 3.14]

antoine-levitt commented 3 years ago

How did you obtain these plots? x^y for y non-integer and x negative is impossible to define in a consistent way

antoine-levitt commented 3 years ago

(eg see your 3.14 plot, which does not look like x^3)

mcabbott commented 3 years ago

They allow complex numbers; the dashed lines are the imaginary part when it is nonzero. To get one answer you have to pick a branch, and Julia does. x^(1//3) takes one branch; cbrt(x) takes a different one, which is real at negative x.

antoine-levitt commented 3 years ago

It does not pick a branch? (nor should it)

julia> (-2.0)^(1//3)
ERROR: DomainError with -2.0:
Exponentiation yielding a complex result requires a complex argument.
Replace x^y with (x+0im)^y, Complex(x)^y, or similar.
antoine-levitt commented 3 years ago

Ah, you gave it complex numbers, I see. This is different then. In that case the choice is inferred from the fact that zeros are signed:

julia> (-2.0+0.0im)^(1//3)
0.6299605249474366 + 1.0911236359717214im

julia> (-2.0-0.0im)^(1//3)
0.6299605249474366 - 1.0911236359717214im
mcabbott commented 3 years ago

Yes, there's a cut on the negative real axis in Julia's conventions. The 3rd choice is:

julia> cbrt(-2.0)
-1.2599210498948732

julia> cbrt(-2.0+0.0im)
ERROR: MethodError: no method matching cbrt(::ComplexF64)

julia> (-2.0+0.0im)^(1//3) |> abs
1.2599210498948732

julia> (-2.0-0.0im)^(1//3) |> abs
1.2599210498948732
antoine-levitt commented 3 years ago

Here's my opinion of what should happen here

The crucial difference between case 2 and case 3 is that it is permissible to use one-sided derivatives at the boundary of the domain of definition, where the function is undefined on the other side, but not where the function is discontinuous. Does that make sense?
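
To illustrate the boundary case with something simpler than ^: sqrt is only defined for x >= 0, so the one-sided derivative is the natural thing to report at 0 (a quick check; the exact value depends on the rule definitions):

using Zygote

# At the boundary of the domain, the one-sided derivative is the natural answer;
# for sqrt at 0 that one-sided slope is +Inf.
Zygote.gradient(sqrt, 0.0)  # (Inf,)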

mcabbott commented 3 years ago

For x=0, return the one-sided derivative: 0 for y >= 1, +Inf for y < 1

Except at y=1, where it's 1 -- the 2nd panel of my graph. And at y=0 it's zero. And negative y is allowed too.

But what I'm not sure about is whether returning Inf is a good idea. Maybe it would be better to regularise everything: for x=0 (in this one-sided case), always give the gradients for, say, x = eps(). This will change smoothly between the different cases.
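
A rough sketch of what such a regularised rule could look like (purely illustrative, not a proposed implementation; the helper name and the use of eps() are made up here):

# Hypothetical regulariser: evaluate the analytic x-partial of x^y at eps()
# instead of exactly 0, so the reported slope is large but finite.
function regularised_pow_dx(x::Float64, y::Float64)
    xr = (x == 0.0 && y < 1) ? eps(Float64) : x  # only nudge the problematic case
    return y * xr^(y - 1)
end

regularised_pow_dx(0.0, 0.9)  # ≈ 33.08 rather than Inf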

antoine-levitt commented 3 years ago
julia> (-1.0)^(1.0)
-1.0

Oooh, that's evil. Then yeah, I agree with you: when y is an integer (positive or negative), the rule should be that pow(x::Real, y::Real) follows pow(x::Real, y::Int). The derivative wrt y when y is an integer and x <= 0 should be NaN.

I don't see the point of regularizing: the one-sided derivative is infinite.

mcabbott commented 3 years ago

crucial difference between case 2 and case 3 is that it is permissible to use one-sided derivatives when at the boundary

Yes, I think I agree. When at a boundary, we implicitly assume that this is the boundary of the domain of your parameter too, so of course you care about the one-sided derivative.

When at a singularity in the interior there's no obvious answer mathematically, although sometimes, as with 1/x and 1/x^2 at x=0, we should blithely declare the two one-sided infinities equal and return that. For the cut on the -ve real axis I'd need to think a little more; it might be NaN.

the point of regularizing: the one-sided derivative is infinite.

The point would be to try to save people from NaNs downstream. What you are ultimately doing (in my head at least) is gradient descent, and if your parameter is bounded to [0,1], sometimes you need an arrow telling you to move away from the wall. A strictly infinite arrow is much more likely to break things than a large finite arrow -- and since you're doing floating point, you do accept some fuzziness: x === +0.0 is (in this view) just a representation of a very small number, never exactly at the boundary.

That said I'm far from 100% sure that this wouldn't have weird consequences elsewhere. It just seems worth some thought, before jumping from an infinite limit in the pure mathematics to the concrete Inf::Float64.

antoine-levitt commented 3 years ago

A strictly infinite arrow is much more likely to break things than a large finite arrow -- and since you're doing floating point, you do accept some fuzziness, x === +0.0 is (in this view) just a representation of a very small number, but never exactly at the boundary.

With what you suggest, the arrow would be 1/eps(). If you're doing things like that, it will break; I would imagine you want to make sure it breaks fast, since that's easier to debug.

mcabbott commented 3 years ago

The regulated arrow would depend on the function. So it goes smoothly from "infinite" to 0 for powers near 1:

julia> ForwardDiff.derivative(x -> x^0.9, 0 + eps())
33.08251262391458

julia> ForwardDiff.derivative(x -> x^0.999, 0 + eps())
1.0356643999187012

julia> ForwardDiff.derivative(x -> x^1.001, 0 + eps())
0.9655627827687262

julia> ForwardDiff.derivative(x -> x^1.1, 0 + eps())
0.029925175613304173

That was my thinking in writing x + eps() as the regulator. But perhaps this is far too big, and eps(0) too small... There may well be much better choices than perturbing x.

it will break

How obvious is this? If I have some parameter x in [0,1], and choose to re-parameterise to work with x2 = x^2 say, also in [0,1], then I will probably have many f(sqrt(x2)). It would be nice if these didn't cause problems. I agree it's entirely possible that attempting to regulate this sqrt's gradient may cause far more problems than it solves.
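
To make the sqrt worry concrete (a sketch; the exact output depends on the rule definitions, but the 0 * Inf pattern is the point):

using Zygote

# With x2 = x^2 and g(x) = x^2, the composite x2 -> g(sqrt(x2)) is just the
# identity on [0, 1], but the chain rule meets 0 * Inf at x2 = 0.
g(x) = x^2
Zygote.gradient(x2 -> g(sqrt(x2)), 0.0)  # likely (NaN,) rather than (1.0,)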

mcabbott commented 3 years ago

Behaviour with ChainRules v1.11.4 now matches ForwardDiff, and the graph, for the gradient with respect to x:

julia> Zygote.gradient(^, 0.0, 0.9)
(Inf, 0.0)

julia> Zygote.gradient((x, y) -> sum(x .^ y), zeros(3), fill(0.9, 3))
([Inf, Inf, Inf], [0.0, 0.0, 0.0])

julia> f(x, y) = @show(x) ^ y;

julia> Zygote.gradient((x, y) -> sum(f.(x, y)), zeros(3), fill(0.9, 3))
x = Dual{Nothing}(0.0,1.0,0.0)
x = Dual{Nothing}(0.0,1.0,0.0)
x = Dual{Nothing}(0.0,1.0,0.0)
([Inf, Inf, Inf], [NaN, NaN, NaN])

ForwardDiff still has NaN for the gradient with respect to the power, where I think it should have zero.

gaspardbb commented 1 year ago

Hi, I guess this is a related issue, but let me know if I should open a new issue.

I discovered inconsistent behavior when using abs and broadcasting. Here is an MRE:

using Zygote

x = [1.0, -1.0]
w = [1.0, 0.0]

g1 = gradient(x) do x
    sum(abs.(x .- w))
end  # ([1.0, -1.0],)

g2 = gradient(x) do x
    abs(x[1] - w[1]) + abs(x[2] - w[2])
end  # ([0.0, -1.0],)

Basically, both are correct in the sense that both provide a valid subgradient for abs at 0: g1 takes x -> 1 and g2 takes the more obvious x -> 0. But it took me a while to debug, and I find the fact that the results are inconsistent really bothersome.

Based on the first post in this thread, the issue might come from different definitions of the subgradient (DiffRules vs ChainRules), but I don't know enough of Zygote's internals to be sure.
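
One quick way to see the two conventions side by side (a sketch; exact values depend on the installed versions):

using Zygote, ForwardDiff

# The scalar call goes through the ChainRules-style rule (sign(0.0) == 0.0),
# while Zygote's generic broadcast falls back to forward mode, whose rule for
# abs picks +1 at 0 -- hence the mismatch between g1 and g2 above.
Zygote.gradient(abs, 0.0)          # (0.0,)
ForwardDiff.derivative(abs, 0.0)   # 1.0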

Let me know if I can provide more info!