willtebbutt opened this issue 3 years ago
I guess the culprit is the discrepancy between the versions of this rule in DiffRules and ChainRules. In the latter we adopt a different convention from the former.
Hmm, using DiffRules seems to be more "correct". Thoughts?
Unclear to me -- the "subgradient" convention adopted in ChainRules works perfectly in the KernelFunctions use-case. Also, it's what we do for lots of other things (ReLUs, abs, etc). My guess would be it's a useful thing to do?
Hmm, using DiffRules seems to be more "correct". Thoughts?
Hmm, I interpreted this as meaning that you believe the convention adopted in DiffRules is more appropriate than the one in ChainRules. Was I correct in this interpretation?
Returning NaN or Inf for this is just causing pain with little gain, for no real reason. 0.0 is fine.
Considering x^y for x=0 and y=0.9:
The derivative wrt x from below is not defined, since the function errors if we make a small negative change, so we don't care about that case.
The derivative wrt x from above is zero.
Thus we might as well say the derivative is zero.
The derivative wrt y from below is Inf.
The derivative wrt y from above is zero.
I would argue that around if a<=b etc. we are largely free to choose which branch we want for a==b (there need to be some provisos on that statement). And here choosing zero is the choice that leads to more useful behaviour. No one wants to work with Inf.
This also has the attractive property that if you enumerate all points where the derivative is zero, i.e. all possible non-boundary local minima, this point gets included. So if you then check those for the smallest, you will find the global minimum. And if you hit this point during gradient descent, you do in fact want to stop, so a derivative of 0 is useful here also.
Which is where we get the subgradient convention (which I will write up one day; it comes from a discussion I had at a conference). We say: "If the derivative is not defined, but the subgradient contains zero, then just say the derivative is 0." Or, more directly: if taking the derivative from all directions of approach forms an interval that encloses 0, say the derivative is 0.
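A minimal sketch of what that convention looks like for abs (a hypothetical helper written for illustration, not the actual rule definition in either package):
julia> subgrad_abs(x) = iszero(x) ? zero(x) : sign(x);  # subgradient of abs at 0 is [-1, 1], which contains 0, so return 0

julia> subgrad_abs.((-2.0, 0.0, 3.0))
(-1.0, 0.0, 1.0)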
I think this thread urgently needs a graph:
As x -> 0 from above, the gradient does go to positive real infinity. Whether to return Inf is another question; you could argue that a large finite value, from say x = eps(), might be nicer.
A few more cases to check, while I was at it. We should be sure to get all of these right.
How did you obtain these plots? x^y for y non-integer and x negative is impossible to define in a consistent way (e.g. see your 3.14 plot, which does not look like x^3).
They allow complex numbers; the dashed lines are the imaginary part when this is nonzero. To get one answer you have to pick a branch, and Julia does. x^(1//3) takes one branch, cbrt(x) takes a different one, which is real at negative x.
It does not pick a branch? (nor should it)
julia> (-2.0)^(1//3)
ERROR: DomainError with -2.0:
Exponentiation yielding a complex result requires a complex argument.
Replace x^y with (x+0im)^y, Complex(x)^y, or similar.
Ah, you gave it complex numbers, I see. This is different then. In that case the choice is inferred from the fact that zeros are signed:
julia> (-2.0+0.0im)^(1//3)
0.6299605249474366 + 1.0911236359717214im
julia> (-2.0-0.0im)^(1//3)
0.6299605249474366 - 1.0911236359717214im
Yes, there's a cut on the negative real axis in Julia's conventions. The 3rd choice is:
julia> cbrt(-2.0)
-1.2599210498948732
julia> cbrt(-2.0+0.0im)
ERROR: MethodError: no method matching cbrt(::ComplexF64)
julia> (-2.0+0.0im)^(1//3) |> abs
1.2599210498948732
julia> (-2.0-0.0im)^(1//3) |> abs
1.2599210498948732
Here's my opinion of what should happen here:
The crucial difference between case 2 and case 3 is that it is permissible to use one-sided derivatives when at the boundary of a definition where the function is undefined at the other side, but it's not when the function is discontinuous. Does that make sense?
For x=0, return the one-sided derivative: 0 for y >= 1, +Inf for y < 1
Except at y=1, where it's 1 -- the 2nd panel of my graph. And at y=0 it's zero. And negative y is allowed too.
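A quick numerical check of those one-sided values, assuming ForwardDiff is loaded and evaluating just above x = 0 (values in the comments are approximate):
julia> using ForwardDiff

julia> ForwardDiff.derivative(x -> x^0.5, 1e-12)  # ≈ 5.0e5, blowing up towards +Inf as x -> 0
julia> ForwardDiff.derivative(x -> x^1.0, 1e-12)  # 1.0
julia> ForwardDiff.derivative(x -> x^1.5, 1e-12)  # ≈ 1.5e-6, going to 0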
But what I'm not sure about is whether returning Inf is a good idea. Maybe it would be better to regularise everything: for x=0 (in this one-sided case) always give the gradients for, say, x = eps(). This will change smoothly between the different cases.
julia> (-1.0)^(1.0)
-1.0
Oooh that's evil. Then yeah, I agree with you: when y is an integer (positive or negative), the rule should be that pow(x::Real, y::Real) follows pow(x::Real, y::Int). The derivative wrt y when y is an integer and x <= 0 should be NaN.
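A sketch of what "follow the integer method" would mean for the x-gradient at negative x (dpow_dx is a hypothetical helper, not the actual rule):
julia> dpow_dx(x, y) = isinteger(y) ? y * x^(Int(y) - 1) : y * x^(y - 1);

julia> dpow_dx(-2.0, 3.0)   # d/dx x^3 = 3x^2, fine at negative x via the integer method
12.0

julia> dpow_dx(-2.0, -1.0)  # d/dx x^-1 = -x^-2
-0.25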
I don't see the point of regularizing: the one-sided derivative is infinite.
crucial difference between case 2 and case 3 is that it is permissible to use one-sided derivatives when at the boundary
Yes, I think I agree. When at a boundary, we implicitly assume that this is the boundary of the domain of your parameter too, so of course you care about the one-sided derivative.
When at a singularity in the interior there's no obvious answer mathematically. Although sometimes, like 1/x and 1/x^2 at x=0, we should blithely declare the two one-sided infinities equal and return that. The cut on the -ve real axis I'd need to think about a little more; it might be NaN.
the point of regularizing: the one-sided derivative is infinite.
The point would be to try to save people from NaNs downstream. What you are ultimately doing (in my head at least) is gradient descent, and if your parameter is bounded to [0,1] sometimes you need an arrow telling you to move away from the wall. A strictly infinite arrow is much more likely to break things than a large finite arrow -- and since you're doing floating point, you do accept some fuzziness, x === +0.0 is (in this view) just a representation of a very small number, but never exactly at the boundary.
That said, I'm far from 100% sure that this wouldn't have weird consequences elsewhere. It just seems worth some thought, before jumping from an infinite limit in the pure mathematics to the concrete Inf::Float64.
A strictly infinite arrow is much more likely to break things than a large finite arrow -- and since you're doing floating point, you do accept some fuzziness, x === +0.0 is (in this view) just a representation of a very small number, but never exactly at the boundary.
With what you suggest the arrow would be 1/eps(): if you're doing things like that, it will break, and I would imagine you want to make sure it breaks fast, since that's easier to debug.
The regulated arrow would depend on the function. So it goes smoothly from "infinite" to 0 for powers near 1:
julia> ForwardDiff.derivative(x -> x^0.9, 0 + eps())
33.08251262391458
julia> ForwardDiff.derivative(x -> x^0.999, 0 + eps())
1.0356643999187012
julia> ForwardDiff.derivative(x -> x^1.001, 0 + eps())
0.9655627827687262
julia> ForwardDiff.derivative(x -> x^1.1, 0 + eps())
0.029925175613304173
That was my thinking in writing x + eps() as the regulator. But perhaps this is far too big, and eps(0) too small... There may well be much better choices than perturbing x.
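For reference, the two scales being compared (written as eps(0.0) below):
julia> eps(), eps(0.0)
(2.220446049250313e-16, 5.0e-324)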
it will break
How obvious is this? If I have some parameter x in [0,1], and choose to re-parameterise to work with x2 = x^2 say, also in [0,1], then I will probably have many f(sqrt(x2)). It would be nice if these didn't cause problems. I agree it's entirely possible that attempting to regulate this sqrt's gradient may cause far more problems than it solves.
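A minimal illustration of the "NaNs downstream" failure mode at the boundary; the outputs are what I'd expect from the usual sqrt rule, which divides the incoming cotangent by 2*sqrt(x):
julia> using Zygote

julia> gradient(x2 -> 3 * sqrt(x2), 0.0)    # infinite arrow pointing away from the wall
(Inf,)

julia> gradient(x2 -> 0.0 * sqrt(x2), 0.0)  # 0 * Inf poisons the gradient
(NaN,)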
Behaviour with ChainRules v1.11.4 now matches ForwardDiff, and the graph, for the gradient with respect to x:
julia> Zygote.gradient(^, 0.0, 0.9)
(Inf, 0.0)
julia> Zygote.gradient((x, y) -> sum(x .^ y), zeros(3), fill(0.9, 3))
([Inf, Inf, Inf], [0.0, 0.0, 0.0])
julia> f(x, y) = @show(x) ^ y;
julia> Zygote.gradient((x, y) -> sum(f.(x, y)), zeros(3), fill(0.9, 3))
x = Dual{Nothing}(0.0,1.0,0.0)
x = Dual{Nothing}(0.0,1.0,0.0)
x = Dual{Nothing}(0.0,1.0,0.0)
([Inf, Inf, Inf], [NaN, NaN, NaN])
ForwardDiff still has NaN for the p gradient, where I think it should have zero.
Hi, I guess this is a related issue, but let me know if I should open a new issue.
I discovered inconsistent behavior when using abs and broadcasting. Here is an MRE:
using Zygote
x = [1.0, -1.0]
w = [1.0, 0.0]
g1 = gradient(x) do x
    sum(abs.(x .- w))
end # ([1.0, -1.0],)
g2 = gradient(x) do x
    abs(x[1] - w[1]) + abs(x[2] - w[2])
end # ([0.0, -1.0],)
Basically, both are correct in the sense that both provide a valid subgradient for abs at 0: g1 takes x -> 1 and g2 takes the more obvious x -> 0. But it took me a while to debug, and I find the fact that the results are inconsistent really bothersome.
Based on the first post in this thread, the issue might come from different definitions for the subgradient (DiffRules vs ChainRules), but I don't know enough of Zygote's internals to be sure.
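A small illustration of the two conventions in question (hypothetical stand-ins, not the actual definitions in either package):
julia> deriv_abs_sign(x) = sign(x);                           # subgradient-zero convention: 0 at x == 0
julia> deriv_abs_signbit(x) = signbit(x) ? -one(x) : one(x);  # always picks ±1, so +1 at x == 0

julia> deriv_abs_sign(0.0), deriv_abs_signbit(0.0)
(0.0, 1.0)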
Let me know if I can provide more info!
As of version 0.6.15, the following:
The same code in version 0.6.14:
the crucial difference being in the last case for each.
If you do this with ForwardDiff, you get
so this smells to me like ForwardDiff is being invoked for f, whereas individual chain rules are used for sum(x .^ y), because Zygote un-fuses broadcasting where it can (it can't un-fuse f).
@oxinabox @mcabbott @DhairyaLGandhi I'm assuming that this is related to our recent efficiency upgrades?
This is breaking the build in KernelFunctions.jl. We can of course allow the tests to fail for the time being, but it essentially means that we can't use Zygote with one of our kernels.