See #57
Reopening since #59 didn't add local tracers.
Operators with local behavior:

- `min` and `max`: only one of the arguments matters at any given point => `min(x, bigy) = x`
- `mod` and `rem`: depending on their relative sizes and signs, one of the arguments may not matter => `mod(x, bigy) = x`
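A minimal sketch of how a primal-carrying local tracer could handle these (toy type and field names, not the package API):

```julia
# Sketch: a local tracer carries a primal value next to its input set,
# so binary `min`/`max` can keep only the inputs of the argument that "wins".
struct MyLocalTracer{T}
    primal::T
    inputs::Set{Int}
end

function Base.min(x::MyLocalTracer, y::MyLocalTracer)
    # Locally, only the smaller argument influences the output;
    # on a tie, we keep both input sets to stay conservative.
    inputs = x.primal < y.primal ? x.inputs :
             y.primal < x.primal ? y.inputs :
             union(x.inputs, y.inputs)
    return MyLocalTracer(min(x.primal, y.primal), inputs)
end

# `max`, `mod` and `rem` would follow the same pattern, comparing primal
# values to decide which argument's inputs to keep.
```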
Once we have primal values, we will also need to add new operators, mostly those that work with `Bool`: `isequal`, `isless`, `isfinite`, `isinf`, `isnan`, possibly their friends `==`, `<`, `<=` and co (check the fallback structure for those again).

In a potential NNlib extension, we could add global and local classification of activation functions. Here are the ones with local sparsity:

- `hardsigmoid`
- `hardtanh`
- `relu`
- `relu6`
- `softshrink`
- `hardswish`
- `trelu`
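A sketch of what such a global/local classification could look like for `relu` (function names are illustrative, not the package API; `relu` is redefined locally instead of importing NNlib):

```julia
# relu(x) = max(0, x), so its derivative is zero on the negative half-line.
relu(x) = max(zero(x), x)

# Globally, the first derivative of relu is not identically zero...
nonzero_der1_global(::typeof(relu)) = true
# ...but locally it vanishes for negative inputs. At the kink x == 0 we stay
# conservative and report a possibly nonzero derivative.
nonzero_der1_local(::typeof(relu), x) = x >= zero(x)
```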
Oh and of course we need to add `ifelse`:

```julia
function Base.ifelse(b::Bool, xt::GlobalTracer, yt::GlobalTracer)
    return GlobalTracer(union(xt.inputs, yt.inputs))
end
function Base.ifelse(b::Bool, xt::LocalTracer, yt::LocalTracer)
    # With a primal-carrying local tracer, only the branch that is
    # actually taken matters.
    return b ? xt : yt
end
```
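For example (assuming local tracing is eventually invoked with a primal input, as in the `jacobian_sparsity(f, x)` calls below): for `f(x) = ifelse(x[1] > 0, x[1], x[2])`, the global method keeps both inputs and yields the pattern `[1 1]`, while the local method at `x = [1.0, 2.0]` only propagates the inputs of the first branch and could yield `[1 0]`.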
Do we need to reintroduce local sparsity when we:
My answer would be yes, but only if the zero in question is a constant (not a variable which we are also tracing).
Then the question becomes: do we ever encounter non-traced numbers during tracing? It seems unlikely, since we're converting everything to tracers.
The current structure of our code would make it trivial to support the zero-product property and the additive identity.
And we could generally return empty tracers if the primal computation returns zero.
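A minimal sketch of what the zero-product property and additive identity could look like when the zero is a constant (non-traced) number, using a toy tracer type rather than the package's:

```julia
# Toy dependency-set tracer (illustrative, not the package's type).
struct ToyTracer
    inputs::Set{Int}
end

# Zero-product property: multiplying by a constant (non-traced) zero wipes
# the dependency set; any other constant keeps it. Dispatching on `Real`
# means the other operand is by construction not a traced variable.
Base.:*(t::ToyTracer, c::Real) = iszero(c) ? ToyTracer(Set{Int}()) : ToyTracer(t.inputs)
Base.:*(c::Real, t::ToyTracer) = t * c

# Additive identity: adding a constant (zero or not) never changes the set.
Base.:+(t::ToyTracer, c::Real) = ToyTracer(t.inputs)
Base.:+(c::Real, t::ToyTracer) = t + c
```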
> threshold to zero

What do you mean by this?

> And we could generally return empty tracers if the primal computation returns zero.

I'm worried we need to be extra careful when the zeros are not fixed but variables. For instance, we don't want to return an empty tracer for

```julia
f(x) = x .* x
jacobian_sparsity(f, [0.0])
```

even though there is a multiplication by zero occurring in the primal computation.
> we don't want to return an empty tracer

An argument could be made that the "global" Jacobian of `f` is `2x`, and that the "local" Jacobian of `f` evaluated at `x = 0` is therefore zero.
In other words, the local tracer could compute

$\mathbb{1}[\nabla \gamma] = \mathbb{1}[\partial_\alpha \varphi\vert_\alpha] \cdot \mathbb{1}[\nabla \alpha] \vee \mathbb{1}[\partial_\beta \varphi\vert_\beta] \cdot \mathbb{1}[\nabla \beta]$

instead of

$\mathbb{1}[\nabla \gamma] = \mathbb{1}[\partial_\alpha \varphi] \cdot \mathbb{1}[\nabla \alpha] \vee \mathbb{1}[\partial_\beta \varphi] \cdot \mathbb{1}[\nabla \beta] \quad .$
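For the `x .* x` example above, a quick numerical check of the local derivative (illustrative only, using ForwardDiff as the reference):

```julia
using ForwardDiff

g(x) = x * x
ForwardDiff.derivative(g, 0.0)  # == 2 * 0.0 == 0.0
# Both local partials of (α, β) -> α * β vanish at α = β = 0, so the pointwise
# pattern at x = [0.0] would be empty, while the global pattern stays [1].
```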
This is already what the theory means, no need for the $\vert_\alpha$ in addition. What you're suggesting is something else that I can't quite put my finger on.
What about

```julia
f(x) = x[1] * x[2]
jacobian_sparsity(f, [0.0, 1.0])
```
In the old world, we trace both inputs, and the output gets a dependency on both.
In the new world, we would only trace the second input? The first one gets an empty tracer from the start?
Basically, our global tracers $\mathbb{1}[\nabla \alpha] \neq 0$ mean "the gradient is not uniformly zero". Our local tracers (the way we saw them until now) mean "the gradient is not uniformly zero on a neighborhood of the current point". You want to restrict them further to "the gradient is not zero at the current point".
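One way to write the three notions side by side, with the indicator notation used above:

$\mathbb{1}\big[\nabla \alpha(x) \neq 0\big] \;\le\; \mathbb{1}\big[\nabla \alpha \not\equiv 0 \ \text{on a neighborhood of}\ x\big] \;\le\; \mathbb{1}\big[\nabla \alpha \not\equiv 0\big]$

so the pointwise notion is the least conservative of the three.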
Right, what I'm suggesting is to give an estimate of the sparsity of $J_f\vert_x$ that is as non-conservative as possible. Counter-intuitively, this could actually make the code run faster.
Right, but if we give every zero value an empty tracer, that boils down to saying "when the primal value is zero, every coefficient of the gradient / Hessian that involves the corresponding variable evaluates to zero at the current point", which I am almost certain is wrong.
I would steer clear of "practical speedups" that are not first validated by theory.
```julia
f(x) = exp(x[1]) * x[2]
jacobian_sparsity(f, [0.0, 1.0])
```
If we give an empty tracer to the first index, we get something wrong regardless of what we do next.
Essentially, even when we have zeros in our computational graph, we can't afford to forget where they come from. Zeros are only special at the end of the graph.
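Checking this claim numerically with ForwardDiff (illustrative, not part of the package):

```julia
using ForwardDiff

f(x) = exp(x[1]) * x[2]
ForwardDiff.gradient(f, [0.0, 1.0])  # [exp(0) * 1, exp(0)] == [1.0, 1.0]
# Both entries are nonzero at this point, so dropping the first index
# (because x[1] == 0) would produce a wrong local pattern.
```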
I think you're misreading my argument. You said:
> we don't want to return an empty tracer for
> `f(x) = x .* x`; `jacobian_sparsity(f, [0.0])`
> even though there is a multiplication by zero occurring in the primal computation

My point is that in this case, as written explicitly above, $\mathbb{1}[\partial_\alpha \varphi\vert_\alpha] = \mathbb{1}[\partial_\beta \varphi\vert_\beta] = 0$, so an empty tracer can be returned.

In the case of `f(x) = exp(x[1]) * x[2]`; `jacobian_sparsity(f, [0.0, 1.0])`, we have $\mathbb{1}[\partial_\alpha \varphi\vert_\alpha] \neq 0$, so according to the theory, the resulting tracer isn't empty.
Yeah, that was more of an answer to your comment on Slack, which we now both agree is wrong. I think your general idea, however, is right, and it only requires being more precise with the derivatives of the operators $\varphi$. The question is how precise.
For instance, derivatives of `cos` are zero on all of $\mathbb{Z}\pi$, but that's pretty useless in practice because we have a low chance of hitting those primal values. On the other hand, the derivative of `abs2` at zero is zero, and that's one we're relatively likely to hit.

And what we really don't want to screw up is operators with more than one valid derivative, a.k.a. points of nondifferentiability. For instance, `abs` is not differentiable at zero, so what each autodiff backend returns is an implementation detail: it doesn't have to be `0`, it doesn't even have to be in the subdifferential `[-1, 1]`. In that case, $\mathbb{1}[\partial \varphi]$ should only be `0` if all acceptable derivatives of the operator are zero, and `1` otherwise.
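A sketch of a local first-derivative classification that stays conservative at kinks (the function name is illustrative, not the package API):

```julia
# For abs, the derivative is ±1 away from zero; at zero, the set of acceptable
# derivatives is not {0}, so we must never report a zero derivative.
nonzero_der1(::typeof(abs), x) = true

# For abs2(x) = x^2, the derivative 2x genuinely vanishes only at zero,
# and abs2 is differentiable there, so the pointwise rule is safe.
nonzero_der1(::typeof(abs2), x) = !iszero(x)
```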
How about

```julia
f(x) = exp(x[1] * x[2])
jacobian_sparsity(f, [0.0, 1.0])
```

When we compute the product `x[1] * x[2]`, even though `x[1] == 0.0`, we cannot ditch the tracing for either argument, otherwise the final sparsity will be wrong.
I suggest we just look at `f(x) = x[1] * x[2]` at `[0.0, 1.0]`, since we overload at the operator level.
The corresponding Jacobian is $J_f = [x_2, x_1]$ and therefore $J_f\vert_x = [1, 0]$.
Theory tells us $\mathbb{1}[\nabla \gamma] = \mathbb{1}[\partial_\alpha \varphi] \cdot \mathbb{1}[\nabla \alpha] \vee \mathbb{1}[\partial_\beta \varphi] \cdot \mathbb{1}[\nabla \beta] = \mathbb{1}[\nabla \alpha]$, since $\partial_\beta \varphi = x_1 = 0$ at this point. So, somewhat unintuitively, only the gradient information from $x_1$ is propagated, resulting in `[1 0]`, matching the sparsity pattern of the Jacobian above.
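A quick ForwardDiff check of this worked example (illustrative only):

```julia
using ForwardDiff

f(x) = x[1] * x[2]
ForwardDiff.gradient(f, [0.0, 1.0])  # == [x[2], x[1]] == [1.0, 0.0]
# Only the first entry is nonzero, matching the local pattern [1 0] derived above.
```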
> The corresponding Jacobian

You mean gradient.

And yes, this theory would be implemented by the following functions:

```julia
nonzero_der1_arg1(::typeof(*), a, b) = !iszero(b)
nonzero_der1_arg2(::typeof(*), a, b) = !iszero(a)
```
I think we should add tests for these finer internal functions with ForwardDiff, instead of testing only the global classification of operators. They are very easy to screw up without realizing.
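A sketch of what such a test could look like, using the hypothetical `nonzero_der1_arg*` helpers from above (repeated here for self-containment, not an existing API):

```julia
using ForwardDiff, Test

nonzero_der1_arg1(::typeof(*), a, b) = !iszero(b)
nonzero_der1_arg2(::typeof(*), a, b) = !iszero(a)

@testset "local first-derivative rules for *" begin
    for (a, b) in ((0.0, 1.0), (1.0, 0.0), (2.0, 3.0), (0.0, 0.0))
        # Reference derivatives of (a, b) -> a * b, computed with ForwardDiff:
        d1 = ForwardDiff.derivative(a -> a * b, a)  # == b
        d2 = ForwardDiff.derivative(b -> a * b, b)  # == a
        @test nonzero_der1_arg1(*, a, b) == !iszero(d1)
        @test nonzero_der1_arg2(*, a, b) == !iszero(d2)
    end
end
```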
> I think we should add tests for these finer internal functions with ForwardDiff, instead of testing only the global classification of operators. They are very easy to screw up without realizing.

Opened #66 to track this.
> Once we have primal values, we will also need to add new operators, mostly those that work with `Bool`: `isequal`, `isless`, `isfinite`, `isinf`, `isnan`, possibly their friends `==`, `<`, `<=` and co (check the fallback structure for those again)
As a friendly reminder to myself tomorrow: we should also test on @Vaibhavdixit02's matrix exponentials.
Currently, all tracers return conservative global sparsity patterns. Local sparsity can be computed by propagating a primal value as well.