adrhill / SparseConnectivityTracer.jl

Fast operator-overloading Jacobian & Hessian sparsity detection.
MIT License

Add local tracers #56

Closed · adrhill closed 5 months ago

adrhill commented 6 months ago

Currently, all tracers return conservative global sparsity patterns. Local sparsity can be computed by propagating a primal value as well.
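A rough sketch of the idea (illustrative only, not the package's actual API): the local tracer carries the primal value alongside the set of input indices, so operator overloads can consult the primal while propagating the pattern.

# Illustrative sketch: a local tracer stores the primal value next to the
# index set of inputs this intermediate value depends on.
struct LocalTracer{T<:Real}
    primal::T         # primal value, propagated through the computation
    inputs::Set{Int}  # indices of the inputs this value depends on
end

# Example overload: the primal is updated as usual; the pattern passes through.
Base.exp(t::LocalTracer) = LocalTracer(exp(t.primal), t.inputs)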

gdalle commented 6 months ago

See #57

adrhill commented 5 months ago

Reopening since #59 didn't add local tracers.

gdalle commented 5 months ago

Operators with local behavior:

gdalle commented 5 months ago

Once we have primal values, we will also need to add new operators, mostly those that work with Bool:

gdalle commented 5 months ago

In a potential NNlib extension, we could add global and local classification of activation functions. Here are the ones with local sparsity:

gdalle commented 5 months ago

Oh and of course we need to add ifelse:

function Base.ifelse(b::Bool, xt::GlobalTracer, yt::GlobalTracer)
    return GlobalTracer(union(xt.inputs, yt.inputs))
end

function Base.ifelse(b::Bool, xt::LocalTracer, yt::LocalTracer)
    # The primal Bool is available, so only the taken branch is propagated.
    return b ? xt : yt
end

gdalle commented 5 months ago

Do we need to reintroduce local sparsity when we:

My answer would be yes, but only if the zero in question is a constant (not a variable which we are also tracing).

Then the question becomes: do we ever encounter non-traced numbers during tracing? It seems unlikely, since we're converting everything to tracers.

adrhill commented 5 months ago

The current structure of our code would make it trivial to support the zero-product property and the additive identity.

And we could generally return empty tracers if the primal computation returns zero.
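As a rough sketch of the zero-product shortcut (reusing the illustrative LocalTracer from above, not the package's actual code): when one factor is a non-traced constant, the result only depends on the traced argument if that constant is nonzero.

# Hypothetical shortcut for multiplication by a non-traced constant: if the
# constant is exactly zero, the result is a constant zero with no input
# dependencies; otherwise the tracer's inputs pass through unchanged.
function Base.:*(t::LocalTracer, c::Real)
    inputs = iszero(c) ? Set{Int}() : t.inputs
    return LocalTracer(t.primal * c, inputs)
end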

threshold to zero

What do you mean by this?

gdalle commented 5 months ago

And we could generally return empty tracers if the primal computation returns zero.

I'm worried we need to be extra careful when the zeros are not fixed but variables. For instance, we don't want to return an empty tracer for

f(x) = x .* x
jacobian_sparsity(f, [0.0])

even though there is a multiplication by zero occurring in the primal computation

adrhill commented 5 months ago

we don't want to return an empty tracer

An argument could be made that the "global" Jacobian of f is 2x and that the "local" Jacobian of f evaluated at x=0 therefore is zero.

adrhill commented 5 months ago

In other words, the local tracer could compute

$\mathbb{1}[\nabla \gamma] = \mathbb{1}[\partial_\alpha \varphi \vert_\alpha] \cdot \mathbb{1}[\nabla \alpha] \vee \mathbb{1}[\partial_\beta \varphi \vert_\beta] \cdot \mathbb{1}[\nabla \beta]$

instead of

$\mathbb{1}[\nabla \gamma] = \mathbb{1}[\partial_\alpha \varphi] \cdot \mathbb{1}[\nabla \alpha] \vee \mathbb{1}[\partial_\beta \varphi] \cdot \mathbb{1}[\nabla \beta] \quad .$

gdalle commented 5 months ago

This is already what the theory means, no need for the $\vert_\alpha$ in addition. What you're suggesting is something else that I can't quite put my finger on

gdalle commented 5 months ago

What about

f(x) = x[1] * x[2]
jacobian_sparsity(f, [0.0, 1.0])

In the old world, we trace both inputs, and the output gets a dependency on both.

In the new world, we would only trace the second input? The first one gets an empty tracer from the start?

gdalle commented 5 months ago

Basically, our global tracers $\mathbb{1}[\nabla \alpha] \neq 0$ mean "the gradient is not uniformly zero". Our local tracers (the way we saw them until now) mean "the gradient is not uniformly zero on a neighborhood of the current point". You want to restrict them further to "the gradient is not zero at the current point".
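As a concrete illustration (not from the thread), take f(x) = abs2(x[1]) traced at x = [0.0], where the gradient is $2 x_1$:

  • global ("not uniformly zero"): $2 x_1$ is not the zero function, so index 1 belongs to the pattern
  • local on a neighborhood: $2 x_1$ is nonzero arbitrarily close to $x_1 = 0$, so index 1 still belongs to the pattern
  • local at the point: $2 \cdot 0 = 0$, so the pattern is empty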

adrhill commented 5 months ago

Right, what I'm suggesting is to give an estimate of the sparsity of $J_f\vert_x$ that is as non-conservative as possible. Counter-intuitively, this could actually make the code run faster.

gdalle commented 5 months ago

Right but if we give every zero value an empty tracer, that boils down to saying "when the primal value is zero, every coefficient of the gradient / Hessian that involves the corresponding variable evaluates to zero at the current point". Which I am almost certain is wrong

gdalle commented 5 months ago

I would steer clear of "practical speedups" that are not first validated by theory

gdalle commented 5 months ago

f(x) = exp(x[1]) * x[2]
jacobian_sparsity(f, [0.0, 1.0])

If we give an empty tracer to the first index, we get something wrong regardless of what we do next

gdalle commented 5 months ago

Essentially, even when we have zeros in our computational graph, we can't afford to forget where they come from. Zeros are only special at the end of the graph

adrhill commented 5 months ago

I think you're misreading my argument. You said:

we don't want to return an empty tracer for

f(x) = x .* x
jacobian_sparsity(f, [0.0])

even though there is a multiplication by zero occurring in the primal computation

My point is that in this case, as written explicitly above, $\mathbb{1}[\partial_\alpha \varphi \vert_\alpha] = \mathbb{1}[\partial_\beta \varphi \vert_\beta] = 0$, so an empty tracer can be returned.

In the case of

f(x) = exp(x[1]) * x[2]
jacobian_sparsity(f, [0.0, 1.0])

$\mathbb{1}[\partial_\alpha \varphi \vert_\alpha] \neq 0$, so according to the theory, the resulting tracer isn't empty.

gdalle commented 5 months ago

Yeah that was more of an answer to your comment on Slack, which we now both agree is wrong. I think your general idea however is right, and it only requires being more precise with the derivatives of the operators $\varphi$. The question is how precise.

For instance, derivatives of cos are zero on all of $\mathbb{Z}\pi$, but that's pretty useless in practice cause we have a low chance of hitting those primal values. On the other hand, the derivative of abs2 at zero is zero, and that's one we're relatively likely to hit.

gdalle commented 5 months ago

And what we really don't want to screw up is operators with more than one valid derivative, aka points of nondifferentiability. For instance, abs is not differentiable at zero, so what each autodiff backend returns is an implementation detail. It doesn't have to be 0, it doesn't even have to be in the subdifferential [-1, 1]. In that case, $\mathbb{1}[\partial \varphi]$ should only be 0 if all acceptable derivatives of the operator are zero, and 1 otherwise.
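A sketch of what such per-point classifications could look like for unary operators, in the same spirit as the binary helpers further down (hypothetical names, not the package's actual API):

# Hypothetical per-point classification of first derivatives for unary operators.
nonzero_der1(::typeof(abs2), x) = !iszero(x)      # d/dx abs2(x) = 2x, zero only at x = 0
nonzero_der1(::typeof(cos), x) = !iszero(sin(x))  # d/dx cos(x) = -sin(x), zero on Zπ
# abs is nondifferentiable at zero; not all acceptable derivatives are zero
# there, so the conservative classification is "nonzero" everywhere.
nonzero_der1(::typeof(abs), x) = true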

adrhill commented 5 months ago

How about for

f(x) = exp(x[1] * x[2])
jacobian_sparsity(f, [0.0, 1.0])

When we compute the product x[1] * x[2], even though x[1] == 0.0, we cannot ditch the tracing for either argument, otherwise the final sparsity will be wrong

I suggest we just look at f(x) = x[1] * x[2] at [0.0, 1.0], since we overload at the operator level.

The corresponding Jacobian is $J_f = [x_2, x_1]$ and therefore $J_f\vert_x = [1, 0]$.

Theory tells us $\mathbb{1}[\nabla \gamma] = \mathbb{1}[\partial_\alpha \varphi] \cdot \mathbb{1}[\nabla \alpha] \vee \mathbb{1}[\partial_\beta \varphi] \cdot \mathbb{1}[\nabla \beta] = \mathbb{1}[\nabla \alpha]$, so somewhat unintuitively, only the gradient information from $x_1$ is propagated, resulting in $[1, 0]$, matching the sparsity pattern of the Jacobian above.

gdalle commented 5 months ago

The corresponding Jacobian

You mean gradient.

And yes, this theory would be implemented by the following functions:

nonzero_der1_arg1(::typeof(*), a, b) = !iszero(b)  # ∂(a * b) / ∂a = b
nonzero_der1_arg2(::typeof(*), a, b) = !iszero(a)  # ∂(a * b) / ∂b = a

gdalle commented 5 months ago

I think we should add tests for these finer internal functions with ForwardDiff, instead of testing only the global classification of operators. They are very easy to screw up without realizing.
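One possible shape for such a test (a sketch, assuming the nonzero_der1_arg* helpers above are accessible; the test points are arbitrary and deliberately include zeros):

using ForwardDiff, Test

# The local classification of * should agree with the actual partial
# derivatives computed by ForwardDiff at the same point.
for (a, b) in ((0.0, 1.0), (2.0, 0.0), (3.0, 4.0))
    ga, gb = ForwardDiff.gradient(v -> v[1] * v[2], [a, b])
    @test nonzero_der1_arg1(*, a, b) == !iszero(ga)
    @test nonzero_der1_arg2(*, a, b) == !iszero(gb)
end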

adrhill commented 5 months ago

I think we should add tests for these finer internal functions with ForwardDiff, instead of testing only the global classification of operators. They are very easy to screw up without realizing.

Opened #66 to track this.

adrhill commented 5 months ago

Once we have primal values, we will also need to add new operators, mostly those that work with Bool:

  • isequal, isless, isfinite, isinf, isnan
  • possibly their friends ==, <, <= and co (check the fallback structure for those again)
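A hypothetical sketch of how such operators could behave once primal values are available (reusing the illustrative LocalTracer from earlier in the thread): the comparison is evaluated on the primal values and returns a plain Bool, which can then drive control flow such as the ifelse overload above.

# Illustrative only: comparisons on local tracers evaluate on the primal
# values and return a plain Bool, dropping the sparsity information.
Base.isless(a::LocalTracer, b::LocalTracer) = isless(a.primal, b.primal)
Base.isfinite(t::LocalTracer) = isfinite(t.primal)
Base.isnan(t::LocalTracer) = isnan(t.primal)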

As a friendly reminder to myself tomorrow: we should also test on @Vaibhavdixit02's matrix exponentials.