Closed: ccshan closed this 11 months ago
I don't understand everything that's going on here yet, but in brief, what is now the sparsity pattern of the gradients?
Same as the inputs' :-P but we coerce the pattern explicitly (`grad_t[el].project(np.paxes, np.vaxes)`) so that the tensor dimensions match and any unknown gradient element is set to `nan`.
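For intuition, here is a rough dense sketch of that coercion in plain PyTorch (not the PatternedTensor API; the mask is a made-up stand-in for the input's pattern):

```python
import torch

# Sketch of the coercion: keep the gradient only where the input's pattern has
# physical storage, and mark every other element as "uncomputed" with NaN.
grad_dense = torch.ones(2, 3)                       # stand-in for a densified grad_t[el]
pattern_mask = torch.tensor([[True, False, False],  # stand-in for the input's sparsity pattern
                             [False, True, False]])
nan_fill = torch.full_like(grad_dense, float('nan'))
coerced = torch.where(pattern_mask, grad_dense, nan_fill)
print(coerced)  # gradient now matches the input's pattern; unknown entries are nan
```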
OK, so suppose X is a diagonal matrix and y = X.sum(). Then dy/dX is a matrix of all 1's. What was happening before, and what happens with this PR? I think it should be okay to throw away all the off-diagonal gradients and keep just the diagonal.
Now suppose X is a full matrix and y = X.trace(). Then dy/dX is the identity matrix and I think the gradient could be stored sparsely, but dense is ok too, and the off-diagonal gradients should be 0, not NaN.
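For reference, a quick dense PyTorch check of those two gradients (ignoring PatternedTensor entirely):

```python
import torch

# y = X.sum(): dy/dX is a matrix of all 1's, regardless of X's values.
X = torch.eye(3, requires_grad=True)      # a diagonal X, stored densely here
X.sum().backward()
print(X.grad)                             # 3x3 matrix of ones

# y = X.trace(): dy/dX is the identity matrix.
X = torch.rand(3, 3, requires_grad=True)
X.trace().backward()
print(X.grad)                             # 3x3 identity matrix
```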
I'm having trouble understanding in what situation NaNs appear.
I have not tested the following by running any code (in which I design an FGG that denotes sum or trace), but conceptually:

- If the PatternedTensor dy/dX is a matrix of all 1's or an identity matrix stored densely, then what was happening before was that we would just return this matrix, but PyTorch expects a vector, hence the error `Function SumProductBackward returned an invalid gradient`. Now we "throw away all the off-diagonal gradients and keep just the diagonal".
- If the PatternedTensor dy/dX is the identity matrix stored sparsely, then we used to get it right by luck, but now we get it right by matching the sparsity pattern of the identity matrix against the sparsity pattern of X.
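To make the first case concrete, here is a minimal reproduction of that shape mismatch in plain PyTorch (`BadSum` is a made-up stand-in for SumProduct):

```python
import torch

class BadSum(torch.autograd.Function):
    """Sums a length-3 vector but returns a dense 3x3 gradient in backward."""
    @staticmethod
    def forward(ctx, x):
        return x.sum()

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out * torch.ones(3, 3)  # wrong shape: the input was a vector

x = torch.ones(3, requires_grad=True)
BadSum.apply(x).backward()
# fails with an error like:
# RuntimeError: Function BadSumBackward returned an invalid gradient at index 0
# - got [3, 3] but expected shape compatible with [3]
```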
I'm also having trouble understanding in what situation NaNs appear, because I haven't delved into the calls to `J`, `J_log`, `multi_solve`, `multi_mv`. Maybe dy/dX is always denser than (or equally dense as) X?
(I understand your X to be an input to this `torch.autograd.Function`, in other words `np.reincarnate(in_values[i])` where `el, np = ctx.in_labels[i]` for some `i`. I understand your y to be an output of this `torch.autograd.Function`, in other words something whose gradient dy/dX plays the role of `grad_t[el]`.)
Do you mean that NaNs do in fact appear, but we don't know why? @chihyang do you have any insights into this?
NaNs have never appeared in our tests.
In fact, `NaN` is in the gradient result. I changed the condition of printing gradients in bin/sum_product.py, so now we can see the `NaN`s in the printed gradients when `-G` is enabled (see https://github.com/diprism/fggs/actions/runs/6855599080). But these `NaN`s come from the function below after `backward` returns a result:
Ok but that's when the user demands a gradient with respect to a sparse tensor, right? Do we have any test producing a gradient whose `physical` contains `NaN`?
No. In our current test cases no result from `backward` contains `NaN`.
I'm not following the last few posts of this discussion...can you explain what "NaN is in the gradient result" means (and why)?
Say an input is

```
[[2.0, 0.0, 0.0]
 [0.0, 1.0, 0.0]]
```

which corresponds to

```python
paxis = PhysicalAxis(2)
in1 = PatternedTensor(torch.asarray([2.0, 1.0]), [paxis], [paxis, SumAxis(0, paxis, 1)])
```

`backward` computes the gradients with respect to each non-zero value above, and the gradients corresponding to them are stored inside `in1.physical`, which is `[v1, v2]`. When `PatternedTensor::grad` is invoked, the following tensor will be returned:

```
[[v1, NaN, NaN]
 [NaN, v2, NaN]]
```

`NaN` is the default value used by the following line in `PatternedTensor::grad`:

```python
PatternedTensor(p, self.paxes, self.vaxes, nan)
```
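For intuition, here is a plain dense PyTorch analogue of this example (the names are made up for illustration and are not the fggs API):

```python
import torch

# Only the two stored values get gradients from autograd; expanding those
# gradients back into the full 2x3 shape is what PatternedTensor::grad does,
# with NaN marking the positions it never computed.
physical = torch.tensor([2.0, 1.0], requires_grad=True)  # stands in for in1.physical
dense = torch.zeros(2, 3)
dense[0, 0] = physical[0]
dense[1, 1] = physical[1]
y = (dense * dense).sum()    # an arbitrary scalar function of the input
y.backward()
print(physical.grad)         # tensor([4., 2.]) -- the [v1, v2] above

# Hand-built picture of the densified gradient: computed entries where the
# pattern has physical storage, NaN everywhere else.
dense_grad = torch.full((2, 3), float('nan'))
dense_grad[0, 0] = physical.grad[0]
dense_grad[1, 1] = physical.grad[1]
print(dense_grad)            # [[4., nan, nan], [nan, 2., nan]]
```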
Maybe it helps to add that the statement "a PatternedTensor is equivalent to a Tensor with many zeros/defaults" is not accurate when it comes to autodiff. The gradient of a Tensor with many zeros/defaults may or may not have many zeros/defaults (as the example "y = X.sum()" illustrates), but in any case each element of the gradient is at least computed. In contrast, the gradient of a PatternedTensor is only computed where it is backed by `physical` storage. So after `y.backward()`, we put `NaN`s in the result of `X.grad.to_dense()` where the gradient is not computed.
@chihyang Thanks, I understand now. I haven't understood all the code changes yet, but I trust that they match up with the explanation above.
Fixes: #173

- To facilitate testing, add command-line option `-G` to differentiate wrt all factors
- Defunctionalize what `PatternedTensor.nonphysical()` used to return
- Add `PatternedTensor.project()` using `Axis.unify()`
- Change the default returned by `PatternedTensor.grad()` to `NaN` (because those gradients are not zero but uncomputed)
- Work around torch_semiring_einsum's not handling 0 elements