diprism / fggs

Factor Graph Grammars in Python
MIT License

Don't assume in SumProduct that each input has the same sparsity pattern as its gradient (grad_t) #177

Closed: ccshan closed this 10 months ago

ccshan commented 10 months ago

Fixes: #173

To facilitate testing, add a command-line option -G to differentiate with respect to all factors

Defunctionalize what PatternedTensor.nonphysical() used to return

Add PatternedTensor.project() using Axis.unify()

Change default returned by PatternedTensor.grad() to NaN (because those gradients are not zero but uncomputed)

Work around torch_semiring_einsum's not handling 0 elements

davidweichiang commented 10 months ago

I don't understand everything that's going on here yet, but in brief, what is now the sparsity pattern of the gradients?

ccshan commented 10 months ago

> I don't understand everything that's going on here yet, but in brief, what is now the sparsity pattern of the gradients?

Same as the inputs' :-P but we coerce the pattern explicitly (grad_t[el].project(np.paxes, np.vaxes)) so that the tensor dimensions match and any unknown gradient element is set to nan.

davidweichiang commented 10 months ago

OK, so suppose X is a diagonal matrix and y = X.sum(). Then dy/dX is a matrix of all 1's. What was happening before, and what happens with this PR? I think it should be okay to throw away all the off-diagonal gradients and keep just the diagonal.

Now suppose X is a full matrix and y = X.trace(). Then dy/dX is the identity matrix and I think the gradient could be stored sparsely, but dense is ok too, and the off-diagonal gradients should be 0, not NaN.
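
Both of those gradients can be checked quickly in plain PyTorch; a sketch, separate from the fggs code:

```python
import torch

# Case 1: X stored as a dense diagonal matrix; y = X.sum().
X = torch.diag(torch.tensor([2.0, 3.0, 5.0])).requires_grad_()
X.sum().backward()
print(X.grad)   # a 3x3 matrix of all 1's, including the off-diagonal positions

# Case 2: X is a full matrix; y = X.trace().
X = torch.rand(3, 3, requires_grad=True)
X.trace().backward()
print(X.grad)   # the identity matrix: 1 on the diagonal, 0 (not NaN) elsewhere
```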

I'm having trouble understanding in what situation NaNs appear.

ccshan commented 10 months ago

> OK, so suppose X is a diagonal matrix and y = X.sum(). Then dy/dX is a matrix of all 1's. What was happening before, and what happens with this PR? I think it should be okay to throw away all the off-diagonal gradients and keep just the diagonal.

> Now suppose X is a full matrix and y = X.trace(). Then dy/dX is the identity matrix and I think the gradient could be stored sparsely, but dense is ok too, and the off-diagonal gradients should be 0, not NaN.

I have not tested the following by running any code (in which I would design an FGG that denotes sum or trace), but conceptually: If the PatternedTensor dy/dX is a matrix of all 1's or an identity matrix stored densely, then what happened before was that we would just return this matrix, but PyTorch expects a vector (the shape of X's physical storage), hence the error "Function SumProductBackward returned an invalid gradient". Now we "throw away all the off-diagonal gradients and keep just the diagonal". If the PatternedTensor dy/dX is the identity matrix stored sparsely, then we used to get it right by luck, but now we get it right by matching the sparsity pattern of the identity matrix against the sparsity pattern of X.
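
To make the shape mismatch concrete, here is a toy custom autograd Function (DiagSum is a hypothetical stand-in, not the actual SumProduct): its input is the physical vector behind a diagonal matrix, so its backward must return a gradient shaped like that vector rather than like the dense dy/dX.

```python
import torch

class DiagSum(torch.autograd.Function):
    """y = sum of a diagonal matrix whose diagonal is stored as a vector."""

    @staticmethod
    def forward(ctx, diag_vec):
        ctx.n = diag_vec.shape[0]
        return torch.diag(diag_vec).sum()

    @staticmethod
    def backward(ctx, grad_y):
        # Conceptually dy/dX is a full matrix of ones...
        dense_grad = grad_y * torch.ones(ctx.n, ctx.n)
        # ...but autograd expects a gradient shaped like the input (a vector).
        # Returning dense_grad here raises an "invalid gradient" error, like
        # the SumProductBackward error above; projecting onto the input's
        # storage (here, keeping just the diagonal) fixes it.
        return dense_grad.diagonal()

x = torch.tensor([2.0, 3.0, 5.0], requires_grad=True)
DiagSum.apply(x).backward()
print(x.grad)  # tensor([1., 1., 1.])
```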

> I'm having trouble understanding in what situation NaNs appear.

I'm also having trouble understanding in what situation NaNs appear, because I haven't delved into the calls to J, J_log, multi_solve, and multi_mv. Maybe dy/dX is always at least as dense as X?

(I understand your X to be an input to this torch.autograd.Function, in other words np.reincarnate(in_values[i]) where el, np = ctx.in_labels[i] for some i. I understand your y to be an output of this torch.autograd.Function, in other words something whose gradient dy/dX plays the role of grad_t[el].)

davidweichiang commented 10 months ago

Do you mean that NaNs do in fact appear, but we don't know why? @chihyang do you have any insights into this?

ccshan commented 10 months ago

> Do you mean that NaNs do in fact appear, but we don't know why? @chihyang do you have any insights into this?

NaNs have never appeared in our tests.

chihyang commented 10 months ago

> Do you mean that NaNs do in fact appear, but we don't know why? @chihyang do you have any insights into this?

> NaNs have never appeared in our tests.

In fact, NaN is in the gradient result. I changed the condition for printing gradients in bin/sum_product.py, so now we can see the NaNs in the printed gradients when -G is enabled (see https://github.com/diprism/fggs/actions/runs/6855599080). But these NaNs come from the function below, after backward returns a result:

https://github.com/diprism/fggs/blob/ba90b1e324790512d30a7fc8281af6c20456cb67/fggs/indices.py#L688-L692

ccshan commented 10 months ago

> Do you mean that NaNs do in fact appear, but we don't know why? @chihyang do you have any insights into this?

> NaNs have never appeared in our tests.

> In fact, NaN is in the gradient result. I changed the condition for printing gradients in bin/sum_product.py, so now we can see the NaNs in the printed gradients when -G is enabled (see https://github.com/diprism/fggs/actions/runs/6855599080). But these NaNs come from the function below, after backward returns a result:

Ok but that's when the user demands a gradient with respect to a sparse tensor, right? Do we have any test producing a gradient whose physical contains NaN?

chihyang commented 10 months ago

> Do you mean that NaNs do in fact appear, but we don't know why? @chihyang do you have any insights into this?

> NaNs have never appeared in our tests.

> In fact, NaN is in the gradient result. I changed the condition for printing gradients in bin/sum_product.py, so now we can see the NaNs in the printed gradients when -G is enabled (see https://github.com/diprism/fggs/actions/runs/6855599080). But these NaNs come from the function below, after backward returns a result:

> Ok but that's when the user demands a gradient with respect to a sparse tensor, right? Do we have any test producing a gradient whose physical contains NaN?

No. In our current test cases no result from backward contains NaN.

davidweichiang commented 10 months ago

I'm not following the last few posts of this discussion... can you explain what "NaN is in the gradient result" means (and why)?

chihyang commented 10 months ago

> I'm not following the last few posts of this discussion... can you explain what "NaN is in the gradient result" means (and why)?

Say an input is

[[2.0, 0.0, 0.0] 
 [0.0, 1.0, 0.0]]

which corresponds to

paxis = PhysicalAxis(2)
in1 = PatternedTensor(torch.asarray([2.0, 1.0]), [paxis], [paxis, SumAxis(0, paxis, 1)])

backward computes the gradients with respect to each non-zero value above; those gradients are stored physically, parallel to in1.physical, as [v1, v2]. When PatternedTensor::grad is invoked, the following tensor is returned:

[[v1, NaN, NaN] 
 [NaN, v2, NaN]]

NaN is the default value used by the following line in PatternedTensor::grad:

PatternedTensor(p, self.paxes, self.vaxes, nan)
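
The same bookkeeping can be mimicked in plain PyTorch (a sketch that does not use PatternedTensor itself; the loss y = (dense * dense).sum() is an arbitrary stand-in for whatever the FGG computes):

```python
import torch

# Physical storage backing the pattern [[p0, 0, 0], [0, p1, 0]].
physical = torch.tensor([2.0, 1.0], requires_grad=True)
rows, cols = torch.tensor([0, 1]), torch.tensor([0, 1])

dense = torch.zeros(2, 3)
dense[rows, cols] = physical      # only these two elements are backed by storage
y = (dense * dense).sum()         # arbitrary scalar loss
y.backward()

print(physical.grad)              # tensor([4., 2.]) -- the [v1, v2] above

# Dense view of the gradient: computed values where there is physical storage,
# NaN (not zero) wherever no gradient was ever computed.
grad_view = torch.full((2, 3), float('nan'))
grad_view[rows, cols] = physical.grad
print(grad_view)
```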
ccshan commented 10 months ago

Maybe it helps to add that the statement "a PatternedTensor is equivalent to a Tensor with many zeros/defaults" is not accurate when it comes to autodiff. The gradient of a Tensor with many zeros/defaults may or may not have many zeros/defaults (as the example "y = X.sum()" illustrates), but in any case each element of the gradient is at least computed. In contrast, the gradient of a PatternedTensor is only computed where it is backed by physical storage. So after y.backward(), we put NaNs in the result of X.grad.to_dense() where the gradient is not computed.
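
For contrast, here is the dense counterpart of chihyang's example (same values, same arbitrary loss as in the sketch above): when the values are stored as an ordinary Tensor, every gradient element gets computed, even where the stored value happens to be zero.

```python
import torch

X = torch.tensor([[2.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]], requires_grad=True)
(X * X).sum().backward()
print(X.grad)
# tensor([[4., 0., 0.],
#         [0., 2., 0.]])
# The off-pattern entries are 0 here only because of this particular loss;
# with y = X.sum() they would all be 1. Either way they are computed, never NaN.
```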

davidweichiang commented 10 months ago

@chihyang Thanks, I understand now. I haven't understood all the code changes yet, but I trust that they match up with the explanation above.