Hi @awe2,
Thanks for your question and interest in our work!
Yes, I understand your concern, and you are right: what I am doing is equivalent to summing over the output dimensions of the logits. In other words, I treat the output of the network as the sum of the logits instead of as a multi-dimensional output. This is much faster than back-propagating through each output dimension separately, and it also works well in practice. It is definitely possible to expand the NTK into a [num_samples * num_out_dim] x [num_samples * num_out_dim] matrix instead.
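For concreteness, here is a minimal sketch of the summed-logit kernel (a toy linear net and random inputs stand in for the searched architecture and real data; this is not the repository's code):

```python
import torch

torch.manual_seed(0)

# Toy stand-ins: a tiny linear net with 2 outputs and a small batch of inputs.
net = torch.nn.Linear(4, 2, bias=False)
x = torch.randn(3, 4)                 # num_samples = 3, num_out_dim = 2

# "Sum of the logits" view: one backward pass per sample, whose gradient
# equals the sum of the per-output gradients.
grads = []
for xi in x:
    net.zero_grad()
    net(xi.unsqueeze(0)).sum().backward()
    grads.append(torch.cat([p.grad.flatten() for p in net.parameters()]))
J_sum = torch.stack(grads)            # [num_samples, num_params]
ntk = J_sum @ J_sum.t()               # [num_samples, num_samples]
print(ntk.shape)                      # torch.Size([3, 3])
```

The expanded variant would instead keep one gradient row per (sample, output dim) pair, giving the [num_samples * num_out_dim] x [num_samples * num_out_dim] kernel mentioned above.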
Hope that helps!
It is certainly faster, but I can calculate the NTK by hand for my simple network, and the results aren't the same. I'm not a math whiz: do you expect this transformation to leave the value of the condition number unchanged?
If the output of the network is the sum of the logits instead of the logits themselves, would the neural architectures you are searching over have the same response? That is, I see an immediate application value in searching over architectures for image classification, where we are (naively) concerned with networks that output logits, but I am not sure what the value of a search over networks that predict the sum of the logits would be.
I'm relatively new to the field; is there something I am overlooking?
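For what it's worth, when I write the two kernels out, the summed version looks like a two-sided compression of the full per-output NTK, so I would not expect their spectra (and hence condition numbers) to agree in general; please correct me if this is wrong:

```latex
% Full per-output NTK (an NC x NC matrix), samples n, m and output dims i, j:
%   \Theta_{(n,i),(m,j)} = \langle \nabla_\theta f_i(x_n), \nabla_\theta f_j(x_m) \rangle
% Summed-logit kernel (an N x N matrix), with g(x) = \sum_i f_i(x):
\Theta^{\mathrm{sum}}_{n,m}
  = \big\langle \nabla_\theta g(x_n),\, \nabla_\theta g(x_m) \big\rangle
  = \sum_{i=1}^{C} \sum_{j=1}^{C} \Theta_{(n,i),(m,j)}
```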
Thanks for your questions.
Ah, perfect, I think I understand. Thanks!
Howdy!
In https://github.com/VITA-Group/TENAS/blob/main/lib/procedures/ntk.py, at line 45, I am confused about your calculation of the NTK, and I believe you may be misusing the first argument of the torch.Tensor.backward() function.
For example, when playing with the codebase using a very small 8-parameter network with 2 outputs (where, for this explanation, I have modified the code, and where by J I mean your 'grad' list for a single network; see lines 45 & 46):
for
J: [tensor([[-0.6255, -0.5019, 0.1758, 0.1411, 0.0000, 0.0000, -0.0727, -0.4643], [ 0.9368, -0.0947, -0.2633, 0.0266, 0.0000, 0.0000, 0.0955, -0.0812]])]
=======
for
J: [tensor([[ 0.1540, 0.1236, -0.6473, -0.5194, -0.0727, -0.4643, 0.0000, 0.0000], [-0.2307, 0.0233, 0.9694, -0.0980, 0.0955, -0.0812, 0.0000, 0.0000]])]
=======
for
J: [tensor([[-0.4715, -0.3783, -0.4715, -0.3783, -0.0727, -0.4643, -0.0727, -0.4643], [ 0.7061, -0.0714, 0.7062, -0.0714, 0.0955, -0.0812, 0.0955, -0.0812]])]
"""
And so you can verify that your code is adding the two components together to get the last result.
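Here is a self-contained way to reproduce the same behaviour with a hypothetical toy net (the names and shapes below are placeholders, not the code from the repo; I am assuming the backward call passes a ones-like tensor as its first argument, which is what the summed rows suggest):

```python
import torch

torch.manual_seed(0)

# Hypothetical 8-weight, 2-output toy net (not the actual searched cell).
net = torch.nn.Linear(4, 2, bias=False)
x = torch.randn(1, 4)
logit = net(x)                              # shape [1, 2]
params = list(net.parameters())

# Per-output gradients, i.e. the two rows of the true Jacobian.
g0 = torch.autograd.grad(logit[0, 0], params, retain_graph=True)[0].flatten()
g1 = torch.autograd.grad(logit[0, 1], params, retain_graph=True)[0].flatten()

# What backward(torch.ones_like(logit)) computes: the vector-Jacobian product
# with an all-ones vector, i.e. the two rows added together.
net.zero_grad()
logit.backward(torch.ones_like(logit))
g_sum = params[0].grad.flatten()

print(torch.allclose(g0 + g1, g_sum))       # True
```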
The problem is that the Jacobian should have shape number_samples x (number_outputs x number_weights); see your own paper, page 2, where the Jacobian's components are defined with the subscript i, the i-th output of the model.
If I am right, then any network with multiple outputs would have its NTK value incorrectly calculated, and would have a time and memory footprint that is systematically reduced by the fact that these gradients are being pooled together.
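Something like the sketch below is what I would have expected instead (a toy linear net stands in for the real architecture; net, x, and the shapes are illustrative only):

```python
import torch

torch.manual_seed(0)

net = torch.nn.Linear(4, 2, bias=False)
x = torch.randn(3, 4)                       # N = 3 samples, C = 2 outputs
params = list(net.parameters())
logits = net(x)
N, C = logits.shape

# Full Jacobian: one row per (sample, output) pair.
rows = []
for n in range(N):
    for i in range(C):
        g = torch.autograd.grad(logits[n, i], params, retain_graph=True)
        rows.append(torch.cat([gi.flatten() for gi in g]))
J = torch.stack(rows)                       # [N * C, number_weights]
ntk_full = J @ J.t()                        # [N * C, N * C]

# Summed-logit kernel (what the current code effectively computes), for comparison.
J_sum = J.reshape(N, C, -1).sum(dim=1)      # [N, number_weights]
ntk_sum = J_sum @ J_sum.t()                 # [N, N]

def cond(k):
    eig = torch.linalg.eigvalsh(k)
    return (eig[-1] / eig[0]).item()

print(ntk_full.shape, ntk_sum.shape)        # [6, 6] vs [3, 3]
print(cond(ntk_full), cond(ntk_sum))        # generally not equal
```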