The regular one would yield a huge per-example gradient (size of the penultimate layer * number of classes) and would cause memory / storage to blow up. We leverage the decomposition grad(loss, weights of last fully connected layer) = outer product(loss gradient, outputs of the penultimate layer), which follows from the chain rule. Details can be found in Appendix F of the paper, "Fast Random Projections for Gradients of Fully-Connected Layers". Note: the random projection is optional and would lose some information.
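Concretely, for a final Dense layer with kernel W of shape [m, n], penultimate activations a (m floats) and logits z = a @ W + b (n floats), the chain rule gives grad(loss, W) = outer(a, dloss/dz), so storing the two factors (m + n floats) is enough to reconstruct the m * n weight gradient. A tiny self-contained sketch that checks the identity (illustrative toy code, not from the repo; all names are made up):

```python
import tensorflow as tf

m, n = 4, 3                                  # toy sizes; 2048 and 1000 in the resnet50 case
a = tf.random.normal([m])                    # penultimate-layer output
w = tf.Variable(tf.random.normal([m, n]))    # last fully-connected kernel
y_true = tf.one_hot(1, n)

with tf.GradientTape(persistent=True) as tape:
    z = tf.linalg.matvec(w, a, transpose_a=True)    # logits, shape [n]
    tape.watch(z)
    loss = tf.nn.softmax_cross_entropy_with_logits(labels=y_true, logits=z)

grad_w = tape.gradient(loss, w)              # full m x n weight gradient
dl_dz = tape.gradient(loss, z)               # loss gradient w.r.t. logits, shape [n]
outer = tf.einsum('m,n->mn', a, dl_dz)       # outer product of the two factors

# grad_w and outer agree up to float error:
print(tf.reduce_max(tf.abs(grad_w - outer)).numpy())
```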
Oh, stupid me, you're right. It's been staring me in the face all along. It's basically the fast/final expression from Appendix F (bottom of page 14): O(m + n) instead of O(m x n), where m=2048 and n=1000, based on this network definition:
# For the resnet model definition, see https://github.com/frederick0329/TracIn/blob/master/imagenet/resnet50/resnet.py
# Layer[-3]:
#   x = tf.keras.layers.GlobalAveragePooling2D()(x)  [2048 floats]
# Layer[-2]:
#   x = tf.keras.layers.Dense(
#       num_classes,
#       kernel_initializer=tf.keras.initializers.RandomNormal(stddev=0.01),
#       kernel_regularizer=_gen_l2_regularizer(use_l2_regularizer),
#       bias_regularizer=_gen_l2_regularizer(use_l2_regularizer),
#       name='fc1000')(x)  [1000 floats]
# Layer[-1]:
#   # A softmax that is followed by the model loss cannot be done
#   # in float16 due to numeric issues. So we pass dtype=float32.
#   x = tf.keras.layers.Activation('softmax', dtype='float32')(x)  [1000 probas]
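A quick size check of m and n for this network (just arithmetic, not code from the repo):

```python
m, n = 2048, 1000            # GlobalAveragePooling2D output size, number of classes
full_grad = m * n            # per-example gradient of the fc1000 kernel: 2,048,000 floats
factored = m + n             # activations + loss gradient w.r.t. logits: 3,048 floats
print(full_grad / factored)  # roughly 672x fewer floats per example
```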
Brilliant!
Hello again, @frederick!
I have a question about a difference between the paper and the notebook you've kindly shared with us. To make the difference I'd like to discuss easier to see, I've rewritten the find() function from the notebook:
Above, I've added a comment to the line that implements Equation (1) in the paper. What's interesting to me is that the formulation you use in the notebook is different: you compute both the loss-gradient similarity and the activation similarity, and then take their product.
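To make the contrast concrete, here is a toy sketch of the two scores for a single (train, test) pair at one checkpoint, restricted to the last fully connected layer's weights (my own illustration, not the notebook's find(); all names are made up):

```python
import tensorflow as tf

m, n = 2048, 1000
a_train, a_test = tf.random.normal([m]), tf.random.normal([m])   # penultimate activations
g_train, g_test = tf.random.normal([n]), tf.random.normal([n])   # loss grads w.r.t. logits

# Equation (1), restricted to the last layer: dot product of the full
# weight gradients, grad = outer(a, g), flattened.
full_train = tf.reshape(tf.einsum('m,n->mn', a_train, g_train), [-1])
full_test = tf.reshape(tf.einsum('m,n->mn', a_test, g_test), [-1])
score_eq1 = tf.tensordot(full_train, full_test, axes=1)

# The factored form: (loss-gradient similarity) * (activation similarity).
score_factored = tf.tensordot(g_train, g_test, axes=1) * tf.tensordot(a_train, a_test, axes=1)

# The two coincide up to float error, since
# <outer(a1, g1), outer(a2, g2)> = <a1, a2> * <g1, g2>.
print(score_eq1.numpy(), score_factored.numpy())
```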
Would you mind sharing with us what the motivation is for using this formulation instead of the "regular" one?
Thanks again!
-- Phil