Closed zoeoz closed 5 years ago
Although NNEF does not say anything about this, I believe this is mathematically well defined, and all frameworks should be doing it the same way: if you do the math (for example, calculate back-propagated gradients for `a + b` wrt `a` and `b`), it appears that you have to sum up the incoming errors to the bias term (so in the backward graph, `sum_reduce` operations should be inserted implicitly, which actually account for the shape difference if you squeeze out the reduced axes). In fact, even if the bias is not singular but has a separate value for each channel, there is already a summation over spatial positions for each channel (`sum_reduce` over the spatial axes).
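To make the implicit reduction concrete, here is a minimal numpy sketch (the NCHW layout and the particular shapes are assumptions for illustration, not anything NNEF prescribes): the gradient wrt a broadcast bias is the incoming error summed over every axis the broadcast duplicated.

```python
import numpy as np

x = np.random.rand(2, 3, 4, 4)     # input, shape [N, C, H, W] (assumed layout)
bias = np.random.rand(1, 3, 1, 1)  # per-channel bias, broadcast over N, H, W
y = x + bias                       # forward: broadcast add

dy = np.random.rand(*y.shape)      # incoming error from downstream

# backward: the gradient wrt bias sums over every broadcast axis,
# which is exactly the implicit sum_reduce in the backward graph
dbias = dy.sum(axis=(0, 2, 3), keepdims=True)
assert dbias.shape == bias.shape   # squeezing these axes recovers bias's shape
```

The gradient wrt `x` needs no such reduction, since `x` and `y` already share a shape.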
Thanks. Very helpful answer. It appears you are correct, even when considering just spatial positions and no channels. In this case, we normally don't `sum_reduce` over the other error terms for each bias value because they are zero, similar to the weights (see first part of figure below). If the bias is singular, the error terms are non-zero and need to be summed, similar to the inputs (see second part of figure).
Understanding that NNEF does not specify what it means to correctly evaluate or train a computation graph, is there any advice, best practice, or recommendation on how to handle tensor shape conflicts that can arise from tensor broadcasts in a backpropagation calculation during the training of learning parameters?
For example, if the `bias` argument of the `linear` operation is a singular tensor (all dimension extents are 1) and the matrix product `C` of `input` and `filter` is a non-singular tensor, then in a typical "forward" evaluation pass of the computation graph, the addition of `bias` and `C` is performed as a "one to many" broadcast operation where the value of `bias` is duplicated for each element of `C` without any issue. In a typical "backpropagation" calculation at training time, however, `bias` and `C` need to have the same shape, since there are now chain-rule errors that need to be back-propagated from the result of the broadcast operation, creating an ambiguous "many to one" relationship between the chain-rule errors and the singular `bias` value actually defined in the computation graph.

In practice, the issue only arises at training time when the singular `bias` value is specified in the computation graph as a variable tensor, since in that case NNEF does not appear to define any mechanism for the user to resolve the tensor shape conflict. In other words, each implicit duplication of the `bias` value in the broadcast has a unique chain-rule error associated with it, which the user has no (standard) ability to access, since the shape of the variable tensor actually defined in the computation graph doesn't have enough volume.
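One way frameworks commonly resolve the "many to one" relationship (a sketch of common practice, not anything NNEF defines; `reduce_grad_to_shape` is a hypothetical helper name) is to sum the incoming error over every broadcast axis until it regains the variable's declared shape, so the singular `bias` receives a single accumulated gradient:

```python
import numpy as np

def reduce_grad_to_shape(grad, shape):
    """Sum a gradient over broadcast axes so it matches the variable's shape."""
    # sum away leading axes that broadcasting prepended
    while grad.ndim > len(shape):
        grad = grad.sum(axis=0)
    # sum (keeping dims) over axes where the variable has extent 1
    for axis, extent in enumerate(shape):
        if extent == 1 and grad.shape[axis] != 1:
            grad = grad.sum(axis=axis, keepdims=True)
    return grad

# singular bias: every duplicated use contributes one chain-rule error,
# and all of them are accumulated into the single bias value
dC = np.ones((2, 3, 4))                       # errors at the broadcast result
dbias = reduce_grad_to_shape(dC, (1, 1, 1))   # -> shape (1, 1, 1), value 24.0
assert dbias.shape == (1, 1, 1)
```

The same helper also covers the per-channel case, e.g. `reduce_grad_to_shape(dC, (1, 3, 1))` sums only over the batch and spatial axes.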