Closed zoeoz closed 5 years ago
Although NNEF does not say anything about this, I believe this is mathematically well defined, and all frameworks should be doing it the same way: if you do the math (for example, calculate back-propagated gradients for `a + b` wrt `a` and `b`), it appears that you have to sum up the incoming errors to the bias term (so in the backward graph, `sum_reduce` operations should be inserted implicitly, which actually account for the shape difference if you squeeze out the reduced axes). In fact, even if the bias is not singular but has a separate value for each channel, there is already a summation over spatial positions for each channel (`sum_reduce` over the spatial axes).
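To make the implicit reduction concrete, here is a minimal numpy sketch (the NCHW layout and the particular shapes are assumptions for illustration, not anything NNEF prescribes): the gradient wrt a broadcast bias is the incoming error summed over every axis the broadcast duplicated.

```python
import numpy as np

x = np.random.rand(2, 3, 4, 4)     # input, shape [N, C, H, W] (assumed layout)
bias = np.random.rand(1, 3, 1, 1)  # per-channel bias, broadcast over N, H, W
y = x + bias                       # forward: broadcast add

dy = np.random.rand(*y.shape)      # incoming error from downstream

# backward: the gradient wrt bias sums over every broadcast axis,
# which is exactly the implicit sum_reduce in the backward graph
dbias = dy.sum(axis=(0, 2, 3), keepdims=True)
assert dbias.shape == bias.shape   # squeezing these axes recovers bias's shape
```

The gradient wrt `x` needs no such reduction, since `x` and `y` already share a shape.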
Thanks. Very helpful answer. It appears you are correct, even when considering just spatial positions and no channels. In this case, we normally don't `sum_reduce` over the other error terms for each bias value because they are zero, similar to the weights (see first part of figure below). If the bias is singular, the error terms are non-zero and need to be summed, similar to the inputs (see second part of figure).
Understanding that NNEF does not specify what it means to correctly evaluate or train a computation graph, is there any advice, best practice, or recommendation on how to handle tensor shape conflicts that can arise from tensor broadcasts in a backpropagation calculation during the training of learning parameters?
For example, if the `bias` argument of the `linear` operation is a singular tensor (all dimension extents are 1) and the matrix product `C` of `input` and `filter` is a non-singular tensor, then in a typical "forward" evaluation pass of the computation graph, the addition of `bias` and `C` is performed as a "one to many" broadcast operation where the value of `bias` is duplicated for each element of `C` without any issue. In a typical "backpropagation" calculation at training time, however, `bias` and `C` need to have the same shape, since there are now chain-rule errors that need to be back-propagated from the result of the broadcast operation, creating an ambiguous "many to one" relationship between the chain-rule errors and the singular `bias` value actually defined in the computation graph.

In practice, the issue only arises at training time when the singular `bias` value is specified in the computation graph as a variable tensor, since in that case NNEF does not appear to define any mechanism for the user to resolve the tensor shape conflict. In other words, each implicit duplication of the `bias` value in the broadcast has a unique chain-rule error associated with it, which the user has no (standard) ability to access, since the shape of the variable tensor actually defined in the computation graph doesn't have enough volume.
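One way frameworks commonly resolve the "many to one" relationship (a sketch of common practice, not anything NNEF defines; `reduce_grad_to_shape` is a hypothetical helper name) is to sum the incoming error over every broadcast axis until it regains the variable's declared shape, so the singular `bias` receives a single accumulated gradient:

```python
import numpy as np

def reduce_grad_to_shape(grad, shape):
    """Sum a gradient over broadcast axes so it matches the variable's shape."""
    # sum away leading axes that broadcasting prepended
    while grad.ndim > len(shape):
        grad = grad.sum(axis=0)
    # sum (keeping dims) over axes where the variable has extent 1
    for axis, extent in enumerate(shape):
        if extent == 1 and grad.shape[axis] != 1:
            grad = grad.sum(axis=axis, keepdims=True)
    return grad

# singular bias: every duplicated use contributes one chain-rule error,
# and all of them are accumulated into the single bias value
dC = np.ones((2, 3, 4))                       # errors at the broadcast result
dbias = reduce_grad_to_shape(dC, (1, 1, 1))   # -> shape (1, 1, 1), value 24.0
assert dbias.shape == (1, 1, 1)
```

The same helper also covers the per-channel case, e.g. `reduce_grad_to_shape(dC, (1, 3, 1))` sums only over the batch and spatial axes.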