Open · breuera opened this issue 2 years ago
It appears that https://github.com/facebookresearch/fvcore/issues/69#issue-895213776 raises the same concern.
We count one fused multiply-add as one flop.
I'd consider this an unconventional definition. Even knowing this, I don't understand how the bias ops fit in.
Different groups adopt different conventions, unfortunately. We implemented the convention used in computer vision, which is to count MACs and ignore the flops of the bias.
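To make the two conventions concrete, here is a quick sketch using the shapes from the first linear layer discussed below (batch 64, in_features 784, out_features 512): one MAC corresponds to two flops, and the bias contributes one extra addition per output element.

```python
# Shapes taken from the first linear layer in the example below.
M, K, N = 64, 784, 512  # batch size, in_features, out_features

macs = M * N * K                        # one fused multiply-add per (output, k) pair
flops_matmul = 2 * M * N * K - M * N    # k multiplies + (k - 1) additions per output
flops_with_bias = flops_matmul + M * N  # bias: one extra addition per output

print(macs)             # 25690112
print(flops_with_bias)  # 51380224
```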
The matmul flop counts seem to be off by 2x.
I tested the code on a simple MLP which reads as:
Embedded this in some code with the crucial piece here:
This returns:
Let's take the first linear layer as an example: matrix A in https://pytorch.org/docs/stable/generated/torch.nn.Linear.html has shape (512, 784). Matrix x (since the example is batched) has shape (64, 784). Computing the result C = xA^T requires 2*64*512*784 - 64*512 floating point operations. However, in the example a bias is used, i.e., 64*512 additions on top -> 2*64*512*784 = 51,380,224 flops total; the tool reports 25,690,112 for the first layer. By the way, I am not sure why the bias doesn't show up separately.
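As a sanity check on the figures above (using only the numbers already stated in this comment), the reported count is exactly the MAC count, i.e. half the conventional flop count:

```python
M, K, N = 64, 784, 512  # batch, in_features, out_features of the first layer

# Matmul flops plus the bias additions, as derived above.
expected_flops = (2 * M * N * K - M * N) + M * N
reported = 25_690_112  # value fvcore reports for this layer

print(expected_flops)             # 51380224
print(expected_flops // reported) # 2 -> the tool is counting MACs, not flops
```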
I believe the code below is off, since the number of operations for C += A·B (using BLAS identifiers) is 2*M*N*K, not M*N*K:
https://github.com/facebookresearch/fvcore/blob/e4f0b3d1fa9ed610a5568932ab7aaf5a37cd75ca/fvcore/nn/jit_handles.py#L225
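For comparison, here is a minimal sketch (not fvcore's actual handle; the function names and signatures are hypothetical) of the BLAS-style count for C += A·B next to the MAC-style count the linked line effectively implements:

```python
def matmul_flops(m: int, k: int, n: int) -> int:
    # BLAS-style count for C += A @ B with A of shape (m, k), B of shape (k, n):
    # each of the m*n output elements needs k multiplies and k additions.
    return 2 * m * n * k

def matmul_macs(m: int, k: int, n: int) -> int:
    # MAC-style count (what the linked handle effectively returns):
    # one fused multiply-add per (output element, k) pair.
    return m * n * k

# First linear layer of the MLP above: x (64, 784) times A^T (784, 512).
print(matmul_flops(64, 784, 512))  # 51380224
print(matmul_macs(64, 784, 512))   # 25690112
```

The two differ by exactly the factor of 2 observed in the reported counts.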