The average latency is reduced from 13 clock cycles to 8 clock cycles.
The reduction is achieved by performing the initial normalization in one clock cycle rather than five. In addition, one clock cycle at the end of the calculation is removed (the "preoutput" state).
Interestingly, the utilization report shows that the number of registers is reduced (from 241 to 180), while the number of LUTs is slightly increased (from 648 to 670). The difference in resource usage is small.
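
As a rough check of the cycle count above, the following Python sketch models one possible way the normalization could take five cycles versus one. The 32-bit width, the staged shift-by-16/8/4/2/1 scheme, and the names normalize_staged and normalize_single are assumptions made for illustration only; the actual design may spend its cycles differently.

```python
WIDTH = 32  # assumed operand width, not stated in the description
MASK = (1 << WIDTH) - 1


def normalize_staged(x, width=WIDTH):
    """Left-align a nonzero x using one conditional shift stage per modelled cycle."""
    cycles = 0
    total = 0
    shift = width // 2
    while shift >= 1:
        # Shift only if the top `shift` bits are all zero.
        if x >> (width - shift) == 0:
            x = (x << shift) & MASK
            total += shift
        cycles += 1
        shift //= 2
    return x, total, cycles


def normalize_single(x, width=WIDTH):
    """Left-align a nonzero x in one modelled cycle (leading-zero count plus full shift)."""
    total = width - x.bit_length()
    return (x << total) & MASK, total, 1


old_latency = 13
_, _, staged_cycles = normalize_staged(1)
_, _, single_cycles = normalize_single(1)
# Four cycles saved in normalization, one more from dropping "preoutput".
new_latency = old_latency - (staged_cycles - single_cycles) - 1
print(staged_cycles, single_cycles, new_latency)
```

Both functions produce the same normalized value and shift amount; only the modelled cycle count differs. Doing the full shift combinationally in one cycle would be consistent with the slightly higher LUT count, though the description does not confirm that this is the cause.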