The performance output (4952 MFLOP/s) does not match the runtime per CL output:
(16 LUPs / 121 cy) × 34 FLOP/LUP × 2.2 Gcy/s = 9.89 GFLOP/s
This is almost exactly twice the reported performance number above. Based on my own benchmarks (and the performance analysis provided by the ECM model in kerncraft), I conclude that the cy/CL number is too small by a factor of 2, while the performance output is correct. I suspect this has to do with the Himeno benchmark using single-precision data.
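The arithmetic above can be checked with a few lines of Python. This is only a sketch of the consistency check, not code from kerncraft; the function and variable names are made up for illustration, and the inputs are the numbers quoted in this report.

```python
def gflops_from_cy_per_cl(lups_per_cl, cy_per_cl, flop_per_lup, clock_ghz):
    """Performance in GFLOP/s implied by a cycles-per-cache-line figure.

    Hypothetical helper: (LUP/CL) / (cy/CL) * (FLOP/LUP) * (Gcy/s).
    """
    return lups_per_cl / cy_per_cl * flop_per_lup * clock_ghz


# Numbers taken from the output quoted in this report.
LUPS_PER_CL = 16      # lattice updates per cache line
CY_PER_CL = 121       # reported cycles per cache line
FLOP_PER_LUP = 34     # floating-point operations per lattice update
CLOCK_GHZ = 2.2       # clock frequency in Gcy/s
REPORTED_GFLOPS = 4952 / 1000  # reported 4952 MFLOP/s

# Performance implied by the reported cy/CL value.
implied = gflops_from_cy_per_cl(LUPS_PER_CL, CY_PER_CL, FLOP_PER_LUP, CLOCK_GHZ)

# How far apart the two outputs are.
ratio = implied / REPORTED_GFLOPS

# Performance implied if the cy/CL value were twice as large.
doubled = gflops_from_cy_per_cl(LUPS_PER_CL, 2 * CY_PER_CL, FLOP_PER_LUP, CLOCK_GHZ)

print(f"implied by cy/CL:     {implied:.2f} GFLOP/s")
print(f"reported:             {REPORTED_GFLOPS:.2f} GFLOP/s")
print(f"ratio:                {ratio:.2f}")
print(f"with doubled cy/CL:   {doubled:.2f} GFLOP/s")
```

With these inputs the ratio comes out at about 2.00, and doubling the cy/CL value gives about 4.95 GFLOP/s, which agrees with the reported performance.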