Closed amonakov closed 2 years ago
The fix for issue #15 did not include a correction for 256-bit stores. Like 256-bit loads, they have half the throughput of their 128-bit SSE counterparts, so the following loop runs at two cycles per iteration:
loop:
    vmovaps [rdi], ymm0
    dec ecx
    jnz loop
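
For reference, the arithmetic behind the two-cycle figure can be sketched in C as below. This is only a rough model, not the tool's actual code: the helper names are hypothetical, and it assumes that a 128-bit store occupies the store pipe for one cycle, a 256-bit store for two, and that a loop with a taken branch cannot go below one cycle per iteration.

```c
#include <assert.h>

/* Hypothetical helper: cycles a single store keeps the store pipe busy.
 * Assumption: 128-bit stores take 1 cycle, 256-bit stores take 2
 * (i.e. half the throughput, as described in the report). */
static unsigned store_occupancy_cycles(unsigned width_bits)
{
    assert(width_bits == 128 || width_bits == 256);
    return width_bits / 128;
}

/* Hypothetical helper: estimated cycles per loop iteration when the only
 * candidate bottlenecks are the store pipe and the taken branch.
 * Assumption: the loop overhead (dec + jnz) costs at most 1 cycle and
 * overlaps with the stores. */
static unsigned loop_cycles_per_iteration(unsigned stores_per_iter,
                                          unsigned width_bits)
{
    unsigned store_bound = stores_per_iter * store_occupancy_cycles(width_bits);
    unsigned branch_bound = 1; /* one taken branch per iteration */
    return store_bound > branch_bound ? store_bound : branch_bound;
}
```

Under these assumptions, `loop_cycles_per_iteration(1, 256)` gives 2 for the loop above, while the same loop with a 128-bit store gives 1.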