andreas-abel / uiCA

uops.info Code Analyzer
GNU Affero General Public License v3.0

Simulation inaccuracy for 256-bit loads/stores on SNB #15

Closed amonakov closed 2 years ago

amonakov commented 2 years ago

On Sandy Bridge (and Ivy Bridge), 256-bit AVX loads and stores had half the throughput of their 128-bit SSE counterparts, so ideally uiCA should show that the following loop runs at 2 cycles per iteration:

loop:
vmovaps ymm0, [rsi]
vmovaps ymm0, [rsi]
dec ecx
jnz loop

(I guess this might not be straightforward to model: it's not the same as the load uop occupying port 2/3 for two cycles, because in the second cycle the port can still perform the store-address part of another store uop?)
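To make the distinction concrete, here is a minimal sketch of the behaviour described above. It is not uiCA code: the function name `cycles_needed`, the uop labels (`load256`, `load128`, `sta`), and the greedy oldest-first scheduling are all assumptions made for illustration. It models only ports 2 and 3 and ignores the `dec`/`jnz` pair, which execute on other ports.

```python
def cycles_needed(uops, ports=("p2", "p3")):
    """Greedily issue uops oldest-first on the two load/AGU ports.

    Hypothetical uop labels: 'load256', 'load128', 'sta' (store-address).
    Assumed rule: a 256-bit load keeps its port unavailable for *loads*
    in the following cycle, but a store-address uop may still use it.
    Returns the number of cycles until every uop has issued.
    """
    load_ok_from = {p: 0 for p in ports}  # first cycle a load may issue on p
    pending = list(uops)
    cycle = 0
    while pending:
        free = list(ports)  # each port accepts at most one uop per cycle
        waiting = []
        for uop in pending:
            # 'sta' only needs a free port; loads also need the port unblocked.
            port = next(
                (p for p in free if uop == "sta" or load_ok_from[p] <= cycle),
                None,
            )
            if port is None:
                waiting.append(uop)
                continue
            free.remove(port)
            if uop == "load256":
                load_ok_from[port] = cycle + 2  # blocked for loads next cycle
        pending = waiting
        cycle += 1
    return cycle


# 100 iterations of the loop body (two 256-bit loads each).
print(cycles_needed(["load256"] * 200))
# Same loads, plus two store-address uops per iteration filling the gaps.
print(cycles_needed(["load256", "load256", "sta", "sta"] * 100))
```

Under these assumptions the pure-load stream needs about 2 cycles per iteration, and adding two store-address uops per iteration does not make it slower, which is exactly what a plain "port busy for two cycles" model would get wrong.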

amonakov commented 2 years ago

Sorry, is the new comment accurate? It says,

# after a 256-bit load, no other load can be executed on the same port in the next cycle

but what the code seems to implement is "a 256-bit load makes both load ports unavailable for another load in the same cycle".
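The two readings can be compared with a small sketch. This is a hypothetical illustration, not the uiCA implementation: the function `issue_cycles` and the rule names `same_port_next_cycle` and `both_ports_same_cycle` are invented here, and the model takes each statement literally for a stream of 256-bit loads on two load ports.

```python
def issue_cycles(n_loads, rule):
    """Return the cycle in which each of n_loads 256-bit loads issues.

    rule == "same_port_next_cycle":  after a 256-bit load, the same port
        accepts no load in the next cycle (the wording of the comment).
    rule == "both_ports_same_cycle": a 256-bit load prevents any other
        load from issuing in the same cycle (the other reading).
    Both rule names are hypothetical labels for this sketch.
    """
    ports = ("p2", "p3")
    blocked_until = {p: 0 for p in ports}
    issued = []
    cycle = 0
    while len(issued) < n_loads:
        loads_this_cycle = 0
        for p in ports:
            if len(issued) == n_loads:
                break
            if rule == "same_port_next_cycle":
                if blocked_until[p] > cycle:
                    continue  # port still busy with the previous load
                blocked_until[p] = cycle + 2
            elif rule == "both_ports_same_cycle":
                if loads_this_cycle > 0:
                    continue  # another load already issued this cycle
            else:
                raise ValueError(rule)
            issued.append(cycle)
            loads_this_cycle += 1
        cycle += 1
    return issued


print(issue_cycles(4, "same_port_next_cycle"))   # [0, 0, 2, 2]
print(issue_cycles(4, "both_ports_same_cycle"))  # [0, 1, 2, 3]
```

In this simplified model both rules sustain one 256-bit load per cycle on average, so the loop above would come out at 2 cycles per iteration either way; the difference is in which cycles the loads issue, which could matter once other uops compete for the same ports.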