Closed amonakov closed 2 years ago
Sorry, is the new comment accurate? It says,
# after a 256-bit load, no other load can be executed on the same port in the next cycle
but what the code seems to implement is "a 256-bit load makes both load ports unavailable for another load in the same cycle":
On Sandy Bridge (and Ivy Bridge), 256-bit AVX loads and stores have half the throughput of their 128-bit SSE counterparts, so ideally uiCA should show that the following loop runs at 2 cycles per iteration:
(I guess this might not be straightforward to model: it's not the same as the load uop occupying port 2/3 for two cycles, because in the second cycle the port can still handle the store-address part of another store uop?)
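To make the expected numbers concrete, here is a rough toy model of the throughput claim. It is not uiCA code. The function name `cycles_per_iteration` and its parameters are made up for illustration, and it assumes a hypothetical loop body with two loads per iteration and nothing else competing for the two load ports. It deliberately ignores the store-address subtlety mentioned above.

```python
def cycles_per_iteration(loads_per_iter, load_busy_cycles, num_ports=2, iterations=1000):
    """Toy model (hypothetical, not uiCA's scheduler).

    Each load ties up the load path of one port for `load_busy_cycles`
    cycles: 1 for a 128-bit load, 2 for a 256-bit load on Sandy Bridge.
    Store-address uops are not modeled.
    """
    # free_at[p] is the first cycle at which port p can start another load.
    free_at = [0] * num_ports
    for _ in range(iterations * loads_per_iter):
        # Greedily issue the next load on the port that frees up first.
        port = min(range(num_ports), key=free_at.__getitem__)
        free_at[port] += load_busy_cycles
    # Steady-state estimate: total cycles divided by iterations.
    return max(free_at) / iterations


print(cycles_per_iteration(loads_per_iter=2, load_busy_cycles=1))  # 128-bit loads
print(cycles_per_iteration(loads_per_iter=2, load_busy_cycles=2))  # 256-bit loads
```

Under these assumptions the 128-bit case comes out at 1 cycle per iteration and the 256-bit case at 2. A more faithful model would keep the port's address-generation part free in the second cycle, so that a store-address uop could still issue there.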