andreas-abel / nanoBench

A tool for running small microbenchmarks on recent Intel and AMD x86 CPUs.
http://www.uops.info
GNU Affero General Public License v3.0

pinsrw latency overestimated(?) because dep chain competes for the same port #23

Closed pcordes closed 3 years ago

pcordes commented 3 years ago

The experiments at https://uops.info/html-lat/SKX/PINSRW_XMM_R32_I8-Measurements.html#lat1-%3E1 only use pinsrw xmm, r32, imm alone, or pinsrw with an XMM->XMM dep chain created by shufpd or pshufd.

But pinsrw itself is 2 uops for port 5 on Intel, presumably a movd-equivalent uop feeding a 2-input shuffle. One would expect that the GP->XMM (movd) uop could run early if there was a free port, leaving a critical-path latency of only 1 cycle for 1->1.

But port 5 resource conflicts with the dep chain instruction prevent this from being demonstrated. Perhaps pand xmm0, xmm0 would be a better choice for at least one of the experiments, or orps xmm0, xmm0. (I guess shufpd and pshufd are looking for bypass latency between integer and FP shuffles?)
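
To make the argument concrete, here is a toy bottleneck model in Python. It is only a sketch under stated assumptions, not a cycle-accurate simulation and not how nanoBench computes latency. It assumes pinsrw is 2 port 5 uops with only the shuffle uop on the XMM->XMM path, that the chain instruction has 1 cycle latency, that pshufd needs port 5, that pand gets scheduled on some other port, and that there is no extra bypass delay. The function names are made up for illustration.

```python
# Toy bottleneck model. All uop counts, ports, and latencies are
# assumptions taken from the discussion above, not measured values.

PINSRW_P5_UOPS = 2    # assumed: movd-like uop + 2-input shuffle uop, both port 5
PINSRW_CHAIN_LAT = 1  # hypothesis: only the shuffle uop is on the xmm->xmm path


def cycles_per_iteration(chain_latency, p5_only_uops):
    """Steady-state bound for one loop iteration: limited either by the
    loop-carried dependency chain or by the uops that can only issue on
    port 5 (at most one per cycle)."""
    return max(chain_latency, p5_only_uops)


def apparent_pinsrw_latency(chain_instr_latency, chain_instr_p5_uops):
    """Latency attributed to pinsrw after subtracting the known latency
    of the instruction used to close the dependency chain."""
    total = cycles_per_iteration(
        PINSRW_CHAIN_LAT + chain_instr_latency,
        PINSRW_P5_UOPS + chain_instr_p5_uops,
    )
    return total - chain_instr_latency


# pshufd: 1 cycle latency, assumed to need port 5 (competes with pinsrw)
with_pshufd = apparent_pinsrw_latency(chain_instr_latency=1, chain_instr_p5_uops=1)
# pand: 1 cycle latency, assumed to be scheduled on a port other than 5
with_pand = apparent_pinsrw_latency(chain_instr_latency=1, chain_instr_p5_uops=0)

print("apparent latency with pshufd chain:", with_pshufd)
print("apparent latency with pand chain:  ", with_pand)
```

Under these assumptions the pshufd chain is throughput-bound on port 5 (3 uops per iteration), so pinsrw appears to have 2 cycles of latency, while the pand chain is latency-bound and would expose the hypothesized 1 cycle. Whether real hardware behaves this way is exactly what the proposed experiment would have to show.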