andreas-abel / nanoBench

A tool for running small microbenchmarks on recent Intel and AMD x86 CPUs.
http://www.uops.info
GNU Affero General Public License v3.0

vpternlogd latencies on Zen4 #29

Open amonakov opened 1 year ago

amonakov commented 1 year ago

On Zen 4, the summary of the vpternlogd latency experiments is given as

Latency operand 1 → 1: 1
Latency operand 2 → 1: 2
Latency operand 3 → 1: 1

https://uops.info/html-lat/ZEN4/VPTERNLOGD_ZMM_ZMM_ZMM_I8-Measurements.html

but I don't see a substantial difference between the 3 → 1 and 2 → 1 experiments, or any difference w.r.t. its vpternlogq sibling, where all latencies are listed as 1. Shouldn't both the dword and qword variants be listed with latency 2 for operands 2 and 3? What am I missing?

If I'm reading Agner's testing harness right, his latency experiment times

vpternlogd zmm0, zmm1, zmm2
vpternlogd zmm2, zmm1, zmm0

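repeated 50 times. He lists the latency of ternlog on Zen 4 as 1 cycle in all cases (but if the latency from the second operand is indeed 2, his experiment wouldn't uncover that, since zmm1 is never written and the dependency chain runs only through operands 1 and 3).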
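To illustrate why that loop cannot see the second operand, here is a toy dependency-chain model in Python. It is only a sketch, not how nanoBench or Agner's harness measures anything. It assumes unlimited execution resources and a fixed latency per operand position, and it ignores the immediate operand. The function name `cycles_per_instruction`, the latency tables, and the `swapped` sequence are all made up for illustration.

```python
def cycles_per_instruction(program, latency, reps=50):
    """Steady-state cycles per instruction for a repeated sequence.

    program: list of (op1, op2, op3) register names; op1 is the destination
             and is also read, since vpternlog reads its destination.
    latency: dict mapping operand position (1, 2, 3) to the ASSUMED latency
             from that operand to the result.
    """
    ready = {}        # register name -> cycle at which its value is available
    end_of_rep = []   # latest completion time after each repetition
    for _ in range(reps):
        for ops in program:
            # The result is ready once every input has arrived, each input
            # delayed by the latency assumed for its operand position.
            done = max(ready.get(reg, 0) + latency[pos]
                       for pos, reg in enumerate(ops, start=1))
            ready[ops[0]] = done
        end_of_rep.append(max(ready.values()))
    # Use only the second half of the run so start-up effects cancel out.
    half = reps // 2
    elapsed = end_of_rep[-1] - end_of_rep[half - 1]
    return elapsed / (len(program) * (reps - half))


# The two-instruction sequence from the harness, as read above.
agner = [("zmm0", "zmm1", "zmm2"), ("zmm2", "zmm1", "zmm0")]
# Hypothetical variant with source operands 2 and 3 swapped.
swapped = [("zmm0", "zmm2", "zmm1"), ("zmm2", "zmm0", "zmm1")]

all_one = {1: 1, 2: 1, 3: 1}      # latencies as listed for vpternlogq
slow_second = {1: 1, 2: 2, 3: 1}  # latencies as listed for vpternlogd

print(cycles_per_instruction(agner, all_one))
print(cycles_per_instruction(agner, slow_second))
print(cycles_per_instruction(swapped, slow_second))
```

In this model the original sequence gives the same steady-state figure for both latency tables, because zmm1 never changes. Only the swapped sequence, which routes the chain through operand 2, reacts to the assumed 2-cycle latency.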

(Unfortunately, I do not have access to a Zen 4 machine to run more experiments.)