betatim opened 7 years ago
This is what you get by running `perf` on the whole of a big, complex script from khmer:
```
 Performance counter stats for 'python scripts/abundance-dist-single-threaded.py -s -x 1e8 -N 4 -k 17 -z ecoli_ref-5m.fastq /tmp/test.dist':

      78449.011717      task-clock:u (msec)       #    1.145 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
             9,409      page-faults:u             #    0.120 K/sec
   261,184,636,220      cycles:u                  #    3.329 GHz
   170,206,255,740      instructions:u            #    0.65  insn per cycle
    35,950,481,141      branches:u                #  458.266 M/sec
       592,863,219      branch-misses:u           #    1.65% of all branches
    60,473,091,195      L1-dcache-loads:u         #  770.859 M/sec
     2,261,330,633      L1-dcache-load-misses:u   #    3.74% of all L1-dcache hits
     1,631,332,334      LLC-loads:u               #   20.795 M/sec
     1,585,233,019      LLC-load-misses:u         #   97.17% of all LL-cache hits

      68.530882400 seconds time elapsed
```
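For reference, this counter set matches what `perf stat -d` reports (the detailed `-d` flag adds the L1-dcache and LLC events to the defaults; the `:u` suffix means only user-space events were counted). Assuming the command shown in the header, the invocation would have looked something like:

```
perf stat -d python scripts/abundance-dist-single-threaded.py \
    -s -x 1e8 -N 4 -k 17 -z ecoli_ref-5m.fastq /tmp/test.dist
```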
It is somewhat similar to the emulated BF case, but I need to do more thinking and tinkering before I have any ideas about what we learn from this. One thing we do learn is that we execute less than one instruction per CPU cycle (0.65 insn per cycle, where a modern out-of-order core can retire several) 😢
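For intuition on why the whole-script profile might resemble the BF emulation: a Bloom-filter lookup probes k pseudo-random bit positions per query, so its memory traffic is essentially the random access pattern repeated k times. A minimal sketch — not the gist's actual code; the double-hashing scheme, mixing constants, and sizes are illustrative assumptions:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// One Bloom-filter membership test: k pseudo-random probes into a bitset.
// Each probe likely lands on a cold cache line, so the access pattern is
// the "random" benchmark, k times per query.
bool bf_contains(const std::vector<uint64_t>& bits, std::size_t num_bits,
                 uint64_t key, int k_hashes) {
    // Double hashing: h_i(key) = h1 + i * h2 (mod num_bits).
    uint64_t h1 = key * 0x9E3779B97F4A7C15ull;
    uint64_t h2 = (key ^ (key >> 33)) * 0xC2B2AE3D27D4EB4Full;
    for (int i = 0; i < k_hashes; ++i) {
        std::size_t bit = (h1 + uint64_t(i) * h2) % num_bits;
        if (!((bits[bit / 64] >> (bit % 64)) & 1)) return false;
    }
    return true;
}

int main() {
    const std::size_t kBits = std::size_t(1) << 30;  // far larger than the LLC
    std::vector<uint64_t> bits(kBits / 64, ~0ull);   // all bits set: every probe runs
    uint64_t hits = 0;
    for (uint64_t key = 0; key < 10000000; ++key)
        hits += bf_contains(bits, kBits, key, 4);
    std::printf("%llu\n", (unsigned long long)hits);
}
```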
I'm having trouble understanding the output: to me it looks like random has fewer cache misses than linear?! I'm looking at `LLC-load-misses:u`...
These are notes from experiments exploring the tools for understanding cache misses and the like.
Most of the knowledge about `perf` comes from https://www.youtube.com/watch?v=nXaxk27zwlk. Code and instructions if you are on Linux: https://gist.github.com/betatim/181595d17320012945baef3386e09bc5
The code contains three benchmarks: one accesses elements of a bitset at random, one accesses them linearly, and one emulates a BF. I used the first two to understand the output of `perf`; a minimal sketch of the two access patterns is below.
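Roughly, the two access patterns look like this — a sketch only, not the gist's actual code; the bitset size and iteration count are made up:

```cpp
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    const std::size_t kBits = std::size_t(1) << 30;  // 2^30 bits = ~128 MB, far larger than the LLC
    std::vector<uint64_t> bits(kBits / 64, 0);

    std::mt19937_64 rng(42);
    std::uniform_int_distribution<std::size_t> pick(0, kBits - 1);

    const std::size_t kIters = 100000000;
    uint64_t sum = 0;

    // Linear: consecutive bits share cache lines and the hardware
    // prefetcher can stay ahead -> almost no misses, high IPC.
    for (std::size_t i = 0; i < kIters; ++i)
        sum += (bits[i / 64] >> (i % 64)) & 1;

    // Random: nearly every probe touches a cold cache line -> lots of LLC misses.
    for (std::size_t i = 0; i < kIters; ++i) {
        std::size_t bit = pick(rng);
        sum += (bits[bit / 64] >> (bit % 64)) & 1;
    }

    std::printf("%llu\n", (unsigned long long)sum);  // keep the loops from being optimized away
}
```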
For the linear access pattern you expect nearly no cache misses at any level, a high number of instructions per cycle, etc. For the random access pattern things should get worse. This is what I see: