dib-lab / khmer

In-memory nucleotide sequence k-mer counting, filtering, graph traversal and more
http://khmer.readthedocs.io/

Notes on cache misses #1557

Open · betatim opened 7 years ago

betatim commented 7 years ago

These are notes from experimenting with, and exploring, the tools for understanding cache misses and the like.

Most of the knowledge about perf comes from https://www.youtube.com/watch?v=nXaxk27zwlk

Code and instructions, if you are on Linux: https://gist.github.com/betatim/181595d17320012945baef3386e09bc5

The code contains three benchmarks: one accesses elements of a bitset at random, one accesses them linearly, and one emulates a Bloom filter (BF). I used the first two to understand the output of perf. For the linear access pattern you expect almost no cache misses at any level, a high number of instructions per cycle, and so on. For the random access pattern things should get worse.
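For reference, the linear and random access benchmarks are presumably along these lines (a minimal sketch using google benchmark; the real code, bitset size and iteration setup live in the gist above, so treat everything here as illustrative):

```cpp
// Sketch only: names, table size and RNG choice are mine, not the gist's.
#include <benchmark/benchmark.h>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

static void linear(benchmark::State& state) {
    std::vector<uint8_t> bits(1 << 26, 0);      // table larger than the caches
    size_t i = 0;
    for (auto _ : state) {
        benchmark::DoNotOptimize(bits[i]);      // sequential walk, prefetcher-friendly
        i = (i + 1) % bits.size();
    }
}
BENCHMARK(linear);

static void random_access(benchmark::State& state) {
    std::vector<uint8_t> bits(1 << 26, 0);
    std::mt19937_64 rng(42);
    std::uniform_int_distribution<size_t> idx(0, bits.size() - 1);
    for (auto _ : state) {
        benchmark::DoNotOptimize(bits[idx(rng)]);  // scattered loads defeat the prefetcher
    }
}
BENCHMARK(random_access);

BENCHMARK_MAIN();
```

The counter dumps below come from wrapping each run in perf, something like `perf stat -d ./bench --benchmark_filter=linear`; the `:u` suffix means only user-space events were counted.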

This is what I see:

 Performance counter stats for './bench --benchmark_filter=linear':

       1011.369516      task-clock:u (msec)       #    0.995 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
             1,614      page-faults:u             #    0.002 M/sec                  
     3,422,721,010      cycles:u                  #    3.384 GHz                    
    13,666,851,946      instructions:u            #    3.99  insn per cycle         
       920,682,115      branches:u                #  910.332 M/sec                  
            43,495      branch-misses:u           #    0.00% of all branches        
     3,626,905,207      L1-dcache-loads:u         # 3586.133 M/sec                  
       118,531,923      L1-dcache-load-misses:u   #    3.27% of all L1-dcache hits  
         2,394,737      LLC-loads:u               #    2.368 M/sec                  
           421,120      LLC-load-misses:u         #   17.59% of all LL-cache hits   

       1.016862835 seconds time elapsed

 Performance counter stats for './bench --benchmark_filter=.random':

        901.219629      task-clock:u (msec)       #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
             1,616      page-faults:u             #    0.002 M/sec                  
     3,113,334,628      cycles:u                  #    3.455 GHz                    
     2,364,562,756      instructions:u            #    0.76  insn per cycle         
       165,446,118      branches:u                #  183.580 M/sec                  
            36,255      branch-misses:u           #    0.02% of all branches        
       616,877,989      L1-dcache-loads:u         #  684.492 M/sec                  
       313,212,057      L1-dcache-load-misses:u   #   50.77% of all L1-dcache hits  
       292,595,618      LLC-loads:u               #  324.666 M/sec                  
        17,022,205      LLC-load-misses:u         #    5.82% of all LL-cache hits   

       0.901669825 seconds time elapsed
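The third benchmark, whose counters follow, emulates a BF lookup. It presumably does something like the sketch below (not the gist's actual code; the table size and the number of hash probes per query are illustrative guesses):

```cpp
// Sketch of a Bloom-filter-style access pattern: each lookup probes n_hash
// pseudo-random positions in a table much larger than the LLC, so most
// probes miss every cache level. All constants here are illustrative.
#include <benchmark/benchmark.h>
#include <cstddef>
#include <cstdint>
#include <vector>

static void bf(benchmark::State& state) {
    const size_t n_hash = 4;
    std::vector<uint8_t> table(size_t(1) << 28, 0);  // 256 MB table
    uint64_t kmer = 0;
    for (auto _ : state) {
        ++kmer;                                      // stand-in for hashing the next k-mer
        for (size_t i = 0; i < n_hash; ++i) {
            // cheap mixing so successive probes land on unrelated cache lines
            uint64_t h = (kmer + i) * 0x9E3779B97F4A7C15ULL;
            benchmark::DoNotOptimize(table[h % table.size()]);
        }
    }
}
BENCHMARK(bf);

BENCHMARK_MAIN();
```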

 Performance counter stats for './bench --benchmark_filter=bf/4k':

        806.866484      task-clock:u (msec)       #    0.993 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
             7,326      page-faults:u             #    0.009 M/sec                  
     2,704,191,301      cycles:u                  #    3.351 GHz                    
     1,455,705,177      instructions:u            #    0.54  insn per cycle         
       165,821,054      branches:u                #  205.512 M/sec                  
           134,859      branch-misses:u           #    0.08% of all branches        
       315,589,142      L1-dcache-loads:u         #  391.129 M/sec                  
        18,431,458      L1-dcache-load-misses:u   #    5.84% of all L1-dcache hits  
        15,961,173      LLC-loads:u               #   19.782 M/sec                  
        12,520,608      LLC-load-misses:u         #   78.44% of all LL-cache hits   

       0.812859649 seconds time elapsed
betatim commented 7 years ago

This is what you get by running perf on the whole of a big, complex khmer script:

 Performance counter stats for 'python scripts/abundance-dist-single-threaded.py -s -x 1e8 -N 4 -k 17 -z ecoli_ref-5m.fastq /tmp/test.dist':

      78449.011717      task-clock:u (msec)       #    1.145 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
             9,409      page-faults:u             #    0.120 K/sec                  
   261,184,636,220      cycles:u                  #    3.329 GHz                    
   170,206,255,740      instructions:u            #    0.65  insn per cycle         
    35,950,481,141      branches:u                #  458.266 M/sec                  
       592,863,219      branch-misses:u           #    1.65% of all branches        
    60,473,091,195      L1-dcache-loads:u         #  770.859 M/sec                  
     2,261,330,633      L1-dcache-load-misses:u   #    3.74% of all L1-dcache hits  
     1,631,332,334      LLC-loads:u               #   20.795 M/sec                  
     1,585,233,019      LLC-load-misses:u         #   97.17% of all LL-cache hits   

      68.530882400 seconds time elapsed

It is somewhat similar to the emulated BF case, but I need to do more thinking and tinkering before I have any ideas about what we can learn from this. One thing we do learn is that we execute fewer than one instruction per CPU cycle 😢

ctb commented 7 years ago

I'm having trouble understanding the output - to me it looks like random has fewer cache misses than linear?! I'm looking at LLC-load-misses:u...