Confirm-Solutions / imprint

The Imprint Project
BSD 3-Clause "New" or "Revised" License

fit_driver_example.py performance smoke test #4

Closed: gjmulder closed this issue 2 years ago

gjmulder commented 2 years ago

commit c4a884980340865a73a01795df4489e56963ef51:

#######################################################
# 2 core, 4 hypercore, 3.3GHz Intel i7

$ egrep "model name" /proc/cpuinfo | sort -u
model name  : Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz

$ time python ./examples/fit_driver_example.py 
[[0.0358  0.01873 0.01836 ... 0.      0.      0.     ]]
[[0.      0.      0.      ... 0.01456 0.01452 0.01445]]
[[0.01442 0.01437 0.01429 ... 0.03112 0.03361 0.03654]]

real    1m31.970s
user    5m33.231s
sys 0m1.091s

$ time python ./examples/fit_driver_example.py 
[[0.0358  0.01873 0.01836 ... 0.      0.      0.     ]]
[[0.      0.      0.      ... 0.01456 0.01452 0.01445]]
[[0.01442 0.01437 0.01429 ... 0.03112 0.03361 0.03654]]

real    1m29.371s
user    5m40.751s
sys 0m0.524s

#######################################################
# 16 core, 32 hypercore, 3.4GHz AMD Ryzen

$ egrep "model name" /proc/cpuinfo | sort -u
model name  : AMD Ryzen Threadripper 1950X 16-Core Processor

$ time python ./examples/fit_driver_example.py 
[[0.03527 0.01846 0.01816 ... 0.      0.      0.     ]]
[[0.      0.      0.      ... 0.01366 0.01365 0.01361]]
[[0.01357 0.01351 0.01349 ... 0.03041 0.03298 0.03597]]

real    0m10.832s
user    4m49.481s
sys 0m2.209s

$ time python ./examples/fit_driver_example.py 
[[0.03527 0.01846 0.01816 ... 0.      0.      0.     ]]
[[0.      0.      0.      ... 0.01366 0.01365 0.01361]]
[[0.01357 0.01351 0.01349 ... 0.03041 0.03298 0.03597]]

real    0m10.789s
user    4m49.607s
sys 0m2.152s

#######################################################
# 32 core, 64 hypercore, 3GHz AMD Ryzen

$ egrep "model name" /proc/cpuinfo | sort -u
model name  : AMD Ryzen Threadripper 2990WX 32-Core Processor

$ time python ./examples/fit_driver_example.py 
[[3.591e-02 1.868e-02 1.837e-02 ... 1.000e-05 1.000e-05 0.000e+00]]
[[0.      0.      0.      ... 0.0142  0.01419 0.01416]]
[[0.01411 0.01406 0.01403 ... 0.03106 0.03342 0.03624]]

real    0m7.611s
user    5m0.186s
sys 0m6.661s

$ time python ./examples/fit_driver_example.py 
[[3.591e-02 1.868e-02 1.837e-02 ... 1.000e-05 1.000e-05 0.000e+00]]
[[0.      0.      0.      ... 0.0142  0.01419 0.01416]]
[[0.01411 0.01406 0.01403 ... 0.03106 0.03342 0.03624]]

real    0m7.608s
user    5m3.818s
sys 0m6.291s
JamesYang007 commented 2 years ago

Woohoo

JamesYang007 commented 2 years ago

Seems like 16 core, 32 hypercore is kind of a sweet-spot for getting the most bang for buck on each processor based on user time?
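A back-of-the-envelope check of that, using the second run from each machine above (the timings are copied from the logs; "efficiency" here is just user/real divided by thread count, so it's only a rough utilization figure):

```python
# Parallel-efficiency comparison from the real/user times reported above
# (second run on each machine).
timings = {
    # machine: (threads, real_secs, user_secs)
    "i7-7500U (2c/4t)":    (4,  89.4, 340.8),   # 1m29.371s / 5m40.751s
    "TR 1950X (16c/32t)":  (32, 10.8, 289.6),   # 0m10.789s / 4m49.607s
    "TR 2990WX (32c/64t)": (64,  7.6, 303.8),   # 0m7.608s  / 5m3.818s
}

for name, (threads, real, user) in timings.items():
    # user/real = average number of CPUs kept busy; dividing by the
    # thread count gives a rough utilization/efficiency figure.
    cpus_busy = user / real
    efficiency = cpus_busy / threads
    print(f"{name}: {cpus_busy:5.1f} CPUs busy, {efficiency:5.1%} of threads")
```

The 1950X keeps ~84% of its 32 threads busy versus ~62% on the 2990WX's 64, which is consistent with the sweet-spot reading.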

gjmulder commented 2 years ago

My 32 core Ryzen is a bit of a quasimodo. I believe it is two 16 core Ryzens glued together sharing the same 16 core memory channel(s). It might, however, indicate whether per-core performance scales linearly with the raw number of cores. Clearly it doesn't, currently.

Also: 16 cores at 3.4 GHz versus 32 cores at 3 GHz.

gjmulder commented 2 years ago

32 core, 64 hypercore, 3GHz AMD Ryzen, 1E8 sim_size:

$ git diff
diff --git a/python/examples/fit_driver_example.py b/python/examples/fit_driver_example.py
index f0a2562..13f226b 100644
--- a/python/examples/fit_driver_example.py
+++ b/python/examples/fit_driver_example.py
@@ -7,7 +7,7 @@ import timeit

 # ========== Toggleable ===============
 n_arms = 3      # prioritize 3 first, then do 4
-sim_size = 100000
+sim_size = 1E8
 n_thetas_1d = 64
 n_threads = os.cpu_count()
 max_batch_size = 64000

$ time python ./examples/fit_driver_example.py 
[[3.501014e-02 1.808173e-02 1.774758e-02 ... 5.300000e-07 5.200000e-07
  4.900000e-07]]
[[4.700000e-07 4.500000e-07 4.200000e-07 ... 1.387540e-02 1.384335e-02
  1.380975e-02]]
[[0.01377296 0.01373336 0.01369075 ... 0.03004029 0.03243357 0.03535721]]

real    86m29.883s
user    5494m43.920s
sys 0m49.086s
JamesYang007 commented 2 years ago

Huh, that's interesting - the program scales linearly with sim_size, so I would've just expected 100x slower, but this is insanely slower. sim_size also doesn't change the memory allocation amount, so it can't be a cache invalidation thing.

EDIT: nvm, I'm an idiot. It should be 1000x slower, but this is actually 33% faster than expected. That's wild af lol

EDIT: user time checks out, it's about 1000x different. The real time is significantly smaller than 1000x hmm..
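A quick sanity check on those ratios, using the timings quoted above (2990WX second run at sim_size=1e5 as the baseline):

```python
# sim_size went from 1e5 to 1e8 (1000x), so with linear scaling we'd
# expect ~1000x the baseline times. Baseline: 2990WX, second run.
base_real, base_user = 7.608, 5 * 60 + 3.818            # 0m7.608s / 5m3.818s
big_real, big_user = 86 * 60 + 29.883, 5494 * 60 + 43.920  # 86m29.883s / 5494m43.920s

real_ratio = big_real / base_real    # wall-clock scaling
user_ratio = big_user / base_user    # CPU-time scaling

print(f"user time scaled {user_ratio:.0f}x (close to the expected 1000x)")
print(f"real time scaled {real_ratio:.0f}x (well under 1000x)")
```

So user time scaled ~1085x while wall-clock scaled only ~682x, i.e. roughly a third better than linear, matching the 33% figure above.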

gjmulder commented 2 years ago

I'm more tinkering to see how stable the benchmarking is, and learning how the code fits on my h/w. Code is looking rock solid from the outside!

32 core, 64 hypercore, 3GHz AMD Ryzen, sim_size = 1E6, n_arms = 4:

$ git diff
diff --git a/python/examples/fit_driver_example.py b/python/examples/fit_driver_example.py
index f0a2562..3166bcd 100644
--- a/python/examples/fit_driver_example.py
+++ b/python/examples/fit_driver_example.py
@@ -6,8 +6,8 @@ import os
 import timeit

 # ========== Toggleable ===============
-n_arms = 3      # prioritize 3 first, then do 4
-sim_size = 100000
+n_arms = 4      # prioritize 3 first, then do 4
+sim_size = 1E6
 n_thetas_1d = 64
 n_threads = os.cpu_count()
 max_batch_size = 64000
top - 23:11:55 up 24 days,  9:19,  2 users,  load average: 58.26, 58.41, 53.50
Tasks: 872 total,  65 running, 492 sleeping,   0 stopped,   0 zombie
%Cpu(s): 97.9 us,  1.7 sy,  0.0 ni,  0.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 65881328 total, 48181740 free,  6126788 used, 11572800 buff/cache
KiB Swap: 26843545+total, 26843545+free,        0 used. 59065844 avail Mem 

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                             
 61639 mulderg   20   0 2541600 2.141g   5268 R 100.0  3.4   0:21.43 python ./examples/fit_driver_example.py                                                                             
 61659 mulderg   20   0 2541600 2.141g   5268 R 100.0  3.4   0:21.55 python ./examples/fit_driver_example.py                                                                             
 61685 mulderg   20   0 2541600 2.141g   5268 R 100.0  3.4   0:21.23 python ./examples/fit_driver_example.py                                                                             
 61640 mulderg   20   0 2541600 2.141g   5268 R 100.0  3.4   0:21.17 python ./examples/fit_driver_example.py                                                                             
 61642 mulderg   20   0 2541600 2.141g   5268 R 100.0  3.4   0:21.57 python ./examples/fit_driver_example.py    
$ time python ./examples/fit_driver_example.py
[[4.1934e-02 2.8729e-02 2.8318e-02 ... 2.0000e-05 2.0000e-05 1.6000e-05]]
..
[[3.5e-05 3.5e-05 3.6e-05 ... 6.1e-05 5.7e-05 5.6e-05]]
[[1.38e-04 1.38e-04 1.38e-04 ... 2.00e-06 2.00e-06 2.00e-06]]
[[1.0e-05 1.2e-05 1.4e-05 ... 0.0e+00 0.0e+00 0.0e+00]]
[[0.       0.       0.       ... 0.038388 0.040173 0.042355]]

real    80m41.974s
user    4471m47.979s
sys 104m32.731s
JamesYang007 commented 2 years ago

B) though I'll be honest, I'm actually not sure what to read 😳 which parts indicate it's rock solid?

gjmulder commented 2 years ago
$ git diff
diff --git a/python/examples/fit_driver_example.py b/python/examples/fit_driver_example.py
index f0a2562..bd0256f 100644
--- a/python/examples/fit_driver_example.py
+++ b/python/examples/fit_driver_example.py
@@ -7,7 +7,7 @@ import timeit

 # ========== Toggleable ===============
 n_arms = 3      # prioritize 3 first, then do 4
-sim_size = 100000
+sim_size = 1E6
 n_thetas_1d = 64
 n_threads = os.cpu_count()
 max_batch_size = 64000
$ perf stat -d python ./examples/fit_driver_example.py
[[3.5144e-02 1.8238e-02 1.7895e-02 ... 1.0000e-06 1.0000e-06 0.0000e+00]]
[[0.       0.       0.       ... 0.014008 0.013977 0.013935]]
[[0.013899 0.013865 0.013814 ... 0.030403 0.03281  0.035805]]

 Performance counter stats for 'python ./examples/fit_driver_example.py':

    3234148.231872      task-clock (msec)         #   59.745 CPUs utilized          
            93,351      context-switches          #    0.029 K/sec                  
             2,509      cpu-migrations            #    0.001 K/sec                  
         2,493,394      page-faults               #    0.771 K/sec                  
10,420,532,143,891      cycles                    #    3.222 GHz                      (62.50%)
    51,251,324,129      stalled-cycles-frontend   #    0.49% frontend cycles idle     (62.50%)
 2,240,528,360,079      stalled-cycles-backend    #   21.50% backend cycles idle      (62.50%)
15,769,239,963,401      instructions              #    1.51  insn per cycle         
                                                  #    0.14  stalled cycles per insn  (62.50%)
 1,716,957,372,643      branches                  #  530.884 M/sec                    (62.50%)
    14,894,850,349      branch-misses             #    0.87% of all branches          (62.50%)
 4,555,392,000,568      L1-dcache-loads           # 1408.529 M/sec                    (62.50%)
    70,112,129,370      L1-dcache-load-misses     #    1.54% of all L1-dcache hits    (62.50%)
   <not supported>      LLC-loads                                                   
   <not supported>      LLC-load-misses                                             

      54.132595974 seconds time elapsed
$ perf stat -d python ./examples/fit_driver_example.py
[[3.5144e-02 1.8238e-02 1.7895e-02 ... 1.0000e-06 1.0000e-06 0.0000e+00]]
[[0.       0.       0.       ... 0.014008 0.013977 0.013935]]
[[0.013899 0.013865 0.013814 ... 0.030403 0.03281  0.035805]]

 Performance counter stats for 'python ./examples/fit_driver_example.py':

    3254490.908910      task-clock (msec)         #   59.617 CPUs utilized          
            99,789      context-switches          #    0.031 K/sec                  
             2,543      cpu-migrations            #    0.001 K/sec                  
         2,501,907      page-faults               #    0.769 K/sec                  
10,449,776,185,505      cycles                    #    3.211 GHz                      (62.50%)
    52,475,738,295      stalled-cycles-frontend   #    0.50% frontend cycles idle     (62.50%)
 2,243,677,564,803      stalled-cycles-backend    #   21.47% backend cycles idle      (62.50%)
15,770,933,860,035      instructions              #    1.51  insn per cycle         
                                                  #    0.14  stalled cycles per insn  (62.50%)
 1,717,186,026,598      branches                  #  527.636 M/sec                    (62.50%)
    14,828,360,464      branch-misses             #    0.86% of all branches          (62.50%)
 4,555,580,279,731      L1-dcache-loads           # 1399.783 M/sec                    (62.50%)
    70,064,650,202      L1-dcache-load-misses     #    1.54% of all L1-dcache hits    (62.50%)
   <not supported>      LLC-loads                                                   
   <not supported>      LLC-load-misses                                             

      54.590368445 seconds time elapsed
JamesYang007 commented 2 years ago

God DAMN idk bout you but the branch/cache misses are really fkin low to me

gjmulder commented 2 years ago

32 * 3.4GHz hypercores, 1E6 sims:

$ perf stat -d python ./examples/fit_driver_example.py
[[0.035085 0.01819  0.017818 ... 0.       0.       0.      ]]
[[0.       0.       0.       ... 0.013945 0.013913 0.013879]]
[[0.013835 0.013795 0.013745 ... 0.030187 0.032635 0.035581]]

 Performance counter stats for 'python ./examples/fit_driver_example.py':

    2951151.967878      task-clock (msec)         #   30.677 CPUs utilized          
            47,043      context-switches          #    0.016 K/sec                  
               536      cpu-migrations            #    0.000 K/sec                  
           781,589      page-faults               #    0.265 K/sec                  
10,430,069,052,375      cycles                    #    3.534 GHz                      (62.50%)
    48,118,231,695      stalled-cycles-frontend   #    0.46% frontend cycles idle     (62.50%)
 2,199,289,555,202      stalled-cycles-backend    #   21.09% backend cycles idle      (62.50%)
15,755,215,709,858      instructions              #    1.51  insn per cycle         
                                                  #    0.14  stalled cycles per insn  (62.50%)
 1,714,743,589,563      branches                  #  581.042 M/sec                    (62.50%)
    14,805,909,258      branch-misses             #    0.86% of all branches          (62.50%)
 4,548,830,314,369      L1-dcache-loads           # 1541.374 M/sec                    (62.50%)
    69,930,776,088      L1-dcache-load-misses     #    1.54% of all L1-dcache hits    (62.50%)
   <not supported>      LLC-loads                                                   
   <not supported>      LLC-load-misses                                             

      96.199914237 seconds time elapsed
JamesYang007 commented 2 years ago

Curious if you could measure L2/L3? Or is this suggesting that that's not even relevant to look at cuz L1 is being used so efficiently?

gjmulder commented 2 years ago

16 * 3.4GHz hypercores, 1E6 sims, hyperthreading roughly giving us a 20% bump in speed:

$ perf stat -d python ./examples/fit_driver_example.py
[[0.03506  0.018187 0.01781  ... 0.       0.       0.      ]]
[[0.       0.       0.       ... 0.014094 0.014067 0.014031]]
[[0.013997 0.013952 0.013903 ... 0.030358 0.0328   0.035719]]

 Performance counter stats for 'python ./examples/fit_driver_example.py':

    1774101.453944      task-clock (msec)         #   15.682 CPUs utilized          
            17,955      context-switches          #    0.010 K/sec                  
               208      cpu-migrations            #    0.000 K/sec                  
           410,245      page-faults               #    0.231 K/sec                  
 6,543,657,202,637      cycles                    #    3.688 GHz                      (62.50%)
    19,185,813,296      stalled-cycles-frontend   #    0.29% frontend cycles idle     (62.50%)
 4,367,964,539,139      stalled-cycles-backend    #   66.75% backend cycles idle      (62.50%)
15,754,372,323,587      instructions              #    2.41  insn per cycle         
                                                  #    0.28  stalled cycles per insn  (62.50%)
 1,714,354,398,063      branches                  #  966.323 M/sec                    (62.50%)
    14,774,592,753      branch-misses             #    0.86% of all branches          (62.50%)
 4,600,032,669,607      L1-dcache-loads           # 2592.880 M/sec                    (62.50%)
    67,507,029,033      L1-dcache-load-misses     #    1.47% of all L1-dcache hits    (62.50%)
   <not supported>      LLC-loads                                                   
   <not supported>      LLC-load-misses                                             

     113.127529066 seconds time elapsed                                       
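For reference, the ~20% figure comes out of the two elapsed times above (32 hyperthreads vs 16 cores on the 1950X):

```python
# Rough hyperthreading comparison from the two perf runs above:
# 32 hyperthreads finished in ~96.2s, 16 cores in ~113.1s.
t_32ht = 96.199914   # seconds, 32-hyperthread run
t_16c = 113.127529   # seconds, 16-core run

speedup = t_16c / t_32ht
print(f"hyperthreading speedup: {speedup:.2f}x (~{speedup - 1:.0%})")
```

That works out to ~18%, roughly consistent with the "20% bump" estimate.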
JamesYang007 commented 2 years ago

Would you say this is as optimal as we're gonna get at the process-level? Only further optimizations we can do are smarter batching and actually parallelizing each batch.

To be clear, what you're measuring right now is a driver that's sequentially batching 64000 points, but processing each batch on the same machine. Processing of a batch is "process-level".

gjmulder commented 2 years ago

> Curious if you could measure L2/L3? Or is this suggesting that that's not even relevant to look at cuz L1 is being used so efficiently?

perf doesn't seem to provide L2/L3 stats. I'd need to learn how to use AMD's uProf.

gjmulder commented 2 years ago

> Would you say this is as optimal as we're gonna get at the process-level? Only further optimizations we can do are smarter batching and actually parallelizing each batch.
>
> To be clear, what you're measuring right now is a driver that's sequentially batching 64000 points, but processing each batch on the same machine. Processing of a batch is "process-level".

Right. I haven't had time to dig into the code. Just throwing some raw numbers out to ensure I have reproducible stable benchmarks.

JamesYang007 commented 2 years ago

> Would you say this is as optimal as we're gonna get at the process-level? Only further optimizations we can do are smarter batching and actually parallelizing each batch. To be clear, what you're measuring right now is a driver that's sequentially batching 64000 points, but processing each batch on the same machine. Processing of a batch is "process-level".
>
> Right. I haven't had time to dig into the code. Just throwing some raw numbers out to ensure I have reproducible stable benchmarks.

No problem - looking forward to hearing your thoughts on the code once you get a chance to look at it

gjmulder commented 2 years ago

Upgraded to kernel 5.4 and now have a lot more AMD perf stats. perf list for my AMD attached:

perf_list_amd.txt

$ perf stat -e `cat perf_amd_e_flags.txt` python ./examples/fit_driver_example.py
[[3.5144e-02 1.8238e-02 1.7895e-02 ... 1.0000e-06 1.0000e-06 0.0000e+00]]
[[0.       0.       0.       ... 0.014008 0.013977 0.013935]]
[[0.013899 0.013865 0.013814 ... 0.030403 0.03281  0.035805]]

 Performance counter stats for 'python ./examples/fit_driver_example.py':

   421,138,617,622      bp_l1_btb_correct                                             (4.27%)
    37,080,464,771      bp_l2_btb_correct                                             (4.28%)
       232,783,219      bp_l1_tlb_miss_l2_hit                                         (4.28%)
         8,586,092      bp_l1_tlb_miss_l2_miss                                        (4.28%)
        61,190,417      bp_snp_re_sync                                                (4.28%)
         8,710,898      bp_tlb_rel                                                    (4.28%)
     1,252,488,811      ic_cache_fill_l2                                              (4.28%)
     1,490,365,715      ic_cache_fill_sys                                             (4.28%)
    10,720,417,920      ic_cache_inval.fill_invalidated                                     (4.28%)
       164,760,224      ic_cache_inval.l2_invalidating_probe                                     (4.28%)
 6,775,831,762,225      ic_fetch_stall.ic_stall_any                                     (4.28%)
 2,240,819,302,449      ic_fetch_stall.ic_stall_back_pressure                                     (4.28%)
    61,544,040,884      ic_fetch_stall.ic_stall_dq_empty                                     (4.28%)
    27,867,174,745      ic_fw32                                                       (4.28%)
     1,423,893,063      ic_fw32_miss                                                  (4.28%)
       781,759,759      l2_cache_req_stat.ic_fill_hit_s                                     (4.28%)
            80,861      l2_cache_req_stat.ic_fill_hit_x                                     (4.28%)
     1,928,000,391      l2_cache_req_stat.ic_fill_miss                                     (4.28%)
     4,155,719,593      l2_cache_req_stat.ls_rd_blk_c                                     (4.28%)
       364,752,139      l2_cache_req_stat.ls_rd_blk_cs                                     (4.28%)
     4,979,085,335      l2_cache_req_stat.ls_rd_blk_l_hit_s                                     (4.28%)
    61,199,571,756      l2_cache_req_stat.ls_rd_blk_l_hit_x                                     (4.28%)
         4,597,034      l2_cache_req_stat.ls_rd_blk_x                                     (4.28%)
 3,784,715,147,816      l2_fill_pending.l2_fill_busy                                     (4.28%)
   946,393,360,922      l2_latency.l2_cycles_waiting_on_fills                                     (4.28%)
     2,574,386,902      l2_request_g1.cacheable_ic_read                                     (4.28%)
         8,997,807      l2_request_g1.change_to_x                                     (4.28%)
    73,215,177,868      l2_request_g1.l2_hw_pf                                        (4.28%)
       780,214,674      l2_request_g1.ls_rd_blk_c_s                                     (4.28%)
        95,079,420      l2_request_g1.other_requests                                     (4.28%)
                 0      l2_request_g1.prefetch_l2                                     (4.28%)
    69,924,266,451      l2_request_g1.rd_blk_l                                        (4.28%)
        22,731,771      l2_request_g1.rd_blk_x                                        (4.28%)
                 0      l2_request_g2.bus_locks_originator                                     (4.28%)
                 0      l2_request_g2.bus_locks_responses                                     (4.28%)
   145,665,691,726      l2_request_g2.group1                                          (4.28%)
                 0      l2_request_g2.ic_rd_sized                                     (4.28%)
                 0      l2_request_g2.ic_rd_sized_nc                                     (4.28%)
                 0      l2_request_g2.ls_rd_sized                                     (4.28%)
           865,394      l2_request_g2.ls_rd_sized_nc                                     (4.27%)
        87,121,757      l2_request_g2.smc_inval                                       (4.28%)
                 0      l2_wcb_req.cl_zero                                            (4.28%)
        83,500,463      l2_wcb_req.wcb_close                                          (4.28%)
       403,378,293      l2_wcb_req.wcb_write                                          (4.28%)
                 0      l2_wcb_req.zero_byte_store                                     (4.28%)
   <not supported>      l3_comb_clstr_state.other_l3_miss_typs                                   
   <not supported>      l3_comb_clstr_state.request_miss                                   
   <not supported>      l3_lookup_state.all_l3_req_typs                                   
   <not supported>      l3_request_g1.caching_l3_cache_accesses                                   
     1,876,846,382      ex_div_busy                                                   (4.27%)
        54,358,123      ex_div_count                                                  (4.27%)
 1,717,933,088,817      ex_ret_brn                                                    (4.27%)
         7,232,537      ex_ret_brn_far                                                (4.27%)
        51,139,866      ex_ret_brn_ind_misp                                           (4.28%)
    14,853,093,601      ex_ret_brn_misp                                               (4.28%)
         7,966,962      ex_ret_brn_resync                                             (4.28%)
 1,129,779,510,658      ex_ret_brn_tkn                                                (4.28%)
     7,046,224,506      ex_ret_brn_tkn_misp                                           (4.28%)
 1,526,614,937,949      ex_ret_cond                                                   (4.28%)
                 0      ex_ret_cond_misp                                              (4.28%)
15,463,315,661,427      ex_ret_cops                                                   (4.28%)
 1,271,112,618,508      ex_ret_fus_brnch_inst                                         (4.28%)
15,810,491,707,583      ex_ret_instr                                                  (4.28%)
                 0      ex_ret_mmx_fp_instr.mmx_instr                                     (4.28%)
 2,666,308,923,582      ex_ret_mmx_fp_instr.sse_instr                                     (4.28%)
    13,546,192,027      ex_ret_mmx_fp_instr.x87_instr                                     (4.28%)
     3,373,673,697      ex_ret_near_ret                                               (4.28%)
        10,258,239      ex_ret_near_ret_mispred                                       (4.28%)
                 0      ex_tagged_ibs_ops.ibs_count_rollover                                     (4.28%)
                 0      ex_tagged_ibs_ops.ibs_tagged_ops                                     (4.28%)
                 0      ex_tagged_ibs_ops.ibs_tagged_ops_ret                                     (4.28%)
   422,169,337,653      fp_num_mov_elim_scal_op.opt_potential                                     (4.28%)
   320,139,369,680      fp_num_mov_elim_scal_op.optimized                                     (4.28%)
   415,489,921,540      fp_num_mov_elim_scal_op.sse_mov_ops                                     (4.28%)
   415,348,659,844      fp_num_mov_elim_scal_op.sse_mov_ops_elim                                     (4.28%)
 1,141,936,195,622      fp_ret_sse_avx_ops.all                                        (3.42%)
   419,662,546,268      fp_ret_sse_avx_ops.dp_add_sub_flops                                     (2.57%)
   409,126,524,137      fp_ret_sse_avx_ops.dp_div_flops                                     (1.71%)
                 0      fp_ret_sse_avx_ops.dp_mult_add_flops                                     (1.71%)
   314,659,565,707      fp_ret_sse_avx_ops.dp_mult_flops                                     (1.71%)
                 0      fp_ret_sse_avx_ops.sp_add_sub_flops                                     (1.71%)
                 0      fp_ret_sse_avx_ops.sp_div_flops                                     (1.71%)
                 0      fp_ret_sse_avx_ops.sp_mult_add_flops                                     (1.71%)
                 0      fp_ret_sse_avx_ops.sp_mult_flops                                     (1.71%)
             1,480      fp_retired_ser_ops.sse_bot_ret                                     (2.57%)
                 0      fp_retired_ser_ops.sse_ctrl_ret                                     (2.56%)
             7,369      fp_retired_ser_ops.x87_bot_ret                                     (3.42%)
                 0      fp_retired_ser_ops.x87_ctrl_ret                                     (3.42%)
                 0      fp_retx87_fp_ops.add_sub_ops                                     (4.27%)
     4,497,859,808      fp_retx87_fp_ops.all                                          (4.27%)
               140      fp_retx87_fp_ops.div_sqr_r_ops                                     (4.27%)
     4,498,520,699      fp_retx87_fp_ops.mul_ops                                      (4.27%)
 3,214,423,022,698      fp_sched_empty                                                (4.27%)
 1,197,478,588,295      fpu_pipe_assignment.dual                                      (4.27%)
 2,013,921,233,227      fpu_pipe_assignment.total                                     (4.27%)
 4,553,968,296,327      ls_dc_accesses                                                (4.27%)
 3,987,055,505,739      ls_dispatch.ld_dispatch                                       (4.26%)
     5,092,075,976      ls_dispatch.ld_st_dispatch                                     (4.26%)
   612,062,067,350      ls_dispatch.store_dispatch                                     (4.26%)
        15,416,912      ls_inef_sw_pref.data_pipe_sw_pf_dc_hit                                     (4.26%)
         1,798,135      ls_inef_sw_pref.mab_mch_cnt                                     (4.26%)
     1,593,250,705      ls_l1_d_tlb_miss.all                                          (4.26%)
                 0      ls_l1_d_tlb_miss.tlb_reload_1g_l2_hit                                     (4.26%)
           251,415      ls_l1_d_tlb_miss.tlb_reload_1g_l2_miss                                     (4.26%)
         7,162,757      ls_l1_d_tlb_miss.tlb_reload_2m_l2_hit                                     (4.26%)
         2,615,872      ls_l1_d_tlb_miss.tlb_reload_2m_l2_miss                                     (4.26%)
        19,762,868      ls_l1_d_tlb_miss.tlb_reload_32k_l2_hit                                     (4.26%)
         4,919,013      ls_l1_d_tlb_miss.tlb_reload_32k_l2_miss                                     (4.26%)
     1,316,182,194      ls_l1_d_tlb_miss.tlb_reload_4k_l2_hit                                     (4.25%)
       241,622,321      ls_l1_d_tlb_miss.tlb_reload_4k_l2_miss                                     (4.25%)
                 0      ls_locks.bus_lock                                             (4.25%)
        63,601,020      ls_misal_accesses                                             (4.25%)
10,504,643,626,441      ls_not_halted_cyc                                             (4.25%)
        14,006,169      ls_pref_instr_disp.load_prefetch_w                                     (4.26%)
         6,863,978      ls_pref_instr_disp.prefetch_nta                                     (4.26%)
        12,891,159      ls_pref_instr_disp.store_prefetch_w                                     (4.26%)
    30,393,208,229      ls_stlf                                                       (4.26%)
       269,287,215      ls_tablewalker.perf_mon_tablewalk_alloc_dside                                     (4.26%)
        17,352,694      ls_tablewalker.perf_mon_tablewalk_alloc_iside                                     (4.26%)
    15,005,647,155      ic_oc_mode_switch.ic_oc_mode_switch                                     (4.27%)
    15,001,835,182      ic_oc_mode_switch.oc_ic_mode_switch                                     (4.27%)

      57.094810950 seconds time elapsed

    3353.617860000 seconds user
       6.296476000 seconds sys
JamesYang007 commented 2 years ago

What does the % mean? They don't seem to add up to 100% 👀

gjmulder commented 2 years ago

> What does the % mean? They don't seem to add up to 100% 👀

https://stackoverflow.com/questions/33679408/perf-what-do-n-percent-records-mean-in-perf-stat-output

(Short version: there are more events than hardware counters, so perf multiplexes them; the percentage is the fraction of the run each counter was actually active, and the reported counts are scaled estimates.)

gjmulder commented 2 years ago

I'd like to create some non-functional performance metrics that we can plot per commit to monitor performance regressions, e.g. ratio of hits to misses for branches, L1, L2 hits to misses, AVX versus non-AVX instructions, etc. Suggestions and ideas wanted!

brnch_pred_hit_ratio = bp_l1_tlb_miss_l2_hit / bp_l1_tlb_miss_l2_miss
sse_instr_ratio = ex_ret_instr / ex_ret_mmx_fp_instr.sse_instr
x87_instr_ratio = ex_ret_instr / ex_ret_mmx_fp_instr.x87_instr
avx_instr_ratio = ex_ret_instr / fp_ret_sse_avx_ops.all
avx_div_ratio = fp_ret_sse_avx_ops.all / fp_ret_sse_avx_ops.dp_div_flops
sims_wallclock_secs = sim_size / seconds time elapsed
sims_user_secs = sim_size / seconds user
sims_sys_secs = sim_size / seconds sys
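As a sketch of how these could be computed automatically: a naive parser over the `perf stat` text output (the counter names and sample values below are copied from the AMD run above; the "value then name" line format is an assumption about how the output is structured):

```python
import re

def parse_perf_counters(text):
    """Map counter name -> value from `perf stat` textual output.

    Naive: assumes each counter line starts with a comma-grouped
    integer followed by the event name; skips everything else.
    """
    counters = {}
    for line in text.splitlines():
        m = re.match(r"\s*([\d,]+)\s+([\w.]+)", line)
        if m:
            counters[m.group(2)] = int(m.group(1).replace(",", ""))
    return counters

# Sample lines taken from the perf output earlier in this thread.
sample = """
 1,717,933,088,817      ex_ret_brn
15,810,491,707,583      ex_ret_instr
 2,666,308,923,582      ex_ret_mmx_fp_instr.sse_instr
"""
c = parse_perf_counters(sample)
sse_instr_ratio = c["ex_ret_instr"] / c["ex_ret_mmx_fp_instr.sse_instr"]
print(f"sse_instr_ratio = {sse_instr_ratio:.2f}")
```

A per-commit CI job could dump these ratios to a CSV and plot them over time.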
JamesYang007 commented 2 years ago

> I'd like to create some non-functional performance metrics that we can plot per commit to monitor performance regressions, e.g. ratio of hits to misses for branches, L1, L2 hits to misses, AVX versus non-AVX instructions, etc. Suggestions and ideas wanted!
>
> brnch_pred_hit_ratio = bp_l1_tlb_miss_l2_hit / bp_l1_tlb_miss_l2_miss
> sse_instr_ratio = ex_ret_instr / ex_ret_mmx_fp_instr.sse_instr
> x87_instr_ratio = ex_ret_instr / ex_ret_mmx_fp_instr.x87_instr
> avx_instr_ratio = ex_ret_instr / fp_ret_sse_avx_ops.all
> avx_div_ratio = fp_ret_sse_avx_ops.all / fp_ret_sse_avx_ops.dp_div_flops
> sims_wallclock_secs = sim_size / seconds time elapsed
> sims_user_secs = sim_size / seconds user
> sims_sys_secs = sim_size / seconds sys

Looks great! The only comment I have is that the sim_size in the future won't be uniform, so the sims_per_sec metric may change for future examples. Each gridpoint can have a different sim_size but for simplicity I made the sim_size the same for all gridpoints in this script.

gjmulder commented 2 years ago

Done