Woohoo
Seems like 16 core, 32 hypercore is kind of a sweet-spot for getting the most bang for buck on each processor based on user time?
My 32 core Ryzen is a bit of a Quasimodo: I believe it's two 16 core Ryzens glued together, sharing the same memory channels. It might, however, indicate whether performance scales linearly with the raw number of cores. Clearly it doesn't, currently.
Also: 16 cores at 3.4 GHz versus 32 cores at 3 GHz.
32 core, 64 hypercore, 3GHz AMD Ryzen, 1E8 sim_size:
$ git diff
diff --git a/python/examples/fit_driver_example.py b/python/examples/fit_driver_example.py
index f0a2562..13f226b 100644
--- a/python/examples/fit_driver_example.py
+++ b/python/examples/fit_driver_example.py
@@ -7,7 +7,7 @@ import timeit
# ========== Toggleable ===============
n_arms = 3 # prioritize 3 first, then do 4
-sim_size = 100000
+sim_size = 1E8
n_thetas_1d = 64
n_threads = os.cpu_count()
max_batch_size = 64000
$ time python ./examples/fit_driver_example.py
[[3.501014e-02 1.808173e-02 1.774758e-02 ... 5.300000e-07 5.200000e-07
4.900000e-07]]
[[4.700000e-07 4.500000e-07 4.200000e-07 ... 1.387540e-02 1.384335e-02
1.380975e-02]]
[[0.01377296 0.01373336 0.01369075 ... 0.03004029 0.03243357 0.03535721]]
real 86m29.883s
user 5494m43.920s
sys 0m49.086s
Huh, that's interesting - the program scales linearly with sim_size, so I would've just expected it to be 100x slower, but this is insanely slower. sim_size also doesn't change the amount of memory allocated, so it can't be a cache-invalidation thing.
EDIT: nvm, I'm an idiot. It should be 1000x slower, but this is actually 33% faster than expected. That's wild af lol
EDIT: the user time checks out - it's about 1000x different. The real time is significantly smaller than 1000x, hmm..
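One way to read the real-vs-user gap is effective parallelism (user time divided by real time). A quick back-of-the-envelope check in Python using just the figures from the 1E8 run above:

```python
# Effective parallelism for the 1E8 run above: user / real ~= number of busy threads.
real_s = 86 * 60 + 29.883     # real    86m29.883s
user_s = 5494 * 60 + 43.920   # user  5494m43.920s
print(user_s / real_s)        # ~63.5 of the 64 hardware threads kept busy on average
```

So at this sim_size the run keeps essentially all 64 threads busy; a smaller sim_size spends a larger fraction of its real time in the serial parts, which would explain real time growing by less than the 1000x that user time does.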
I'm more tinkering to see how stable the benchmarking is, and learning how the code fits on my h/w. Code is looking rock solid from the outside!
32 core, 64 hypercore, 3GHz AMD Ryzen, sim_size = 1E6, n_arms = 4:
$ git diff
diff --git a/python/examples/fit_driver_example.py b/python/examples/fit_driver_example.py
index f0a2562..3166bcd 100644
--- a/python/examples/fit_driver_example.py
+++ b/python/examples/fit_driver_example.py
@@ -6,8 +6,8 @@ import os
import timeit
# ========== Toggleable ===============
-n_arms = 3 # prioritize 3 first, then do 4
-sim_size = 100000
+n_arms = 4 # prioritize 3 first, then do 4
+sim_size = 1E6
n_thetas_1d = 64
n_threads = os.cpu_count()
max_batch_size = 64000
top - 23:11:55 up 24 days, 9:19, 2 users, load average: 58.26, 58.41, 53.50
Tasks: 872 total, 65 running, 492 sleeping, 0 stopped, 0 zombie
%Cpu(s): 97.9 us, 1.7 sy, 0.0 ni, 0.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 65881328 total, 48181740 free, 6126788 used, 11572800 buff/cache
KiB Swap: 26843545+total, 26843545+free, 0 used. 59065844 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
61639 mulderg 20 0 2541600 2.141g 5268 R 100.0 3.4 0:21.43 python ./examples/fit_driver_example.py
61659 mulderg 20 0 2541600 2.141g 5268 R 100.0 3.4 0:21.55 python ./examples/fit_driver_example.py
61685 mulderg 20 0 2541600 2.141g 5268 R 100.0 3.4 0:21.23 python ./examples/fit_driver_example.py
61640 mulderg 20 0 2541600 2.141g 5268 R 100.0 3.4 0:21.17 python ./examples/fit_driver_example.py
61642 mulderg 20 0 2541600 2.141g 5268 R 100.0 3.4 0:21.57 python ./examples/fit_driver_example.py
$ time python ./examples/fit_driver_example.py
[[4.1934e-02 2.8729e-02 2.8318e-02 ... 2.0000e-05 2.0000e-05 1.6000e-05]]
..
[[3.5e-05 3.5e-05 3.6e-05 ... 6.1e-05 5.7e-05 5.6e-05]]
[[1.38e-04 1.38e-04 1.38e-04 ... 2.00e-06 2.00e-06 2.00e-06]]
[[1.0e-05 1.2e-05 1.4e-05 ... 0.0e+00 0.0e+00 0.0e+00]]
[[0. 0. 0. ... 0.038388 0.040173 0.042355]]
real 80m41.974s
user 4471m47.979s
sys 104m32.731s
B) though I'll be honest, I'm actually not sure what to read 😳 which parts indicate it's rock solid?
$ git diff
diff --git a/python/examples/fit_driver_example.py b/python/examples/fit_driver_example.py
index f0a2562..bd0256f 100644
--- a/python/examples/fit_driver_example.py
+++ b/python/examples/fit_driver_example.py
@@ -7,7 +7,7 @@ import timeit
# ========== Toggleable ===============
n_arms = 3 # prioritize 3 first, then do 4
-sim_size = 100000
+sim_size = 1E6
n_thetas_1d = 64
n_threads = os.cpu_count()
max_batch_size = 64000
$ perf stat -d python ./examples/fit_driver_example.py
[[3.5144e-02 1.8238e-02 1.7895e-02 ... 1.0000e-06 1.0000e-06 0.0000e+00]]
[[0. 0. 0. ... 0.014008 0.013977 0.013935]]
[[0.013899 0.013865 0.013814 ... 0.030403 0.03281 0.035805]]
Performance counter stats for 'python ./examples/fit_driver_example.py':
3234148.231872 task-clock (msec) # 59.745 CPUs utilized
93,351 context-switches # 0.029 K/sec
2,509 cpu-migrations # 0.001 K/sec
2,493,394 page-faults # 0.771 K/sec
10,420,532,143,891 cycles # 3.222 GHz (62.50%)
51,251,324,129 stalled-cycles-frontend # 0.49% frontend cycles idle (62.50%)
2,240,528,360,079 stalled-cycles-backend # 21.50% backend cycles idle (62.50%)
15,769,239,963,401 instructions # 1.51 insn per cycle
# 0.14 stalled cycles per insn (62.50%)
1,716,957,372,643 branches # 530.884 M/sec (62.50%)
14,894,850,349 branch-misses # 0.87% of all branches (62.50%)
4,555,392,000,568 L1-dcache-loads # 1408.529 M/sec (62.50%)
70,112,129,370 L1-dcache-load-misses # 1.54% of all L1-dcache hits (62.50%)
<not supported> LLC-loads
<not supported> LLC-load-misses
54.132595974 seconds time elapsed
$ perf stat -d python ./examples/fit_driver_example.py
[[3.5144e-02 1.8238e-02 1.7895e-02 ... 1.0000e-06 1.0000e-06 0.0000e+00]]
[[0. 0. 0. ... 0.014008 0.013977 0.013935]]
[[0.013899 0.013865 0.013814 ... 0.030403 0.03281 0.035805]]
Performance counter stats for 'python ./examples/fit_driver_example.py':
3254490.908910 task-clock (msec) # 59.617 CPUs utilized
99,789 context-switches # 0.031 K/sec
2,543 cpu-migrations # 0.001 K/sec
2,501,907 page-faults # 0.769 K/sec
10,449,776,185,505 cycles # 3.211 GHz (62.50%)
52,475,738,295 stalled-cycles-frontend # 0.50% frontend cycles idle (62.50%)
2,243,677,564,803 stalled-cycles-backend # 21.47% backend cycles idle (62.50%)
15,770,933,860,035 instructions # 1.51 insn per cycle
# 0.14 stalled cycles per insn (62.50%)
1,717,186,026,598 branches # 527.636 M/sec (62.50%)
14,828,360,464 branch-misses # 0.86% of all branches (62.50%)
4,555,580,279,731 L1-dcache-loads # 1399.783 M/sec (62.50%)
70,064,650,202 L1-dcache-load-misses # 1.54% of all L1-dcache hits (62.50%)
<not supported> LLC-loads
<not supported> LLC-load-misses
54.590368445 seconds time elapsed
God DAMN idk bout you but the branch/cache misses are really fkin low to me
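For reference, those percentages are just ratios of the raw counters; recomputing them from the first perf run above:

```python
# Raw counters from the first perf stat run above (1E6 sims, 64 threads)
branches        = 1_716_957_372_643
branch_misses   =    14_894_850_349
l1d_loads       = 4_555_392_000_568
l1d_load_misses =    70_112_129_370

print(f"branch miss rate: {branch_misses / branches:.2%}")      # ~0.87%
print(f"L1D miss rate:    {l1d_load_misses / l1d_loads:.2%}")   # ~1.54%
```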
32 * 3.4GHz hypercores, 1E6 sims:
$ perf stat -d python ./examples/fit_driver_example.py
[[0.035085 0.01819 0.017818 ... 0. 0. 0. ]]
[[0. 0. 0. ... 0.013945 0.013913 0.013879]]
[[0.013835 0.013795 0.013745 ... 0.030187 0.032635 0.035581]]
Performance counter stats for 'python ./examples/fit_driver_example.py':
2951151.967878 task-clock (msec) # 30.677 CPUs utilized
47,043 context-switches # 0.016 K/sec
536 cpu-migrations # 0.000 K/sec
781,589 page-faults # 0.265 K/sec
10,430,069,052,375 cycles # 3.534 GHz (62.50%)
48,118,231,695 stalled-cycles-frontend # 0.46% frontend cycles idle (62.50%)
2,199,289,555,202 stalled-cycles-backend # 21.09% backend cycles idle (62.50%)
15,755,215,709,858 instructions # 1.51 insn per cycle
# 0.14 stalled cycles per insn (62.50%)
1,714,743,589,563 branches # 581.042 M/sec (62.50%)
14,805,909,258 branch-misses # 0.86% of all branches (62.50%)
4,548,830,314,369 L1-dcache-loads # 1541.374 M/sec (62.50%)
69,930,776,088 L1-dcache-load-misses # 1.54% of all L1-dcache hits (62.50%)
<not supported> LLC-loads
<not supported> LLC-load-misses
96.199914237 seconds time elapsed
Curious if you could measure L2/L3? Or is this suggesting that that's not even relevant to look at cuz L1 is being used so efficiently?
16 * 3.4GHz hypercores, 1E6 sims, hyperthreading roughly giving us a 20% bump in speed:
$ perf stat -d python ./examples/fit_driver_example.py
[[0.03506 0.018187 0.01781 ... 0. 0. 0. ]]
[[0. 0. 0. ... 0.014094 0.014067 0.014031]]
[[0.013997 0.013952 0.013903 ... 0.030358 0.0328 0.035719]]
Performance counter stats for 'python ./examples/fit_driver_example.py':
1774101.453944 task-clock (msec) # 15.682 CPUs utilized
17,955 context-switches # 0.010 K/sec
208 cpu-migrations # 0.000 K/sec
410,245 page-faults # 0.231 K/sec
6,543,657,202,637 cycles # 3.688 GHz (62.50%)
19,185,813,296 stalled-cycles-frontend # 0.29% frontend cycles idle (62.50%)
4,367,964,539,139 stalled-cycles-backend # 66.75% backend cycles idle (62.50%)
15,754,372,323,587 instructions # 2.41 insn per cycle
# 0.28 stalled cycles per insn (62.50%)
1,714,354,398,063 branches # 966.323 M/sec (62.50%)
14,774,592,753 branch-misses # 0.86% of all branches (62.50%)
4,600,032,669,607 L1-dcache-loads # 2592.880 M/sec (62.50%)
67,507,029,033 L1-dcache-load-misses # 1.47% of all L1-dcache hits (62.50%)
<not supported> LLC-loads
<not supported> LLC-load-misses
113.127529066 seconds time elapsed
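Sanity-checking that ~20% figure from the elapsed times of the two 3.4 GHz runs above:

```python
# Elapsed times reported by perf stat for the two 1E6-sim runs above
elapsed_16_threads = 113.127529066   # 16 * 3.4GHz hypercores
elapsed_32_threads =  96.199914237   # 32 * 3.4GHz hypercores
print(elapsed_16_threads / elapsed_32_threads)   # ~1.18, i.e. roughly the 20% bump from hyperthreading
```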
Would you say this is as optimal as we're gonna get at the process-level? Only further optimizations we can do are smarter batching and actually parallelizing each batch.
To be clear, what you're measuring right now is a driver that's sequentially batching 64000 points, but processing each batch on the same machine. Processing of a batch is "process-level".
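In other words, the driver does something like the following rough sketch (grid_points and process_batch are illustrative names here, not the actual API):

```python
# Illustrative sketch of the driver loop being benchmarked: the grid is walked
# sequentially in chunks of max_batch_size, and each chunk is handed to the
# multithreaded "process-level" fitter. Names are hypothetical.
max_batch_size = 64000

def run_driver(grid_points, process_batch):
    results = []
    for start in range(0, len(grid_points), max_batch_size):
        batch = grid_points[start:start + max_batch_size]
        results.append(process_batch(batch))  # parallel across n_threads internally
    return results
```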
> Curious if you could measure L2/L3? Or is this suggesting that that's not even relevant to look at cuz L1 is being used so efficiently?
perf doesn't seem to provide L2/L3 stats. I'd need to learn how to use AMD's uProf.
> Would you say this is as optimal as we're gonna get at the process-level? Only further optimizations we can do are smarter batching and actually parallelizing each batch.
>
> To be clear, what you're measuring right now is a driver that's sequentially batching 64000 points, but processing each batch on the same machine. Processing of a batch is "process-level".
Right. I haven't had time to dig into the code. Just throwing some raw numbers out to ensure I have reproducible stable benchmarks.
> Would you say this is as optimal as we're gonna get at the process-level? Only further optimizations we can do are smarter batching and actually parallelizing each batch. To be clear, what you're measuring right now is a driver that's sequentially batching 64000 points, but processing each batch on the same machine. Processing of a batch is "process-level".
>
> Right. I haven't had time to dig into the code. Just throwing some raw numbers out to ensure I have reproducible stable benchmarks.
No problem - lookin forward to hearin your thoughts on the code once you get a chance to look at it
Upgraded to kernel 5.4 and now have a lot more AMD perf stats. The perf list output for my AMD is attached; here's a run with those events enabled:
$ perf stat -e `cat perf_amd_e_flags.txt` python ./examples/fit_driver_example.py
[[3.5144e-02 1.8238e-02 1.7895e-02 ... 1.0000e-06 1.0000e-06 0.0000e+00]]
[[0. 0. 0. ... 0.014008 0.013977 0.013935]]
[[0.013899 0.013865 0.013814 ... 0.030403 0.03281 0.035805]]
Performance counter stats for 'python ./examples/fit_driver_example.py':
421,138,617,622 bp_l1_btb_correct (4.27%)
37,080,464,771 bp_l2_btb_correct (4.28%)
232,783,219 bp_l1_tlb_miss_l2_hit (4.28%)
8,586,092 bp_l1_tlb_miss_l2_miss (4.28%)
61,190,417 bp_snp_re_sync (4.28%)
8,710,898 bp_tlb_rel (4.28%)
1,252,488,811 ic_cache_fill_l2 (4.28%)
1,490,365,715 ic_cache_fill_sys (4.28%)
10,720,417,920 ic_cache_inval.fill_invalidated (4.28%)
164,760,224 ic_cache_inval.l2_invalidating_probe (4.28%)
6,775,831,762,225 ic_fetch_stall.ic_stall_any (4.28%)
2,240,819,302,449 ic_fetch_stall.ic_stall_back_pressure (4.28%)
61,544,040,884 ic_fetch_stall.ic_stall_dq_empty (4.28%)
27,867,174,745 ic_fw32 (4.28%)
1,423,893,063 ic_fw32_miss (4.28%)
781,759,759 l2_cache_req_stat.ic_fill_hit_s (4.28%)
80,861 l2_cache_req_stat.ic_fill_hit_x (4.28%)
1,928,000,391 l2_cache_req_stat.ic_fill_miss (4.28%)
4,155,719,593 l2_cache_req_stat.ls_rd_blk_c (4.28%)
364,752,139 l2_cache_req_stat.ls_rd_blk_cs (4.28%)
4,979,085,335 l2_cache_req_stat.ls_rd_blk_l_hit_s (4.28%)
61,199,571,756 l2_cache_req_stat.ls_rd_blk_l_hit_x (4.28%)
4,597,034 l2_cache_req_stat.ls_rd_blk_x (4.28%)
3,784,715,147,816 l2_fill_pending.l2_fill_busy (4.28%)
946,393,360,922 l2_latency.l2_cycles_waiting_on_fills (4.28%)
2,574,386,902 l2_request_g1.cacheable_ic_read (4.28%)
8,997,807 l2_request_g1.change_to_x (4.28%)
73,215,177,868 l2_request_g1.l2_hw_pf (4.28%)
780,214,674 l2_request_g1.ls_rd_blk_c_s (4.28%)
95,079,420 l2_request_g1.other_requests (4.28%)
0 l2_request_g1.prefetch_l2 (4.28%)
69,924,266,451 l2_request_g1.rd_blk_l (4.28%)
22,731,771 l2_request_g1.rd_blk_x (4.28%)
0 l2_request_g2.bus_locks_originator (4.28%)
0 l2_request_g2.bus_locks_responses (4.28%)
145,665,691,726 l2_request_g2.group1 (4.28%)
0 l2_request_g2.ic_rd_sized (4.28%)
0 l2_request_g2.ic_rd_sized_nc (4.28%)
0 l2_request_g2.ls_rd_sized (4.28%)
865,394 l2_request_g2.ls_rd_sized_nc (4.27%)
87,121,757 l2_request_g2.smc_inval (4.28%)
0 l2_wcb_req.cl_zero (4.28%)
83,500,463 l2_wcb_req.wcb_close (4.28%)
403,378,293 l2_wcb_req.wcb_write (4.28%)
0 l2_wcb_req.zero_byte_store (4.28%)
<not supported> l3_comb_clstr_state.other_l3_miss_typs
<not supported> l3_comb_clstr_state.request_miss
<not supported> l3_lookup_state.all_l3_req_typs
<not supported> l3_request_g1.caching_l3_cache_accesses
1,876,846,382 ex_div_busy (4.27%)
54,358,123 ex_div_count (4.27%)
1,717,933,088,817 ex_ret_brn (4.27%)
7,232,537 ex_ret_brn_far (4.27%)
51,139,866 ex_ret_brn_ind_misp (4.28%)
14,853,093,601 ex_ret_brn_misp (4.28%)
7,966,962 ex_ret_brn_resync (4.28%)
1,129,779,510,658 ex_ret_brn_tkn (4.28%)
7,046,224,506 ex_ret_brn_tkn_misp (4.28%)
1,526,614,937,949 ex_ret_cond (4.28%)
0 ex_ret_cond_misp (4.28%)
15,463,315,661,427 ex_ret_cops (4.28%)
1,271,112,618,508 ex_ret_fus_brnch_inst (4.28%)
15,810,491,707,583 ex_ret_instr (4.28%)
0 ex_ret_mmx_fp_instr.mmx_instr (4.28%)
2,666,308,923,582 ex_ret_mmx_fp_instr.sse_instr (4.28%)
13,546,192,027 ex_ret_mmx_fp_instr.x87_instr (4.28%)
3,373,673,697 ex_ret_near_ret (4.28%)
10,258,239 ex_ret_near_ret_mispred (4.28%)
0 ex_tagged_ibs_ops.ibs_count_rollover (4.28%)
0 ex_tagged_ibs_ops.ibs_tagged_ops (4.28%)
0 ex_tagged_ibs_ops.ibs_tagged_ops_ret (4.28%)
422,169,337,653 fp_num_mov_elim_scal_op.opt_potential (4.28%)
320,139,369,680 fp_num_mov_elim_scal_op.optimized (4.28%)
415,489,921,540 fp_num_mov_elim_scal_op.sse_mov_ops (4.28%)
415,348,659,844 fp_num_mov_elim_scal_op.sse_mov_ops_elim (4.28%)
1,141,936,195,622 fp_ret_sse_avx_ops.all (3.42%)
419,662,546,268 fp_ret_sse_avx_ops.dp_add_sub_flops (2.57%)
409,126,524,137 fp_ret_sse_avx_ops.dp_div_flops (1.71%)
0 fp_ret_sse_avx_ops.dp_mult_add_flops (1.71%)
314,659,565,707 fp_ret_sse_avx_ops.dp_mult_flops (1.71%)
0 fp_ret_sse_avx_ops.sp_add_sub_flops (1.71%)
0 fp_ret_sse_avx_ops.sp_div_flops (1.71%)
0 fp_ret_sse_avx_ops.sp_mult_add_flops (1.71%)
0 fp_ret_sse_avx_ops.sp_mult_flops (1.71%)
1,480 fp_retired_ser_ops.sse_bot_ret (2.57%)
0 fp_retired_ser_ops.sse_ctrl_ret (2.56%)
7,369 fp_retired_ser_ops.x87_bot_ret (3.42%)
0 fp_retired_ser_ops.x87_ctrl_ret (3.42%)
0 fp_retx87_fp_ops.add_sub_ops (4.27%)
4,497,859,808 fp_retx87_fp_ops.all (4.27%)
140 fp_retx87_fp_ops.div_sqr_r_ops (4.27%)
4,498,520,699 fp_retx87_fp_ops.mul_ops (4.27%)
3,214,423,022,698 fp_sched_empty (4.27%)
1,197,478,588,295 fpu_pipe_assignment.dual (4.27%)
2,013,921,233,227 fpu_pipe_assignment.total (4.27%)
4,553,968,296,327 ls_dc_accesses (4.27%)
3,987,055,505,739 ls_dispatch.ld_dispatch (4.26%)
5,092,075,976 ls_dispatch.ld_st_dispatch (4.26%)
612,062,067,350 ls_dispatch.store_dispatch (4.26%)
15,416,912 ls_inef_sw_pref.data_pipe_sw_pf_dc_hit (4.26%)
1,798,135 ls_inef_sw_pref.mab_mch_cnt (4.26%)
1,593,250,705 ls_l1_d_tlb_miss.all (4.26%)
0 ls_l1_d_tlb_miss.tlb_reload_1g_l2_hit (4.26%)
251,415 ls_l1_d_tlb_miss.tlb_reload_1g_l2_miss (4.26%)
7,162,757 ls_l1_d_tlb_miss.tlb_reload_2m_l2_hit (4.26%)
2,615,872 ls_l1_d_tlb_miss.tlb_reload_2m_l2_miss (4.26%)
19,762,868 ls_l1_d_tlb_miss.tlb_reload_32k_l2_hit (4.26%)
4,919,013 ls_l1_d_tlb_miss.tlb_reload_32k_l2_miss (4.26%)
1,316,182,194 ls_l1_d_tlb_miss.tlb_reload_4k_l2_hit (4.25%)
241,622,321 ls_l1_d_tlb_miss.tlb_reload_4k_l2_miss (4.25%)
0 ls_locks.bus_lock (4.25%)
63,601,020 ls_misal_accesses (4.25%)
10,504,643,626,441 ls_not_halted_cyc (4.25%)
14,006,169 ls_pref_instr_disp.load_prefetch_w (4.26%)
6,863,978 ls_pref_instr_disp.prefetch_nta (4.26%)
12,891,159 ls_pref_instr_disp.store_prefetch_w (4.26%)
30,393,208,229 ls_stlf (4.26%)
269,287,215 ls_tablewalker.perf_mon_tablewalk_alloc_dside (4.26%)
17,352,694 ls_tablewalker.perf_mon_tablewalk_alloc_iside (4.26%)
15,005,647,155 ic_oc_mode_switch.ic_oc_mode_switch (4.27%)
15,001,835,182 ic_oc_mode_switch.oc_ic_mode_switch (4.27%)
57.094810950 seconds time elapsed
3353.617860000 seconds user
6.296476000 seconds sys
What does the % mean? They don't seem to add up to 100% 👀
> What does the % mean? They don't seem to add up to 100% 👀
https://stackoverflow.com/questions/33679408/perf-what-do-n-percent-records-mean-in-perf-stat-output - it's the fraction of the run during which that event was actually being counted: perf has to multiplex more events than there are hardware counters, so each event is only sampled part of the time and the count is scaled up accordingly.
I'd like to create some non-functional performance metrics that we can plot per commit to monitor performance regressions, e.g. ratio of hits to misses for branches, L1, L2 hits to misses, AVX versus non AVX instructions, etc. Suggestions and ideas wanted!
brnch_pred_hit_ratio = bp_l1_tlb_miss_l2_hit / bp_l1_tlb_miss_l2_miss
sse_instr_ratio = ex_ret_instr / ex_ret_mmx_fp_instr.sse_instr
x87_instr_ratio = ex_ret_instr / ex_ret_mmx_fp_instr.x87_instr
avx_instr_ratio = ex_ret_instr / fp_ret_sse_avx_ops.all
avx_div_ratio = fp_ret_sse_avx_ops.all / fp_ret_sse_avx_ops.dp_div_flops
sims_wallclock_secs = sim_size / seconds time elapsed
sims_user_secs = sim_size / seconds user
sims_sys_secs = sim_size / seconds sys
> I'd like to create some non-functional performance metrics that we can plot per commit to monitor performance regressions, e.g. ratio of hits to misses for branches, L1, L2 hits to misses, AVX versus non AVX instructions, etc. Suggestions and ideas wanted!
>
> brnch_pred_hit_ratio = bp_l1_tlb_miss_l2_hit / bp_l1_tlb_miss_l2_miss
> sse_instr_ratio = ex_ret_instr / ex_ret_mmx_fp_instr.sse_instr
> x87_instr_ratio = ex_ret_instr / ex_ret_mmx_fp_instr.x87_instr
> avx_instr_ratio = ex_ret_instr / fp_ret_sse_avx_ops.all
> avx_div_ratio = fp_ret_sse_avx_ops.all / fp_ret_sse_avx_ops.dp_div_flops
> sims_wallclock_secs = sim_size / seconds time elapsed
> sims_user_secs = sim_size / seconds user
> sims_sys_secs = sim_size / seconds sys
Looks great! The only comment I have is that the sim_size in the future won't be uniform, so the sims_per_sec metric may change for future examples. Each gridpoint can have a different sim_size, but for simplicity I made the sim_size the same for all gridpoints in this script.
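For what it's worth, a minimal sketch (not the final implementation) of how those ratios could be computed from a dict of perf counters plus the run timings; `counters`, `timings`, and `sim_sizes` are placeholders for however the per-commit job collects them, and the sims-per-second metrics sum sim_size over gridpoints per the comment above:

```python
# Sketch of the proposed per-commit regression metrics. Counter names follow
# the AMD perf events above; `counters`, `timings`, and `sim_sizes` are
# hypothetical inputs (dict of counts, dict of seconds, list of per-gridpoint sizes).
def perf_metrics(counters, timings, sim_sizes):
    total_sims = sum(sim_sizes)  # sim_size may differ per gridpoint
    return {
        "brnch_pred_hit_ratio": counters["bp_l1_tlb_miss_l2_hit"] / counters["bp_l1_tlb_miss_l2_miss"],
        "sse_instr_ratio": counters["ex_ret_instr"] / counters["ex_ret_mmx_fp_instr.sse_instr"],
        "x87_instr_ratio": counters["ex_ret_instr"] / counters["ex_ret_mmx_fp_instr.x87_instr"],
        "avx_instr_ratio": counters["ex_ret_instr"] / counters["fp_ret_sse_avx_ops.all"],
        "avx_div_ratio": counters["fp_ret_sse_avx_ops.all"] / counters["fp_ret_sse_avx_ops.dp_div_flops"],
        "sims_per_wallclock_sec": total_sims / timings["elapsed"],
        "sims_per_user_sec": total_sims / timings["user"],
        "sims_per_sys_sec": total_sims / timings["sys"],
    }
```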
Done
commit c4a884980340865a73a01795df4489e56963ef51: