egaudry opened this issue 2 years ago
For the sake of gathering information, here are some excerpts from the Arm technical reference manual referenced above:
A6.1 About the L1 memory system
The Neoverse V1 L1 memory system is designed to enhance core performance and save power.
The L1 memory system consists of separate instruction and data caches. Both have a fixed size of 64KB.
A6.1.1 L1 instruction-side memory system
The L1 instruction memory system has the following key features:
• Virtually Indexed, Physically Tagged (VIPT) 4-way set-associative L1 instruction cache, which
behaves as a Physically Indexed, Physically Tagged (PIPT) cache
• Fixed cache line length of 64 bytes
• Pseudo-LRU cache replacement policy
• 256-bit read interface from the L2 memory system
• Optional instruction cache hardware coherency
The Neoverse V1 core also has a Virtually Indexed, Virtually Tagged (VIVT) 4-way skewed-associative,
Macro-OP (MOP) cache, which behaves as a PIPT cache.
A6.1.2 L1 data-side memory system
The L1 data memory system has the following features:
• Virtually Indexed, Physically Tagged (VIPT), which behaves as a Physically Indexed, Physically
Tagged (PIPT) 4-way set-associative L1 data cache
• Fixed cache line length of 64 bytes
• Pseudo-LRU cache replacement policy
• 512-bit write interface from the L2 memory system
• 512-bit read interface from the L2 memory system
• One 128-bit and two 256-bit read paths from the data L1 memory system to the datapath
• 256-bit write path from the datapath to the L1 memory system
A7.1 About the L2 memory system
The L2 memory subsystem consists of:
• An 8-way set associative L2 cache with a configurable size of 512KB or 1MB. Cache lines have a
fixed length of 64 bytes.
• ECC protection for all RAM structures except victim array.
• Strictly inclusive with L1 data cache. Weakly inclusive with L1 instruction cache.
• Configurable CHI interface to the DynamIQ Shared Unit (DSU) or a CHI-compliant system with
support for a 256-bit data width.
• Dynamic biased replacement policy.
• Modified Exclusive Shared Invalid (MESI) coherency
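For reference, the set counts implied by these figures follow from sets = size / (ways × line size):

L1 (instruction or data): 64KB / (4 × 64B) = 256 sets
L2: 512KB / (8 × 64B) = 1024 sets, or 1MB / (8 × 64B) = 2048 sets

These derived numbers come up again below when choosing the configuration macros.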
Retrieved from a running system:
/sys/devices/system/cpu/cpu0/cache/index0/allocation_policy:ReadWriteAllocate
/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size:64
/sys/devices/system/cpu/cpu0/cache/index0/level:1
/sys/devices/system/cpu/cpu0/cache/index0/number_of_sets:1
/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0
/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_map:00000001
/sys/devices/system/cpu/cpu0/cache/index0/type:Data
/sys/devices/system/cpu/cpu0/cache/index0/write_policy:WriteBack
/sys/devices/system/cpu/cpu0/cache/index1/allocation_policy:ReadAllocate
/sys/devices/system/cpu/cpu0/cache/index1/coherency_line_size:64
/sys/devices/system/cpu/cpu0/cache/index1/level:1
/sys/devices/system/cpu/cpu0/cache/index1/number_of_sets:1
/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0
/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_map:00000001
/sys/devices/system/cpu/cpu0/cache/index1/type:Instruction
/sys/devices/system/cpu/cpu0/cache/index1/write_policy:WriteBack
/sys/devices/system/cpu/cpu0/cache/index2/allocation_policy:ReadWriteAllocate
/sys/devices/system/cpu/cpu0/cache/index2/coherency_line_size:64
/sys/devices/system/cpu/cpu0/cache/index2/level:2
/sys/devices/system/cpu/cpu0/cache/index2/number_of_sets:1
/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0
/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_map:00000001
/sys/devices/system/cpu/cpu0/cache/index2/type:Unified
/sys/devices/system/cpu/cpu0/cache/index2/write_policy:WriteBack
/sys/devices/system/cpu/cpu0/cache/index3/allocation_policy:ReadWriteAllocate
/sys/devices/system/cpu/cpu0/cache/index3/coherency_line_size:64
/sys/devices/system/cpu/cpu0/cache/index3/level:3
/sys/devices/system/cpu/cpu0/cache/index3/number_of_sets:32768
/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list:0-31
/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_map:ffffffff
/sys/devices/system/cpu/cpu0/cache/index3/size:32768K
/sys/devices/system/cpu/cpu0/cache/index3/type:Unified
/sys/devices/system/cpu/cpu0/cache/index3/ways_of_associativity:16
/sys/devices/system/cpu/cpu0/cache/index3/write_policy:WriteBack
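Note that number_of_sets reads back as 1 for the index0-index2 entries above, and no size entries appear for them in this dump, so sysfs alone does not expose the full L1/L2 geometry on this system; the set counts have to come from the TRM figures. For completeness, a minimal C sketch for reading these attributes programmatically (read_cache_attr is a hypothetical helper of mine, not a BLIS function):

#include <stdio.h>

/* Read one sysfs cache attribute into buf; 0 on success, -1 otherwise. */
static int read_cache_attr(int cpu, int index, const char *attr,
                           char *buf, size_t len)
{
    char path[128];
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cache/index%d/%s",
             cpu, index, attr);
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    char *r = fgets(buf, (int)len, f);
    fclose(f);
    return r ? 0 : -1;
}

int main(void)
{
    char buf[64];
    /* index0 is the L1 data cache on the system above. */
    if (read_cache_attr(0, 0, "coherency_line_size", buf, sizeof(buf)) == 0)
        printf("L1D line size: %s", buf);
    if (read_cache_attr(0, 0, "number_of_sets", buf, sizeof(buf)) == 0)
        printf("L1D sets (as reported): %s", buf);
    return 0;
}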
Based on this, I'm not sure how to set up the macros below correctly. I read https://github.com/flame/blis/blob/master/docs/ConfigurationHowTo.md, which is more oriented toward x86, and I used the following values; performance remained terrible, however.
#define W_L1_SVE_DEFAULT 64
#define N_L1_SVE_DEFAULT 4
#define C_L1_SVE_DEFAULT 64
#define W_L2_SVE_DEFAULT 512
#define N_L2_SVE_DEFAULT 8
#define C_L2_SVE_DEFAULT 64
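One thing worth checking (my own reasoning, since I can't confirm from the BLIS docs which of W and N denotes ways and which denotes sets): under the usual identity size = ways × sets × line size, the product W × N × C should reproduce the cache size either way. Here it does not:

64 × 4 × 64B = 16KB, vs. the 64KB L1
512 × 8 × 64B = 256KB, vs. the 512KB or 1MB L2

so these values describe much smaller caches than the Neoverse V1 actually has.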
I believe I missed some points regarding bli_cntx_init_armsve as well.
@devinamatthews @xrq-phys this is the information I received for AWS Graviton 3:
L1: associativity 4, size 64KB, 256 sets, 64B lines
L2: associativity 8, size 1MB, 2k sets, 64B lines
L3: 32MB, 64B lines, massive associativity
Would this translate to the following (the L3 values are just a guess)?
#define W_L1_SVE_DEFAULT 256
#define N_L1_SVE_DEFAULT 4
#define C_L1_SVE_DEFAULT 64
#define W_L2_SVE_DEFAULT 2048
#define N_L2_SVE_DEFAULT 8
#define C_L2_SVE_DEFAULT 64
#define W_L3_SVE_DEFAULT 8192
#define N_L3_SVE_DEFAULT 4
#define C_L3_SVE_DEFAULT 64
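A quick consistency check on any candidate values (a minimal sketch of my own, not BLIS code; whichever of W and N means ways versus sets, W × N × C must equal the cache size):

#include <stdio.h>

/* Verify that W * N * C reproduces the expected cache size (in KB). */
static void check(const char *name, long w, long n, long c, long expect_kb)
{
    long kb = w * n * c / 1024;
    printf("%s: W*N*C = %ldKB (expected %ldKB) %s\n",
           name, kb, expect_kb, kb == expect_kb ? "OK" : "MISMATCH");
}

int main(void)
{
    /* Graviton 3 geometry quoted above. */
    check("L1", 256, 4, 64, 64);      /* 64KB: OK */
    check("L2", 2048, 8, 64, 1024);   /* 1MB: OK */
    /* The L3 guess of 8192/4/64 yields only 2MB; the sysfs dump above
       (32768 sets, 16 ways, 64B lines) gives the full 32MB. */
    check("L3 (guess)", 8192, 4, 64, 32768);
    check("L3 (sysfs)", 32768, 16, 64, 32768);
    return 0;
}

By this check, the L1 and L2 values above are internally consistent, but the L3 guess is 16x too small for a 32MB cache.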
Sorry for the delay.
It seems strange to me that, even with correctly set cache sizes, you are only able to get 25% of the performance. Another benchmark on V1 once showed me a 10% boost.
I'll see if I can find any Graviton 3 nodes available to me.
Thanks, I would indeed expect a performance bump. FWIW, I'm using gcc-11 to build blis-master (explicitly building the armsve variant).
Reading the source, I do not see how we would use an SVE256 implementation when running armsve on a Neoverse V1 chip: there is none.
The armsve kernels are VL-agnostic.
I'm sorry, I don't understand. Do you mean that no ASM kernels would be needed to get the best performance when running on 256-bit-wide SVE?
Assembly kernels like bli_gemm_armsve_asm_d2vx10_unindexed.c (https://github.com/xrq-phys/blis/blob/main-dev/kernels/armsve/3/bli_gemm_armsve_asm_d2vx10_unindexed.c) are VL-agnostic. What that kernel does is a (2*VL)-by-10 GEMM, where VL is the vector length the hardware reports at run time.
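To illustrate what VL-agnostic means (a minimal sketch of my own, not the actual BLIS microkernel): the code never hard-codes a vector width but queries it at run time with svcntd(), so one binary runs correctly on 128-, 256-, or 512-bit SVE implementations. Compile with, e.g., gcc -march=armv8.2-a+sve.

#include <arm_sve.h>
#include <stdint.h>

/* VL-agnostic daxpy-style loop: y := alpha*x + y.
   svcntd() returns the number of 64-bit lanes per vector at run time. */
void daxpy_sve(int64_t n, double alpha, const double *x, double *y)
{
    for (int64_t i = 0; i < n; i += (int64_t)svcntd())
    {
        svbool_t pg = svwhilelt_b64(i, n);      /* mask off the tail */
        svfloat64_t xv = svld1_f64(pg, x + i);
        svfloat64_t yv = svld1_f64(pg, y + i);
        yv = svmla_n_f64_x(pg, yv, xv, alpha);  /* yv += xv * alpha */
        svst1_f64(pg, y + i, yv);
    }
}

The BLIS kernel referenced above does the same kind of thing for a (2*VL)-by-10 block of the output matrix.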
Sorry, I didn't manage to find time to investigate Graviton 3 performance. Maybe in a week.
OK, thanks for your feedback; my understanding was obviously wrong then :). Good luck with the Graviton 3 when you have time!
Hi.
I just tried to launch a C7g instance on AWS since it became generally available at the end of May.
However, I cannot seem to reproduce the 75% performance decline you reported in this issue. Rather, I see a 10% performance gain:
--- run/blis_on_c7g> cat tx2.x/tx2.out.m | grep dgemm_nn_ccc | grep 360 # ThunderX2 config w/ NEON kernels.
blis_dgemm_nn_ccc 360 360 360 20.73 9.64e-18 PASS
--- run/blis_on_c7g> cat neon.x/firestorm.out.m | grep dgemm_nn_ccc | grep 360 # Firestorm config w/ NEON kernels.
blis_dgemm_nn_ccc 360 360 360 21.35 9.61e-18 PASS
--- run/blis_on_c7g> cat native.x/sve256.out.m | grep dgemm_nn_ccc | grep 360 # ArmSVE config w/ SVE kernels.
blis_dgemm_nn_ccc 360 360 360 23.20 9.72e-18 PASS
The full output is here: out.m.tar.gz
As a user running on a node based on the Neoverse V1 design, I'd like to use the armsve kernels with a better performance level than the NEON-based ones.
This issue is a follow-up to https://github.com/flame/blis/issues/613 and https://github.com/flame/blis/issues/612, where the question of using generic values for the cache parameters was raised, as performance using armsve at d5146582b1f1bcdccefe23925d3b114d40cd7e31 was 25% of that when running with the thunderx2 kernels.
As noted by @xrq-phys, the data found at https://en.wikichip.org/wiki/arm_holdings/microarchitectures/neoverse_v1 and https://developer.arm.com/documentation/101427/latest/ might help.