flame / blis

BLAS-like Library Instantiation Software Framework

armsve: generic kernel and default cache values #616

Open egaudry opened 2 years ago

egaudry commented 2 years ago

As a user running on a node based on the Neoverse V1 design, I'd like to use the armsve kernels with a better performance level than the NEON-based ones.

This issue is a follow-up of https://github.com/flame/blis/issues/613 and https://github.com/flame/blis/issues/612, where the question of using generic values for

BLIS_SVE_W_L1 # L1 number of sets
BLIS_SVE_N_L1 # L1 associativity degree
BLIS_SVE_C_L1 # L1 cache line size in bytes
BLIS_SVE_W_L2 # L2 number of sets
BLIS_SVE_N_L2 # L2 associativity degree
BLIS_SVE_C_L2 # L2 cache line size in bytes
BLIS_SVE_W_L3 # any big value
BLIS_SVE_N_L3 # 4 is OK
BLIS_SVE_C_L3 # any big value

was raised, as performance using armsve at d5146582b1f1bcdccefe23925d3b114d40cd7e31 was 25% of that obtained with the thunderx2 kernels.
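
(For reference, these three parameters together describe a set-associative cache of W * N * C bytes under the usual model, so a 64KB, 4-way cache with 64-byte lines corresponds to W = 65536 / (4 * 64) = 256 sets.)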

As noted by @xrq-phys,

There are several macros/environment variables which need to be tuned to get good cache block sizes. It would be nice to have a way of getting values automatically.

The data found at https://en.wikichip.org/wiki/arm_holdings/microarchitectures/neoverse_v1 and https://developer.arm.com/documentation/101427/latest/ might help.

egaudry commented 2 years ago

For the sake of gathering information, here are some excerpts from the Arm Technical Reference Manual referenced above:

A6.1 About the L1 memory system
The Neoverse V1 L1 memory system is designed to enhance core performance and save power.
The L1 memory system consists of separate instruction and data caches. Both have a fixed size of 64KB.

A6.1.1 L1 instruction-side memory system
The L1 instruction memory system has the following key features:
• Virtually Indexed, Physically Tagged (VIPT) 4-way set-associative L1 instruction cache, which
behaves as a Physically Indexed, Physically Tagged (PIPT) cache
• Fixed cache line length of 64 bytes
• Pseudo-LRU cache replacement policy
• 256-bit read interface from the L2 memory system
• Optional instruction cache hardware coherency
The Neoverse V1 core also has a Virtually Indexed, Virtually Tagged (VIVT) 4-way skewed-associative,
Macro-OP (MOP) cache, which behaves as a PIPT cache.

A6.1.2 L1 data-side memory system
The L1 data memory system has the following features:
• Virtually Indexed, Physically Tagged (VIPT), which behaves as a Physically Indexed, Physically
Tagged (PIPT) 4-way set-associative L1 data cache
• Fixed cache line length of 64 bytes
• Pseudo-LRU cache replacement policy
• 512-bit write interface from the L2 memory system
• 512-bit read interface from the L2 memory system
• One 128-bit and two 256-bit read paths from the data L1 memory system to the datapath
• 256-bit write path from the datapath to the L1 memory system

A7.1 About the L2 memory system
The L2 memory subsystem consists of:
• An 8-way set associative L2 cache with a configurable size of 512KB or 1MB. Cache lines have a
fixed length of 64 bytes.
• ECC protection for all RAM structures except victim array.
• Strictly inclusive with L1 data cache. Weakly inclusive with L1 instruction cache.
• Configurable CHI interface to the DynamIQ Shared Unit (DSU) or CHI compliant system with
support for a 256-bit data width.
• Dynamic biased replacement policy.
• Modified Exclusive Shared Invalid (MESI) coherency
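
Taken at face value, and assuming sets = size / (associativity * line size), these figures give 64KB / (4 * 64B) = 256 sets for the L1 data cache, and 1MB / (8 * 64B) = 2048 sets for the 1MB L2 option (1024 for the 512KB option).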

egaudry commented 2 years ago

Retrieved from a running system:

/sys/devices/system/cpu/cpu0/cache/index0/allocation_policy:ReadWriteAllocate
/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size:64
/sys/devices/system/cpu/cpu0/cache/index0/level:1
/sys/devices/system/cpu/cpu0/cache/index0/number_of_sets:1
/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0
/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_map:00000001
/sys/devices/system/cpu/cpu0/cache/index0/type:Data
/sys/devices/system/cpu/cpu0/cache/index0/write_policy:WriteBack
/sys/devices/system/cpu/cpu0/cache/index1/allocation_policy:ReadAllocate
/sys/devices/system/cpu/cpu0/cache/index1/coherency_line_size:64
/sys/devices/system/cpu/cpu0/cache/index1/level:1
/sys/devices/system/cpu/cpu0/cache/index1/number_of_sets:1
/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0
/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_map:00000001
/sys/devices/system/cpu/cpu0/cache/index1/type:Instruction
/sys/devices/system/cpu/cpu0/cache/index1/write_policy:WriteBack
/sys/devices/system/cpu/cpu0/cache/index2/allocation_policy:ReadWriteAllocate
/sys/devices/system/cpu/cpu0/cache/index2/coherency_line_size:64
/sys/devices/system/cpu/cpu0/cache/index2/level:2
/sys/devices/system/cpu/cpu0/cache/index2/number_of_sets:1
/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0
/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_map:00000001
/sys/devices/system/cpu/cpu0/cache/index2/type:Unified
/sys/devices/system/cpu/cpu0/cache/index2/write_policy:WriteBack
/sys/devices/system/cpu/cpu0/cache/index3/allocation_policy:ReadWriteAllocate
/sys/devices/system/cpu/cpu0/cache/index3/coherency_line_size:64
/sys/devices/system/cpu/cpu0/cache/index3/level:3
/sys/devices/system/cpu/cpu0/cache/index3/number_of_sets:32768
/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list:0-31
/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_map:ffffffff
/sys/devices/system/cpu/cpu0/cache/index3/size:32768K
/sys/devices/system/cpu/cpu0/cache/index3/type:Unified
/sys/devices/system/cpu/cpu0/cache/index3/ways_of_associativity:16
/sys/devices/system/cpu/cpu0/cache/index3/write_policy:WriteBack

Based on this, I'm not sure how to set up the macros below correctly; note that number_of_sets reads 1 for the L1 and L2 entries above, so the system does not appear to expose their true geometry. I read https://github.com/flame/blis/blob/master/docs/ConfigurationHowTo.md, which is more oriented toward x86, and used the following values; performance nevertheless remained terrible.

#define W_L1_SVE_DEFAULT 64
#define N_L1_SVE_DEFAULT 4
#define C_L1_SVE_DEFAULT 64
#define W_L2_SVE_DEFAULT 512
#define N_L2_SVE_DEFAULT 8
#define C_L2_SVE_DEFAULT 64

I believe I missed some points regarding bli_cntx_init_armsve as well.
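
If the W_* macros are indeed set counts, the values should follow from the documented geometry rather than from the (empty) sysfs entries. A minimal sketch of that derivation, assuming size = sets * ways * line size (num_sets is a hypothetical helper for illustration, not a BLIS function):

#include <stdio.h>

/* Hypothetical helper (not part of BLIS): number of sets of a
   set-associative cache, assuming size = sets * ways * line_size. */
static unsigned num_sets( unsigned size_bytes, unsigned ways, unsigned line_bytes )
{
    return size_bytes / ( ways * line_bytes );
}

int main( void )
{
    /* Neoverse V1 geometry from the TRM excerpts quoted above. */
    printf( "W_L1_SVE_DEFAULT %u\n", num_sets(   64 * 1024, 4, 64 ) ); /* prints 256  */
    printf( "W_L2_SVE_DEFAULT %u\n", num_sets( 1024 * 1024, 8, 64 ) ); /* prints 2048 */
    return 0;
}

Under that assumption, the L1 value would be 256 rather than the 64 tried above.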

egaudry commented 2 years ago

@devinamatthews @xrq-phys this is the information I received for AWS Graviton 3:

L1: associativity 4, size 64KB, 256 sets, 64B lines
L2: associativity 8, size 1MB, 2k sets, 64B lines
L3: 32MB, 64B lines, massive associativity

would this translate to (the L3 values are just a guess)

#define W_L1_SVE_DEFAULT 256
#define N_L1_SVE_DEFAULT 4
#define C_L1_SVE_DEFAULT 64
#define W_L2_SVE_DEFAULT 2048
#define N_L2_SVE_DEFAULT 8
#define C_L2_SVE_DEFAULT 64
#define W_L3_SVE_DEFAULT 8192
#define N_L3_SVE_DEFAULT 4
#define C_L3_SVE_DEFAULT 64

?
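
(As a sanity check, assuming the same sets * ways * line-size model: 256 * 4 * 64 = 64KB and 2048 * 8 * 64 = 1MB match the reported L1 and L2 sizes, while the guessed L3 triplet works out to 2MB rather than 32MB; per the guidance quoted earlier, though, the L3 entries apparently only need to be "big".)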

xrq-phys commented 2 years ago

Sorry for the delay.

It seems strange to me that, even with correctly set cache sizes, you are only able to get 25% of the performance.

Another benchmark on V1 once showed me a 10% boost.

I'll see if I can find any Graviton 3 nodes available to me.

egaudry commented 2 years ago

Thanks, I would indeed expect a performance bump. FWIW, I'm using gcc-11 to build blis-master (explicitly building the armsve configuration).

egaudry commented 2 years ago

Reading the source, I do not see or understand how we would use an SVE256 implementation when running armsve on a Neoverse V1 chip: there is none.

xrq-phys commented 2 years ago

The armsve kernels are VL-agnostic.

egaudry commented 2 years ago

I'm sorry, I don't understand. Do you mean that no dedicated ASM kernels would be needed to get the best performance when running on 256-bit-wide SVE?

xrq-phys commented 2 years ago

Assembly kernels like bli_gemm_armsve_asm_d2vx10_unindexed.c https://github.com/xrq-phys/blis/blob/main-dev/kernels/armsve/3/bli_gemm_armsve_asm_d2vx10_unindexed.c are VL-agnostic: what that kernel computes is a 2*VL-by-10 GEMM.
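
For illustration, a minimal sketch (not BLIS code) of what VL-agnostic means in practice: the kernel queries the hardware vector length at run time, so the same binary covers 256-bit and 512-bit SVE implementations.

#include <arm_sve.h>
#include <stdint.h>
#include <stdio.h>

int main( void )
{
    /* svcntd() returns the number of 64-bit (double) lanes per SVE
       vector as implemented by the hardware: 4 on a 256-bit design
       such as Neoverse V1, 8 on a 512-bit design such as A64FX. */
    uint64_t vl = svcntd();
    printf( "VL = %llu doubles, so the 2*VL-by-10 microkernel is %llux10\n",
            (unsigned long long)vl, (unsigned long long)( 2 * vl ) );
    return 0;
}

(Built with e.g. gcc -march=armv8.2-a+sve; on a 256-bit implementation this reports an 8x10 microkernel.)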

Sorry, I didn't manage to grab time to investigate Graviton 3 performance. Maybe a week later.

egaudry commented 2 years ago

OK, thanks for your feedback; my understanding was obviously wrong then :). Good luck with the Graviton 3 when you have time!

xrq-phys commented 2 years ago

Hi.

I just tried to launch a C7g instance on AWS since it became generally available at the end of May.

However, I cannot seem to reproduce the 75% performance decline you reported in this issue. Rather, I see a 10% performance gain:

--- run/blis_on_c7g> cat tx2.x/tx2.out.m | grep dgemm_nn_ccc | grep 360 # ThunderX2 config w/ NEON kernels.
blis_dgemm_nn_ccc                  360   360   360    20.73   9.64e-18   PASS
--- run/blis_on_c7g> cat neon.x/firestorm.out.m | grep dgemm_nn_ccc | grep 360 # Firestorm config w/ NEON kernels.
blis_dgemm_nn_ccc                  360   360   360    21.35   9.61e-18   PASS
--- run/blis_on_c7g> cat native.x/sve256.out.m | grep dgemm_nn_ccc | grep 360 # ArmSVE config w/ SVE kernels.
blis_dgemm_nn_ccc                  360   360   360    23.20   9.72e-18   PASS

The full output is here: out.m.tar.gz