Closed fairydreaming closed 2 days ago
I tried disassembling the crashing function:
```
Thread 9 "likwid-bench" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ff1056006c0 (LWP 5251)]
0x0000555555560683 in stream_mem ()
(gdb) disass
Dump of assembler code for function stream_mem:
   0x0000555555560600 <+0>: push %rbp
   0x0000555555560601 <+1>: mov %rsp,%rbp
   0x0000555555560604 <+4>: push %rbx
   0x0000555555560605 <+5>: push %r12
   0x0000555555560607 <+7>: push %r13
   0x0000555555560609 <+9>: push %r14
   0x000055555556060b <+11>: push %r15
   0x000055555556060d <+13>: movsd 0x254ab(%rip),%xmm4 # 0x555555585ac0
   0x0000555555560615 <+21>: xor %rax,%rax
   0x0000555555560618 <+24>: data16 cs nopw 0x0(%rax,%rax,1)
   0x0000555555560623 <+35>: data16 cs nopw 0x0(%rax,%rax,1)
   0x000055555556062e <+46>: data16 cs nopw 0x0(%rax,%rax,1)
   0x0000555555560639 <+57>: nopl 0x0(%rax)
   0x0000555555560640 <+64>: movsd (%rdx,%rax,8),%xmm0
   0x0000555555560645 <+69>: movsd 0x8(%rdx,%rax,8),%xmm1
   0x000055555556064b <+75>: movsd 0x10(%rdx,%rax,8),%xmm2
   0x0000555555560651 <+81>: movsd 0x18(%rdx,%rax,8),%xmm3
   0x0000555555560657 <+87>: mulsd %xmm4,%xmm0
   0x000055555556065b <+91>: addsd (%rcx,%rax,8),%xmm0
   0x0000555555560660 <+96>: mulsd %xmm4,%xmm1
   0x0000555555560664 <+100>: addsd 0x8(%rcx,%rax,8),%xmm1
   0x000055555556066a <+106>: mulsd %xmm4,%xmm2
   0x000055555556066e <+110>: addsd 0x10(%rcx,%rax,8),%xmm2
   0x0000555555560674 <+116>: mulsd %xmm4,%xmm3
   0x0000555555560678 <+120>: addsd 0x18(%rcx,%rax,8),%xmm3
   0x000055555556067e <+126>: movntdq %xmm0,(%rsi,%rax,8)
=> 0x0000555555560683 <+131>: movntdq %xmm1,0x8(%rsi,%rax,8)
   0x0000555555560689 <+137>: movntdq %xmm2,0x10(%rsi,%rax,8)
   0x000055555556068f <+143>: movntdq %xmm3,0x18(%rsi,%rax,8)
   0x0000555555560695 <+149>: add $0x4,%rax
   0x0000555555560699 <+153>: cmp %rdi,%rax
   0x000055555556069c <+156>: jl 0x555555560640 <stream_mem+64>
   0x000055555556069e <+158>: pop %r15
   0x00005555555606a0 <+160>: pop %r14
   0x00005555555606a2 <+162>: pop %r13
   0x00005555555606a4 <+164>: pop %r12
   0x00005555555606a6 <+166>: pop %rbx
   0x00005555555606a7 <+167>: mov %rbp,%rsp
   0x00005555555606aa <+170>: pop %rbp
   0x00005555555606ab <+171>: ret
End of assembler dump.
(gdb) info registers
rax 0x0 0
rbx 0x1 1
rcx 0x7ffea473e6c0 140731657479872
rdx 0x7fff4373e6c0 140734325057216
rsi 0x7fffe273e6c0 140736992634560
rdi 0x27bc868 41666664
rbp 0x7ff1055ffe20 0x7ff1055ffe20
rsp 0x7ff1055ffdf8 0x7ff1055ffdf8
r8 0x0 0
r9 0x40 64
r10 0x13de4340 333333312
r11 0x8 8
r12 0x27bc868 41666664
r13 0x555555571cba 93824992353466
r14 0x555555560600 93824992282112
r15 0x7ff1055ffe60 140673154023008
rip 0x555555560683 0x555555560683 <stream_mem+131>
eflags 0x10246 [ PF ZF IF RF ]
cs 0x33 51
ss 0x2b 43
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0
k0 0x1084081e 277088286
k1 0xfffe0000 4294836224
k2 0xfbfffbff 4227857407
k3 0x0 0
k4 0x2040400 33817600
k5 0x80 128
k6 0x0 0
k7 0x0 0
fs_base 0x7ff1056006c0 140673154025152
gs_base 0x0 0
```
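I wonder if the problem is caused by memory alignment when doing movntdq: the destination should be 16-byte aligned, but the faulting store targets 0x8(%rsi,%rax,8), which with rsi = 0x7fffe273e6c0 and rax = 0 is only 8-byte aligned.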
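The alignment hypothesis can be checked directly against the register dump. The following is a minimal sketch in Python (the helper names `store_address` and `is_aligned` are made up for illustration); it recomputes the effective address of each of the four `movntdq` stores from the `rsi` and `rax` values shown by gdb and tests it for 16-byte alignment.

```python
# Register values copied from the `info registers` output in this thread.
RSI = 0x7fffe273e6c0  # base pointer of the store stream (%rsi)
RAX = 0x0             # loop index (%rax)

# Displacements of the four movntdq stores in the unrolled loop body.
STORE_DISPLACEMENTS = [0x0, 0x8, 0x10, 0x18]


def store_address(base: int, index: int, disp: int, scale: int = 8) -> int:
    """Effective address of disp(base,index,scale) in AT&T syntax."""
    return base + index * scale + disp


def is_aligned(address: int, alignment: int = 16) -> bool:
    """True if the address is a multiple of the requested alignment."""
    return address % alignment == 0


# One (displacement, address, aligned?) tuple per store instruction.
alignment_report = [
    (disp, store_address(RSI, RAX, disp), is_aligned(store_address(RSI, RAX, disp)))
    for disp in STORE_DISPLACEMENTS
]

for disp, addr, ok in alignment_report:
    print(f"disp={disp:#04x} addr={addr:#x} 16-byte aligned: {ok}")
```

Under these values the first store is aligned and the second is not, which matches the crash location at `stream_mem+131` (the second store) rather than `+126`.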
Here's how I fixed this:
```diff
diff --git a/bench/GCC/stream_mem.pas.bak b/bench/GCC/stream_mem.pas
index 9e61bbc..8493f0e 100644
--- a/bench/GCC/stream_mem.pas.bak
+++ b/bench/GCC/stream_mem.pas
@@ -40,10 +40,10 @@ mulsd FPR3, FPR5
 addsd FPR3, [STR2 + GPR1*8+16]
 mulsd FPR4, FPR5
 addsd FPR4, [STR2 + GPR1*8+24]
-movntdq [STR0 + GPR1*8], FPR1
-movntdq [STR0 + GPR1*8+8], FPR2
-movntdq [STR0 + GPR1*8+16], FPR3
-movntdq [STR0 + GPR1*8+24], FPR4
+unpcklpd FPR1,FPR2
+unpcklpd FPR3,FPR4
+movntpd [STR0 + GPR1*8], FPR1
+movntpd [STR0 + GPR1*8+16], FPR3
 }
```
but my knowledge of assembly is limited, so I'm not 100% sure it's correct.
Thanks for the issue and the great analysis (:bouquet:). You are correct, `movntpd` requires 16-byte aligned addresses. Your fix using `unpck*` is correct.
Description for `unpcklpd FPR1,FPR2`:
FPR1[0:63] = FPR1[0:63]
FPR1[64:127] = FPR2[0:63]
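To illustrate why the fix stores the same data, here is a small Python model of the instruction (a sketch only: registers are modelled as `(low, high)` tuples of doubles, and the function names are invented for this example). It compares the bytes written by four 8-byte scalar stores with the bytes written by two 16-byte stores after `unpcklpd`.

```python
import struct


def unpcklpd(dst, src):
    """Model of `unpcklpd dst, src`: keep dst's low double, move src's low double into the high half."""
    return (dst[0], src[0])


def scalar_stores(results):
    """Four consecutive 8-byte stores of each register's low double."""
    return b"".join(struct.pack("<d", reg[0]) for reg in results)


def packed_stores(results):
    """Two 16-byte stores after packing pairs of results, as in the fix."""
    fpr1, fpr2, fpr3, fpr4 = results
    fpr1 = unpcklpd(fpr1, fpr2)  # FPR1 = (result0, result1)
    fpr3 = unpcklpd(fpr3, fpr4)  # FPR3 = (result2, result3)
    return struct.pack("<2d", *fpr1) + struct.pack("<2d", *fpr3)


# Example register contents; the high halves are arbitrary and get discarded.
regs = [(1.5, 0.0), (2.5, 0.0), (3.5, 0.0), (4.5, 0.0)]
print(scalar_stores(regs) == packed_stores(regs))  # True
```

Both variants write the same 32 bytes per iteration; the packed variant only touches the 16-byte aligned offsets +0 and +16.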
The main issue I see with the kernel is that it is not pure scalar code. It requires SSE to work since `movnt*` is an SSE instruction. Arithmetic is scalar but the data movement requires SSE.
Can you please open a PR with your fix so that you are associated with the fix? Please update the description to "uses scalar arithmetic and SSE non-temporal stores". There is currently no `stream_sp_mem`, so if you like, include that kernel in the PR as well.
@TomTheBear Sure, I'll try to prepare a PR.
While we are at it, are the `INSTR_LOOP 7` and `UOPS 8` values correct in stream_mem.ptt? I mean, in stream.ptt they are 19 and 26, but these kernels differ only in their store instructions; everything else is the same. So I think the correct value for INSTR_LOOP in stream_mem.ptt is 19 as well, and my fix didn't change the number of instructions.
Regarding UOPS: in stream.ptt, did you count 2 (unfused-domain) UOPS for the addsd instructions and the movsd stores and 1 UOP for everything else, that is 24 + 2 for the loop logic = 26? In the corrected stream_mem.ptt, the instruction movntpd(M128, XMM) also has 2 UOPS in the unfused domain, but the instruction unpcklpd(XMM, XMM) has only 1 UOP, so I guess the value of UOPS in stream_mem.ptt after my fix should be 22 + 2 = 24?
Please correct me if I'm wrong, as this is all something completely new to me.
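The counting in this comment can be written out as a short Python sketch. The per-instruction uop values are the unfused-domain numbers assumed in the comment itself, not measured values, and the loop body of stream.ptt is assumed to match the disassembly in this thread except for the stores. The table and function names are made up for illustration.

```python
# (instruction, count per loop iteration, assumed unfused-domain uops each)
STREAM = [
    ("movsd load", 4, 1),
    ("mulsd reg,reg", 4, 1),
    ("addsd mem,reg", 4, 2),
    ("movsd store", 4, 2),
]

STREAM_MEM_FIXED = [
    ("movsd load", 4, 1),
    ("mulsd reg,reg", 4, 1),
    ("addsd mem,reg", 4, 2),
    ("unpcklpd reg,reg", 2, 1),
    ("movntpd store", 2, 2),
]

LOOP_INSTRUCTIONS = 3  # add, cmp, jl as seen in the disassembly
LOOP_UOPS = 2          # loop logic as counted in the comment


def instr_loop(body):
    """Total instructions per iteration, including the loop logic."""
    return sum(count for _, count, _ in body) + LOOP_INSTRUCTIONS


def uops(body):
    """Total uops per iteration under the assumed per-instruction costs."""
    return sum(count * cost for _, count, cost in body) + LOOP_UOPS


print("stream:          ", instr_loop(STREAM), uops(STREAM))
print("stream_mem fixed:", instr_loop(STREAM_MEM_FIXED), uops(STREAM_MEM_FIXED))
```

With these assumptions both kernels come out at 19 instructions, and the uop totals are 26 and 24. A fused-domain count would give different totals.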
https://github.com/RRZE-HPC/likwid/pull/650#issuecomment-2500895809
I always use the fused-domain uops because that's what you get from the hardware when measuring uops retired.
Regarding your posts on phoronix.com:
* `likwid-bench` does not include the write-allocate/read-for-ownership (you call them phantom reads in your post). They could be included (as we know what is executing) but we agreed internally to not do that because it is not performance/data transfer as seen by the application. Moreover, recent Intel chips and ARM chips have their own mechanisms to avoid the write-allocate/RFO/"phantom reads" (Intel: SpecI2M, ARM: cache-line claim).
* `triad` is the "Schönauer Triad".
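The effect of leaving out write-allocate traffic can be sketched with a little arithmetic. The Python example below assumes a stream-like kernel with two load streams and one store stream of 8-byte doubles, and assumes that each regular store causes one additional read of the same size when write-allocate applies. The function name and parameters are hypothetical and not part of likwid.

```python
BYTES_PER_DOUBLE = 8


def bytes_per_iteration(loads: int, stores: int, write_allocate: bool) -> int:
    """Bytes moved per scalar loop iteration for the given number of streams."""
    traffic = (loads + stores) * BYTES_PER_DOUBLE
    if write_allocate:
        # Assumption: every stored element is first read into the cache.
        traffic += stores * BYTES_PER_DOUBLE
    return traffic


# Traffic as seen by the application (what the comment says is reported).
application_view = bytes_per_iteration(loads=2, stores=1, write_allocate=False)

# Traffic on the memory interface if every store triggers a write-allocate.
memory_view = bytes_per_iteration(loads=2, stores=1, write_allocate=True)

print(application_view, memory_view, memory_view / application_view)
```

Under these assumptions the memory interface moves 32 bytes per iteration while the application accounts for 24, a factor of about 1.33. Non-temporal stores, as used in stream_mem, are intended to avoid the extra read.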
Thank you for the clarification on this matter.
The stream_mem benchmark in likwid-bench always crashes after starting threads on my Epyc 9374F:
The crash happens in the stream_mem() function:
I tried some other benchmarks like stream, stream_avx512 and stream_mem_avx512; they run without any crashes.
Note that I have the "NUMA per socket" BIOS option set to NPS4 and the "ACPI SRAT L3 Cache as NUMA Domain" option enabled, so overall there are 8 NUMA domains in my system.
I have likwid version v5.4.0 compiled from the GitHub release source code. My operating system is Ubuntu 24.04.1 LTS. The gcc version is gcc (Ubuntu 13.2.0-23ubuntu4) 13.2.0.
Output of likwid-bench -p:
Output of likwid-topology -V 3:
Let me know if you need any other information.