OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License

Illegal instruction in dcopy_k_CORE2 on a CORE2 #3263

Closed scottgasch closed 3 years ago

scottgasch commented 3 years ago

Hi,

I'm getting an Illegal Instruction signal when using OpenBLAS via numpy. Here's the call stack:

#0  0x0000000804921e9c in dcopy_k_CORE2 () from /usr/local/lib/libopenblas.so.0
#1  0x000000082b4d80b6 in cauchy (n=5713, x=..., l=..., u=..., nbd=..., g=...,
    iorder=..., iwhere=..., t=..., d=..., xcp=..., m=10, wy=..., ws=..., sy=...,
    wt=..., theta=1, col=0, head=1, p=..., c=..., wbp=..., v=..., nseg=0,
    iprint=-1, sbgnrm=0.16177334220291192, info=0, epsmch=2.2204460492503131e-16)
    at scipy/optimize/lbfgsb_src/lbfgsb.f:1507
#2  0x000000082b4dbb04 in mainlb (n=5713, m=10, x=..., l=..., u=..., nbd=...,
    f=0.48867889557385025, g=..., factr=10000000, pgtol=0.0001, ws=..., wy=...,
    sy=..., ss=..., wt=..., wn=..., snd=..., z=..., r=..., d=..., t=..., xp=...,
    wa=..., index=..., iwhere=..., indx2=..., task=..., iprint=-1, csave=...,
    lsave=..., isave=..., dsave=..., maxls=20, _task=60, _csave=60)
    at scipy/optimize/lbfgsb_src/lbfgsb.f:669
#3  0x000000082b4dd70b in setulb (n=5713, m=10, x=..., l=..., u=..., nbd=...,
    f=0.48867889557385025, g=..., factr=10000000, pgtol=0.0001, wa=..., iwa=...,
    task=..., iprint=-1, csave=..., lsave=..., isave=..., dsave=..., maxls=20,
    _task=60, _csave=60) at scipy/optimize/lbfgsb_src/lbfgsb.f:273
#4  0x000000082b4c98c2 in f2py_rout.lbfgsb_setulb ()
    from /home/scott/remote-execution/lib/python3.7/site-packages/scipy/optimize/_lbfgsb.so
#5  0x0000000800364242 in _PyObject_FastCallKeywords ()
    from /usr/local/lib/libpython3.7m.so.1.0
#6  0x0000000800424ddb in ?? () from /usr/local/lib/libpython3.7m.so.1.0
#7  0x000000080042204c in _PyEval_EvalFrameDefault ()
    from /usr/local/lib/libpython3.7m.so.1.0 ...

I've reproed this with OpenBLAS versions 0.3.15 and a couple of previous older builds. I've tried various TARGETs and can't find any that work.

The CPU I'm running on is this. It's old. Is that the problem?

eax in   eax      ebx      ecx      edx
00000000 0000000a 756e6547 6c65746e 49656e69
00000001 000006fb 01040800 0000e3bd bfebfbff
00000002 05b0b101 005657f0 00000000 2cb43049
00000003 00000000 00000000 00000000 00000000
00000004 0c000121 01c0003f 0000003f 00000001
00000005 00000040 00000040 00000003 00000020
00000006 00000001 00000002 00000001 00000000
00000007 00000000 00000000 00000000 00000000
00000008 00000400 00000000 00000000 00000000
00000009 00000000 00000000 00000000 00000000
0000000a 07280202 00000000 00000000 00000503
80000000 80000008 00000000 00000000 00000000
80000001 00000000 00000000 00000001 20100800
80000002 65746e49 2952286c 726f4320 4d542865
80000003 51203229 20646175 20555043 51202020
80000004 30303636 20402020 30342e32 007a4847
80000005 00000000 00000000 00000000 00000000
80000006 00000000 00000000 10008040 00000000
80000007 00000000 00000000 00000000 00000000
80000008 00003024 00000000 00000000 00000000

Vendor ID: "GenuineIntel"; CPUID level 10

Intel-specific functions:
Version 000006fb:
Type 0 - Original OEM
Family 6 - Pentium Pro
Model 15 - Intel Core2 family processor, 65nm
Stepping 11
Reserved 0

Extended brand string: "Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz"
CLFLUSH instruction cache line size: 8
Initial APIC ID: 1
Hyper threading siblings: 4
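The decoded identification above can be cross-checked against the raw register dump. The following is a minimal sketch, assuming the rows of that dump are in `eax ebx ecx edx` order as its header says; the function names are made up for illustration and are not part of any tool used in this thread.

```python
import struct

# Rows copied from the raw CPUID dump: leaf -> (eax, ebx, ecx, edx).
CPUID = {
    0x00000000: (0x0000000A, 0x756E6547, 0x6C65746E, 0x49656E69),
    0x00000001: (0x000006FB, 0x01040800, 0x0000E3BD, 0xBFEBFBFF),
    0x80000002: (0x65746E49, 0x2952286C, 0x726F4320, 0x4D542865),
    0x80000003: (0x51203229, 0x20646175, 0x20555043, 0x51202020),
    0x80000004: (0x30303636, 0x20402020, 0x30342E32, 0x007A4847),
}


def vendor_id(cpuid):
    """Leaf 0 stores the vendor string in EBX, EDX, ECX order."""
    _, ebx, ecx, edx = cpuid[0x00000000]
    return struct.pack("<III", ebx, edx, ecx).decode("ascii")


def brand_string(cpuid):
    """Leaves 0x80000002..4 hold 48 bytes of little-endian ASCII."""
    leaves = (0x80000002, 0x80000003, 0x80000004)
    raw = b"".join(struct.pack("<IIII", *cpuid[leaf]) for leaf in leaves)
    text = raw.rstrip(b"\x00").decode("ascii")
    return " ".join(text.split())  # collapse the padding spaces


def family_model_stepping(eax):
    """Split the leaf 1 EAX version word into (family, model, stepping)."""
    stepping = eax & 0xF
    model = (eax >> 4) & 0xF
    family = (eax >> 8) & 0xF
    if family in (0x6, 0xF):
        model |= ((eax >> 16) & 0xF) << 4  # extended model bits
    if family == 0xF:
        family += (eax >> 20) & 0xFF       # extended family bits
    return family, model, stepping


print(vendor_id(CPUID))
print(brand_string(CPUID))
print(family_model_stepping(CPUID[0x00000001][0]))
```

With the values from this report, the three results agree with the vendor ID, brand string, and family 6 / model 15 / stepping 11 shown in the decoded output.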

Feature flags set 1 (CPUID.01H:EDX): bfebfbff:
FPU Floating Point Unit
VME Virtual 8086 Mode Enhancements
DE Debugging Extensions
PSE Page Size Extensions
TSC Time Stamp Counter
MSR Model Specific Registers
PAE Physical Address Extension
MCE Machine Check Exception
CX8 COMPXCHG8B Instruction
APIC On-chip Advanced Programmable Interrupt Controller present and enabled
SEP Fast System Call
MTRR Memory Type Range Registers
PGE PTE Global Flag
MCA Machine Check Architecture
CMOV Conditional Move and Compare Instructions
FGPAT Page Attribute Table
PSE-36 36-bit Page Size Extension
CLFSH CFLUSH instruction
DS Debug store
ACPI Thermal Monitor and Clock Ctrl
MMX MMX instruction set
FXSR Fast FP/MMX Streaming SIMD Extensions save/restore
SSE Streaming SIMD Extensions instruction set
SSE2 SSE2 extensions
SS Self Snoop
HT Hyper Threading
TM Thermal monitor
31 Pending Break Enable

Feature flags set 2 (CPUID.01H:ECX): 0000e3bd:
SSE3 SSE3 extensions
DTES64 64-bit debug store
MONITOR MONITOR/MWAIT instructions
DS-CPL CPL Qualified Debug Store
VMX Virtual Machine Extensions
EST Enhanced Intel SpeedStep Technology
TM2 Thermal Monitor 2
SSSE3 Supplemental Streaming SIMD Extension 3
CX16 CMPXCHG16B
xTPR Send Task Priority messages
PDCM Perfmon and debug capability
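Since the question is whether an instruction is unsupported on this CPU, it can help to test the SIMD feature bits directly instead of reading the long decoded list. A minimal sketch, assuming the two leaf 1 register values reported here and the standard CPUID bit positions for these extensions; the helper name is illustrative only.

```python
# CPUID leaf 1 feature words as reported in this issue.
ECX_LEAF1 = 0x0000E3BD
EDX_LEAF1 = 0xBFEBFBFF

# Bit positions within each register for the SIMD extensions of interest.
EDX_BITS = {"SSE": 25, "SSE2": 26}
ECX_BITS = {"SSE3": 0, "SSSE3": 9, "SSE4.1": 19, "SSE4.2": 20, "AVX": 28}


def simd_features(ecx, edx):
    """Return {feature name: bool} for the bits listed above."""
    features = {name: bool((edx >> bit) & 1) for name, bit in EDX_BITS.items()}
    features.update({name: bool((ecx >> bit) & 1) for name, bit in ECX_BITS.items()})
    return features


for name, present in simd_features(ECX_LEAF1, EDX_LEAF1).items():
    print(f"{name:7s} {'yes' if present else 'no'}")
```

For this Q6600 the sketch reports SSE through SSSE3 as present and SSE4.1, SSE4.2 and AVX as absent, so code limited to SSE2, SSE3 or SSSE3 should not be unsupported by the CPU as such.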

Extended feature flags set 1 (CPUID.80000001H:EDX): 20100800
SYSCALL SYSCALL/SYSRET instructions
XD-bit Execution Disable bit
EM64T Intel Extended Memory 64 Technology

Extended feature flags set 2 (CPUID.80000001H:ECX): 00000001
LAHF LAHF/SAHF available in IA-32e mode

Old-styled TLB and cache info:
b1: Instruction TLB: 2MB Pages (8 entries) or 4MB pages (4 entries), 4-way set associative
b0: Instruction TLB: 4-KB Pages, 4-way set associative, 128 entries
05: Data TLB: 4MB pages, 4-way set assoc, 32 entries
f0: 64-byte prefetching
57: Data TLB: 4KB pages, 4-way set associative, 16 entries
56: Data TLB: 4MB pages, 4-way set associative, 16 entries
49: 3rd-level cache: 4MB, 16-way set associative, 64-byte line size (Intel Xeon MP, Family 0Fh, Model 06h OR 2nd-level cache: 4MB, 16-way set associative, 64-byte line size
30: 1st-level instruction cache: 32-KB, 8-way set associative, 64-byte line size
b4: Data TLB: 4-KB Pages, 4-way set associative, 256 entries
2c: 1st-level data cache: 32-KB, 8-way set associative, 64-byte line size

Deterministic Cache Parameters:
index=0: eax=0c000121 ebx=01c0003f ecx=0000003f edx=00000001
Data cache, level 1, self initializing
64 sets, 8 ways, 1 partitions, line size 64
full size 32768 bytes
NB this package has up to 4 threads
index=1: eax=0c000122 ebx=01c0003f ecx=0000003f edx=00000001
Instruction cache, level 1, self initializing
64 sets, 8 ways, 1 partitions, line size 64
full size 32768 bytes
index=2: eax=0c004143 ebx=03c0003f ecx=00000fff edx=00000001
Unified cache, level 2, self initializing
4096 sets, 16 ways, 1 partitions, line size 64
full size 4194304 bytes
shared between up to 2 threads

Here's the disassembled code:

Dump of assembler code for function dcopy_k_CORE2:
   0x0000000804921e00 <+0>: lea rdx,[rdx*8+0x0]
   0x0000000804921e08 <+8>: lea r8,[r8*8+0x0]
   0x0000000804921e10 <+16>: cmp rdx,0x8
   0x0000000804921e14 <+20>: jne 0x8049221a0 <dcopy_k_CORE2+928>
   0x0000000804921e1a <+26>: cmp r8,0x8
   0x0000000804921e1e <+30>: jne 0x8049221a0 <dcopy_k_CORE2+928>
   0x0000000804921e24 <+36>: test rcx,0x8
   0x0000000804921e2b <+43>: je 0x804921e50 <dcopy_k_CORE2+80>
   0x0000000804921e31 <+49>: movsd xmm0,QWORD PTR [rsi]
   0x0000000804921e35 <+53>: movsd QWORD PTR [rcx],xmm0
   0x0000000804921e39 <+57>: add rsi,0x8
   0x0000000804921e3d <+61>: add rcx,0x8
   0x0000000804921e41 <+65>: dec rdi
   0x0000000804921e44 <+68>: jle 0x804921fb0 <dcopy_k_CORE2+432>
   0x0000000804921e4a <+74>: nop WORD PTR [rax+rax*1+0x0]
   0x0000000804921e50 <+80>: sub rsi,0xffffffffffffff80
   0x0000000804921e54 <+84>: sub rcx,0xffffffffffffff80
   0x0000000804921e58 <+88>: test rsi,0x8
   0x0000000804921e5f <+95>: jne 0x804921fb8 <dcopy_k_CORE2+440>
   0x0000000804921e65 <+101>: mov rax,rdi
   0x0000000804921e68 <+104>: sar rax,0x4
   0x0000000804921e6c <+108>: jle 0x804921f10 <dcopy_k_CORE2+272>
   0x0000000804921e72 <+114>: movups xmm0,XMMWORD PTR [rsi-0x80]
   0x0000000804921e76 <+118>: movups xmm1,XMMWORD PTR [rsi-0x70]
   0x0000000804921e7a <+122>: movups xmm2,XMMWORD PTR [rsi-0x60]
   0x0000000804921e7e <+126>: movups xmm3,XMMWORD PTR [rsi-0x50]
   0x0000000804921e82 <+130>: movups xmm4,XMMWORD PTR [rsi-0x40]
   0x0000000804921e86 <+134>: movups xmm5,XMMWORD PTR [rsi-0x30]
   0x0000000804921e8a <+138>: movups xmm6,XMMWORD PTR [rsi-0x20]
   0x0000000804921e8e <+142>: movups xmm7,XMMWORD PTR [rsi-0x10]
   0x0000000804921e92 <+146>: dec rax
   0x0000000804921e95 <+149>: jle 0x804921ee8 <dcopy_k_CORE2+232>
   0x0000000804921e97 <+151>: nop
   0x0000000804921e98 <+152>: movups XMMWORD PTR [rcx-0x80],xmm0
=> 0x0000000804921e9c <+156>: movups xmm0,XMMWORD PTR [rsi]

In case it matters, the calling numpy is 1.20.3 and Python is 3.7.10. My repro is not simple, unfortunately, but it is consistent. Happy to provide it if anyone is interested.

Is this a known issue?

Thx!

martin-frbg commented 3 years ago

Not a known issue, but I guess few people are still running Core2 code on the actual hardware. The copy_k_CORE2 code (or kernel/x86_64/copy_sse2.S, which is shared among most x86_64 cpus) is even older, but I did change all (aligned) movaps instructions to (unaligned) movups in 0.3.10 to address a problem with unaligned data "needlessly" causing SIGILL on more recent cpus (#1137). It could be that this is not actually supported in all circumstances on old hardware, but it could also be something else entirely. Did your "couple of previous earlier builds" include anything before 0.3.10 by any chance?
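For context on the movaps/movups distinction: movaps requires its memory operand to sit on a 16-byte boundary, while movups accepts any address. A minimal sketch of how one might check the alignment of a buffer from Python, using only `ctypes`; the helper name is illustrative, and for a numpy array the address to test would be `arr.ctypes.data`.

```python
import ctypes


def is_aligned(address, boundary):
    """True if address is a multiple of boundary (boundary must be a power of two)."""
    return address & (boundary - 1) == 0


# Over-allocate so that a 16-byte boundary is guaranteed to fall inside the buffer.
buf = ctypes.create_string_buffer(64 + 16)
base = ctypes.addressof(buf)

aligned = (base + 15) & ~15  # first 16-byte boundary at or after base
shifted = aligned + 8        # fine for a single double, not for movaps

print(is_aligned(aligned, 16), is_aligned(shifted, 16), is_aligned(shifted, 8))
```

The `shifted` address is the kind of pointer (8-byte aligned but not 16-byte aligned) that the `test rcx,0x8` and `test rsi,0x8` instructions in the kernel appear to be checking for before the 16-byte moves.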

(One workaround would be to copy the relevant definitions from KERNEL.generic to KERNEL.CORE2:

SCOPYKERNEL  = ../arm/copy.c
DCOPYKERNEL  = ../arm/copy.c
CCOPYKERNEL  = ../arm/zcopy.c
ZCOPYKERNEL  = ../arm/zcopy.c

and rebuild the modified 0.3.15 with that.)

scottgasch commented 3 years ago

I tried the workaround (modifying KERNEL.CORE2 by adding those lines and rebuilding) with 0.3.15. I still get the SIGILL here:

#0  0x00000008045f429c in dcopy_k () from /usr/local/lib/libopenblas.so.0

The disassembly looks like the same code so I'm not sure if I messed up the build? The symbol name is all that changed?

Dump of assembler code for function dcopy_k:
   0x00000008045f4200 <+0>: lea rdx,[rdx*8+0x0]
   0x00000008045f4208 <+8>: lea r8,[r8*8+0x0]
   0x00000008045f4210 <+16>: cmp rdx,0x8
   0x00000008045f4214 <+20>: jne 0x8045f45a0 <dcopy_k+928>
   0x00000008045f421a <+26>: cmp r8,0x8
   0x00000008045f421e <+30>: jne 0x8045f45a0 <dcopy_k+928>
   0x00000008045f4224 <+36>: test rcx,0x8
   0x00000008045f422b <+43>: je 0x8045f4250 <dcopy_k+80>
   0x00000008045f4231 <+49>: movsd xmm0,QWORD PTR [rsi]
   0x00000008045f4235 <+53>: movsd QWORD PTR [rcx],xmm0
   0x00000008045f4239 <+57>: add rsi,0x8
   0x00000008045f423d <+61>: add rcx,0x8
   0x00000008045f4241 <+65>: dec rdi
   0x00000008045f4244 <+68>: jle 0x8045f43b0 <dcopy_k+432>
   0x00000008045f424a <+74>: nop WORD PTR [rax+rax*1+0x0]
   0x00000008045f4250 <+80>: sub rsi,0xffffffffffffff80
   0x00000008045f4254 <+84>: sub rcx,0xffffffffffffff80
   0x00000008045f4258 <+88>: test rsi,0x8
   0x00000008045f425f <+95>: jne 0x8045f43b8 <dcopy_k+440>
   0x00000008045f4265 <+101>: mov rax,rdi
   0x00000008045f4268 <+104>: sar rax,0x4
   0x00000008045f426c <+108>: jle 0x8045f4310 <dcopy_k+272>
   0x00000008045f4272 <+114>: movups xmm0,XMMWORD PTR [rsi-0x80]
   0x00000008045f4276 <+118>: movups xmm1,XMMWORD PTR [rsi-0x70]
   0x00000008045f427a <+122>: movups xmm2,XMMWORD PTR [rsi-0x60]
   0x00000008045f427e <+126>: movups xmm3,XMMWORD PTR [rsi-0x50]
   0x00000008045f4282 <+130>: movups xmm4,XMMWORD PTR [rsi-0x40]
   0x00000008045f4286 <+134>: movups xmm5,XMMWORD PTR [rsi-0x30]
   0x00000008045f428a <+138>: movups xmm6,XMMWORD PTR [rsi-0x20]
   0x00000008045f428e <+142>: movups xmm7,XMMWORD PTR [rsi-0x10]
   0x00000008045f4292 <+146>: dec rax
   0x00000008045f4295 <+149>: jle 0x8045f42e8 <dcopy_k+232>
   0x00000008045f4297 <+151>: nop
   0x00000008045f4298 <+152>: movups XMMWORD PTR [rcx-0x80],xmm0
=> 0x00000008045f429c <+156>: movups xmm0,XMMWORD PTR [rsi]

I also tried the repro with 0.3.9 and, as you suspected, it works fine. FYI, here's the code in copy_k in 0.3.9. It's aligned movs as you suspected:

Dump of assembler code for function dcopy_k:
=> 0x0000000804615800 <+0>: lea rdx,[rdx*8+0x0]
   0x0000000804615808 <+8>: lea r8,[r8*8+0x0]
   0x0000000804615810 <+16>: cmp rdx,0x8
   0x0000000804615814 <+20>: jne 0x804615ba0 <dcopy_k+928>
   0x000000080461581a <+26>: cmp r8,0x8
   0x000000080461581e <+30>: jne 0x804615ba0 <dcopy_k+928>
   0x0000000804615824 <+36>: test rcx,0x8
   0x000000080461582b <+43>: je 0x804615850 <dcopy_k+80>
   0x000000080461582d <+45>: movsd xmm0,QWORD PTR [rsi]
   0x0000000804615831 <+49>: movsd QWORD PTR [rcx],xmm0
   0x0000000804615835 <+53>: add rsi,0x8
   0x0000000804615839 <+57>: add rcx,0x8
   0x000000080461583d <+61>: dec rdi
   0x0000000804615840 <+64>: jle 0x8046159b0 <dcopy_k+432>
   0x0000000804615846 <+70>: nop WORD PTR cs:[rax+rax*1+0x0]
   0x0000000804615850 <+80>: sub rsi,0xffffffffffffff80
   0x0000000804615854 <+84>: sub rcx,0xffffffffffffff80
   0x0000000804615858 <+88>: test rsi,0x8
   0x000000080461585f <+95>: jne 0x8046159b8 <dcopy_k+440>
   0x0000000804615865 <+101>: mov rax,rdi
   0x0000000804615868 <+104>: sar rax,0x4
   0x000000080461586c <+108>: jle 0x804615910 <dcopy_k+272>
   0x0000000804615872 <+114>: movaps xmm0,XMMWORD PTR [rsi-0x80]
   0x0000000804615876 <+118>: movaps xmm1,XMMWORD PTR [rsi-0x70]
   0x000000080461587a <+122>: movaps xmm2,XMMWORD PTR [rsi-0x60]
   0x000000080461587e <+126>: movaps xmm3,XMMWORD PTR [rsi-0x50]
   0x0000000804615882 <+130>: movaps xmm4,XMMWORD PTR [rsi-0x40]
   0x0000000804615886 <+134>: movaps xmm5,XMMWORD PTR [rsi-0x30]
   0x000000080461588a <+138>: movaps xmm6,XMMWORD PTR [rsi-0x20]
   0x000000080461588e <+142>: movaps xmm7,XMMWORD PTR [rsi-0x10]
   0x0000000804615892 <+146>: dec rax
   0x0000000804615895 <+149>: jle 0x8046158e8 <dcopy_k+232>
   0x0000000804615897 <+151>: nop
   0x0000000804615898 <+152>: movaps XMMWORD PTR [rcx-0x80],xmm0
   0x000000080461589c <+156>: movaps xmm0,XMMWORD PTR [rsi]
   0x000000080461589f <+159>: movaps XMMWORD PTR [rcx-0x70],xmm1
   0x00000008046158a3 <+163>: movaps xmm1,XMMWORD PTR [rsi+0x10]
   0x00000008046158a7 <+167>: movaps XMMWORD PTR [rcx-0x60],xmm2
   0x00000008046158ab <+171>: movaps xmm2,XMMWORD PTR [rsi+0x20]
   0x00000008046158af <+175>: movaps XMMWORD PTR [rcx-0x50],xmm3
   0x00000008046158b3 <+179>: movaps xmm3,XMMWORD PTR [rsi+0x30]
   0x00000008046158b7 <+183>: movaps XMMWORD PTR [rcx-0x40],xmm4
   0x00000008046158bb <+187>: movaps xmm4,XMMWORD PTR [rsi+0x40]
   .....

Can you tell me again how to build a later version with the workaround? I basically started with a clean enlistment, modified kernel/x86/KERNEL.CORE2 to add those four lines at the bottom, and seem to have gotten the same copy_k_CORE2 / copy_k code after the build.

Thanks a lot for your help!

martin-frbg commented 3 years ago

You'll need to modify the KERNEL.CORE2 in kernel/x86_64 not x86 (unless you want to build OpenBLAS for 32bit).
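Before rebuilding, it may be worth confirming that the file actually edited contains the intended overrides. A minimal sketch, assuming the KERNEL file consists of simple `NAME = value` lines (Makefile conditionals and other constructs are ignored here); the function name and the sample text are hypothetical, not the real contents of KERNEL.CORE2.

```python
def parse_kernel_file(text):
    """Return {NAME: value} for simple 'NAME = value' lines; later lines win."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "=" not in line:
            continue
        name, value = line.split("=", 1)
        settings[name.strip()] = value.strip()
    return settings


# Hypothetical file content: an existing assignment followed by the workaround lines.
SAMPLE = """\
DCOPYKERNEL  = copy_sse2.S
SCOPYKERNEL  = ../arm/copy.c
DCOPYKERNEL  = ../arm/copy.c
CCOPYKERNEL  = ../arm/zcopy.c
ZCOPYKERNEL  = ../arm/zcopy.c
"""

for name, value in sorted(parse_kernel_file(SAMPLE).items()):
    print(name, "=", value)
```

In practice the text would come from something like `open("kernel/x86_64/KERNEL.CORE2").read()` in the OpenBLAS source tree being built.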

On the other hand, as 0.3.9 appears to work correctly for you, you could also copy the copy_sse2.S file from the kernel/x86_64 of 0.3.9 to the same location in 0.3.15 and leave the KERNEL.CORE2 unchanged.

scottgasch commented 3 years ago

Something weirder is going on here.

I copied copy_sse2.S from 0.3.9 to 0.3.15 kernel/x86_64 and rebuilt. I'm still seeing an illegal instruction at the same place, though (dcopy_k). To make sure that I got the new assembly I disassembled in gdb:

Dump of assembler code for function dcopy_k (my 0.3.15 with the .S file from 0.3.9):
   0x00000008045f4200 <+0>: lea rdx,[rdx*8+0x0]
   0x00000008045f4208 <+8>: lea r8,[r8*8+0x0]
   0x00000008045f4210 <+16>: cmp rdx,0x8
   0x00000008045f4214 <+20>: jne 0x8045f45a0 <dcopy_k+928>
   0x00000008045f421a <+26>: cmp r8,0x8
   0x00000008045f421e <+30>: jne 0x8045f45a0 <dcopy_k+928>
   0x00000008045f4224 <+36>: test rcx,0x8
   0x00000008045f422b <+43>: je 0x8045f4250 <dcopy_k+80>
   0x00000008045f4231 <+49>: movsd xmm0,QWORD PTR [rsi]
   0x00000008045f4235 <+53>: movsd QWORD PTR [rcx],xmm0
   0x00000008045f4239 <+57>: add rsi,0x8
   0x00000008045f423d <+61>: add rcx,0x8
   0x00000008045f4241 <+65>: dec rdi
   0x00000008045f4244 <+68>: jle 0x8045f43b0 <dcopy_k+432>
   0x00000008045f424a <+74>: nop WORD PTR [rax+rax*1+0x0]
   0x00000008045f4250 <+80>: sub rsi,0xffffffffffffff80
   0x00000008045f4254 <+84>: sub rcx,0xffffffffffffff80
   0x00000008045f4258 <+88>: test rsi,0x8
   0x00000008045f425f <+95>: jne 0x8045f43b8 <dcopy_k+440>
   0x00000008045f4265 <+101>: mov rax,rdi
   0x00000008045f4268 <+104>: sar rax,0x4
   0x00000008045f426c <+108>: jle 0x8045f4310 <dcopy_k+272>
   0x00000008045f4272 <+114>: movaps xmm0,XMMWORD PTR [rsi-0x80]
   0x00000008045f4276 <+118>: movaps xmm1,XMMWORD PTR [rsi-0x70]
   0x00000008045f427a <+122>: movaps xmm2,XMMWORD PTR [rsi-0x60]
   0x00000008045f427e <+126>: movaps xmm3,XMMWORD PTR [rsi-0x50]
   0x00000008045f4282 <+130>: movaps xmm4,XMMWORD PTR [rsi-0x40]
   0x00000008045f4286 <+134>: movaps xmm5,XMMWORD PTR [rsi-0x30]
   0x00000008045f428a <+138>: movaps xmm6,XMMWORD PTR [rsi-0x20]
   0x00000008045f428e <+142>: movaps xmm7,XMMWORD PTR [rsi-0x10]
   0x00000008045f4292 <+146>: dec rax
   0x00000008045f4295 <+149>: jle 0x8045f42e8 <dcopy_k+232>
   0x00000008045f4297 <+151>: nop
   0x00000008045f4298 <+152>: movaps XMMWORD PTR [rcx-0x80],xmm0
=> 0x00000008045f429c <+156>: movaps xmm0,XMMWORD PTR [rsi]

So I'm confused -- as you can see, it looks like this is a movaps and not a movups... so I suspect the build picked up the overwritten .S file... So I compared the 0.3.9 (working) disassembly with this one:

$ cat dis_0.3.9
0x0000000804615808 <+8>: lea r8,[r8*8+0x0]
0x0000000804615810 <+16>: cmp rdx,0x8
0x0000000804615814 <+20>: jne 0x804615ba0 <dcopy_k+928>
0x000000080461581a <+26>: cmp r8,0x8
0x000000080461581e <+30>: jne 0x804615ba0 <dcopy_k+928>
0x0000000804615824 <+36>: test rcx,0x8
0x000000080461582b <+43>: je 0x804615850 <dcopy_k+80>
0x000000080461582d <+45>: movsd xmm0,QWORD PTR [rsi]
0x0000000804615831 <+49>: movsd QWORD PTR [rcx],xmm0
0x0000000804615835 <+53>: add rsi,0x8
0x0000000804615839 <+57>: add rcx,0x8
0x000000080461583d <+61>: dec rdi
0x0000000804615840 <+64>: jle 0x8046159b0 <dcopy_k+432>
0x0000000804615846 <+70>: nop WORD PTR cs:[rax+rax*1+0x0]
0x0000000804615850 <+80>: sub rsi,0xffffffffffffff80
0x0000000804615854 <+84>: sub rcx,0xffffffffffffff80
0x0000000804615858 <+88>: test rsi,0x8
0x000000080461585f <+95>: jne 0x8046159b8 <dcopy_k+440>
0x0000000804615865 <+101>: mov rax,rdi
0x0000000804615868 <+104>: sar rax,0x4
0x000000080461586c <+108>: jle 0x804615910 <dcopy_k+272>
0x0000000804615872 <+114>: movaps xmm0,XMMWORD PTR [rsi-0x80]
0x0000000804615876 <+118>: movaps xmm1,XMMWORD PTR [rsi-0x70]
0x000000080461587a <+122>: movaps xmm2,XMMWORD PTR [rsi-0x60]
0x000000080461587e <+126>: movaps xmm3,XMMWORD PTR [rsi-0x50]
0x0000000804615882 <+130>: movaps xmm4,XMMWORD PTR [rsi-0x40]
0x0000000804615886 <+134>: movaps xmm5,XMMWORD PTR [rsi-0x30]
0x000000080461588a <+138>: movaps xmm6,XMMWORD PTR [rsi-0x20]
0x000000080461588e <+142>: movaps xmm7,XMMWORD PTR [rsi-0x10]
0x0000000804615892 <+146>: dec rax
0x0000000804615895 <+149>: jle 0x8046158e8 <dcopy_k+232>
0x0000000804615897 <+151>: nop
0x0000000804615898 <+152>: movaps XMMWORD PTR [rcx-0x80],xmm0
0x000000080461589c <+156>: movaps xmm0,XMMWORD PTR [rsi]

vs.

$ cat dis_0.3.15
0x00000008045f4200 <+0>: lea rdx,[rdx*8+0x0]
0x00000008045f4208 <+8>: lea r8,[r8*8+0x0]
0x00000008045f4210 <+16>: cmp rdx,0x8
0x00000008045f4214 <+20>: jne 0x8045f45a0 <dcopy_k+928>
0x00000008045f421a <+26>: cmp r8,0x8
0x00000008045f421e <+30>: jne 0x8045f45a0 <dcopy_k+928>
0x00000008045f4224 <+36>: test rcx,0x8
0x00000008045f422b <+43>: je 0x8045f4250 <dcopy_k+80>
0x00000008045f4231 <+49>: movsd xmm0,QWORD PTR [rsi]
0x00000008045f4235 <+53>: movsd QWORD PTR [rcx],xmm0
0x00000008045f4239 <+57>: add rsi,0x8
0x00000008045f423d <+61>: add rcx,0x8
0x00000008045f4241 <+65>: dec rdi
0x00000008045f4244 <+68>: jle 0x8045f43b0 <dcopy_k+432>
0x00000008045f424a <+74>: nop WORD PTR [rax+rax*1+0x0]
0x00000008045f4250 <+80>: sub rsi,0xffffffffffffff80
0x00000008045f4254 <+84>: sub rcx,0xffffffffffffff80
0x00000008045f4258 <+88>: test rsi,0x8
0x00000008045f425f <+95>: jne 0x8045f43b8 <dcopy_k+440>
0x00000008045f4265 <+101>: mov rax,rdi
0x00000008045f4268 <+104>: sar rax,0x4
0x00000008045f426c <+108>: jle 0x8045f4310 <dcopy_k+272>
0x00000008045f4272 <+114>: movaps xmm0,XMMWORD PTR [rsi-0x80]
0x00000008045f4276 <+118>: movaps xmm1,XMMWORD PTR [rsi-0x70]
0x00000008045f427a <+122>: movaps xmm2,XMMWORD PTR [rsi-0x60]
0x00000008045f427e <+126>: movaps xmm3,XMMWORD PTR [rsi-0x50]
0x00000008045f4282 <+130>: movaps xmm4,XMMWORD PTR [rsi-0x40]
0x00000008045f4286 <+134>: movaps xmm5,XMMWORD PTR [rsi-0x30]
0x00000008045f428a <+138>: movaps xmm6,XMMWORD PTR [rsi-0x20]
0x00000008045f428e <+142>: movaps xmm7,XMMWORD PTR [rsi-0x10]
0x00000008045f4292 <+146>: dec rax
0x00000008045f4295 <+149>: jle 0x8045f42e8 <dcopy_k+232>
0x00000008045f4297 <+151>: nop
0x00000008045f4298 <+152>: movaps XMMWORD PTR [rcx-0x80],xmm0
0x00000008045f429c <+156>: movaps xmm0,XMMWORD PTR [rsi]

It seems like the "preamble" (first few instructions) has changed:

0x00000008045f4200 <+0>: lea rdx,[rdx*8+0x0]
0x00000008045f4208 <+8>: lea r8,[r8*8+0x0]

vs. in 0.3.9 we got:

0x0000000804615808 <+8>: lea r8,[r8*8+0x0]

Some of the "same" instructions are at new offsets. Like the nop at +70 in 0.3.9:

0x0000000804615846 <+70>: nop WORD PTR cs:[rax+rax*1+0x0]

Has moved to +74 in 0.3.15:

0x00000008045f424a <+74>: nop WORD PTR [rax+rax*1+0x0]

I wonder if just making the opcode into an aligned move is enough? Is there perhaps a command-line argument to the assembler, telling it to pad things so that they are aligned, that is in use in 0.3.9 but not in 0.3.15? If so, that might explain why just copying around a .S file isn't working.
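The offset shifts noted above can be quantified by deriving each instruction's encoded length from the gap to the next offset. A minimal sketch, assuming gdb-style `<+N>: mnemonic ...` lines; the two excerpts are copied from the 0.3.9 and 0.3.15 dumps in this comment, and the function names are illustrative only.

```python
import re

# Matches the function-relative offset and the mnemonic, e.g. "<+43>: je".
LINE = re.compile(r"<\+(\d+)>:\s+(\S+)")


def instruction_lengths(dump):
    """Return [(mnemonic, length_in_bytes)] for all but the last instruction."""
    found = [(int(m.group(1)), m.group(2)) for m in LINE.finditer(dump)]
    return [(mn, nxt - off) for (off, mn), (nxt, _) in zip(found, found[1:])]


def length_changes(old_dump, new_dump):
    """Pair instructions by position and report those whose length differs."""
    pairs = zip(instruction_lengths(old_dump), instruction_lengths(new_dump))
    return [(a, la, lb) for (a, la), (b, lb) in pairs if a == b and la != lb]


OLD = """
0x0000000804615824 <+36>: test rcx,0x8
0x000000080461582b <+43>: je 0x804615850 <dcopy_k+80>
0x000000080461582d <+45>: movsd xmm0,QWORD PTR [rsi]
0x0000000804615831 <+49>: movsd QWORD PTR [rcx],xmm0
0x0000000804615835 <+53>: add rsi,0x8
0x0000000804615839 <+57>: add rcx,0x8
0x000000080461583d <+61>: dec rdi
0x0000000804615840 <+64>: jle 0x8046159b0 <dcopy_k+432>
0x0000000804615846 <+70>: nop WORD PTR cs:[rax+rax*1+0x0]
0x0000000804615850 <+80>: sub rsi,0xffffffffffffff80
"""

NEW = """
0x00000008045f4224 <+36>: test rcx,0x8
0x00000008045f422b <+43>: je 0x8045f4250 <dcopy_k+80>
0x00000008045f4231 <+49>: movsd xmm0,QWORD PTR [rsi]
0x00000008045f4235 <+53>: movsd QWORD PTR [rcx],xmm0
0x00000008045f4239 <+57>: add rsi,0x8
0x00000008045f423d <+61>: add rcx,0x8
0x00000008045f4241 <+65>: dec rdi
0x00000008045f4244 <+68>: jle 0x8045f43b0 <dcopy_k+432>
0x00000008045f424a <+74>: nop WORD PTR [rax+rax*1+0x0]
0x00000008045f4250 <+80>: sub rsi,0xffffffffffffff80
"""

print(length_changes(OLD, NEW))
```

On these excerpts only two encodings differ: the `je` grows from 2 to 6 bytes and the padding `nop` shrinks from 10 to 6 bytes, so both builds reach the `sub` at +80 again. That would be consistent with the same instruction sequence being encoded differently, not with a change in the kernel logic itself.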

FWIW I also tried a clean build on 0.3.15 with your additional four lines in x86_64/KERNEL.CORE2. I assume this avoids the .S file completely and uses some (slower) .c code for the function? This also doesn't seem to work... SIGILL in the same function. This time it looks like:

   0x00000008045f3760 <+0>: push rbp
   0x00000008045f3761 <+1>: mov rbp,rsp
   0x00000008045f3764 <+4>: push r15
   0x00000008045f3766 <+6>: push r14
   0x00000008045f3768 <+8>: push r13
   0x00000008045f376a <+10>: push r12
   0x00000008045f376c <+12>: push rbx
   0x00000008045f376d <+13>: test rdi,rdi
   0x00000008045f3770 <+16>: jle 0x8045f3872 <dcopy_k+274>
   0x00000008045f3776 <+22>: lea rax,[rdi-0x1]
   0x00000008045f377a <+26>: mov r13d,edi
   0x00000008045f377d <+29>: and r13d,0x3
   0x00000008045f3781 <+33>: cmp rax,0x3
   0x00000008045f3785 <+37>: mov QWORD PTR [rbp-0x38],rsi
   0x00000008045f3789 <+41>: mov QWORD PTR [rbp-0x30],rcx
   0x00000008045f378d <+45>: jae 0x8045f3798 <dcopy_k+56>
   0x00000008045f378f <+47>: xor ebx,ebx
   0x00000008045f3791 <+49>: xor eax,eax
   0x00000008045f3793 <+51>: jmp 0x8045f383b <dcopy_k+219>
   0x00000008045f3798 <+56>: mov rax,r8
   0x00000008045f379b <+59>: shl rax,0x5
   0x00000008045f379f <+63>: mov QWORD PTR [rbp-0x50],rax
   0x00000008045f37a3 <+67>: lea rax,[r8*8+0x0]
   0x00000008045f37ab <+75>: lea rax,[rax+rax*2]
   0x00000008045f37af <+79>: mov QWORD PTR [rbp-0x48],rax
   0x00000008045f37b3 <+83>: mov r11,r8
   0x00000008045f37b6 <+86>: shl r11,0x4
   0x00000008045f37ba <+90>: mov r14,rdx
   0x00000008045f37bd <+93>: shl r14,0x5
   0x00000008045f37c1 <+97>: lea rax,[rdx*8+0x0]
   0x00000008045f37c9 <+105>: lea rax,[rax+rax*2]
   0x00000008045f37cd <+109>: mov QWORD PTR [rbp-0x40],rax
   0x00000008045f37d1 <+113>: mov r12,rdx
   0x00000008045f37d4 <+116>: shl r12,0x4
   0x00000008045f37d8 <+120>: and rdi,0xfffffffffffffffc
   0x00000008045f37dc <+124>: neg rdi
   0x00000008045f37df <+127>: xor ebx,ebx
   0x00000008045f37e1 <+129>: mov r9,rsi
   0x00000008045f37e4 <+132>: mov r15,rcx
   0x00000008045f37e7 <+135>: xor eax,eax
   0x00000008045f37e9 <+137>: nop DWORD PTR [rax+0x0]
=> 0x00000008045f37f0 <+144>: mov r10,QWORD PTR [r9]

So, totally different opcodes... I think that the code is changing. I'm beginning to suspect that it has to do with how the data that's being copied here is aligned which may be driven by a change in build flags between 0.3.9 and 0.3.15.

martin-frbg commented 3 years ago

Hm, this is indeed getting ugly. One other recent change was the introduction of additional compiler flags (-msse2 and the like), made necessary by contributors that used C-style Intel intrinsics instead of handcoded assembly.

scottgasch commented 3 years ago

I'm capturing the builds from 0.3.9 and 0.3.15 to compare them. I'll look at -msse2 and other compiler flag differences and see if I can figure out how to modify the Makefiles in 0.3.15 with stock code to work and report back.

martin-frbg commented 3 years ago

The difference would probably be the -msse3 -mssse3 in Makefile.x86_64 (which is required for the DOT kernels that live in kernel/generic/dot.c and IIRC the SUM kernels from kernel/arm/sum.c as well - not strictly necessary to apply them globally but I did not expect this kind of side effect as users would have been free to add them in their CFLAGS for years)

scottgasch commented 3 years ago

Here's the diff of the cc flags when compiling that .S file:

cc
-c
-O2
-pipe
-fstack-protector-strong
-fno-strict-aliasing
-O2
-DMAX_STACK_ALLOC=2048
-DEXPRECISION
-fopenmp
-Wall
-m64
-DF_INTERFACE_GFORT
-fPIC
-DSMP_SERVER
-DUSE_OPENMP
-DNO_WARMUP
-DMAX_CPU_NUMBER=64
-DMAX_PARALLEL_NUMBER=1
> -DBUILD_SINGLE=1
> -DBUILD_DOUBLE=1
> -DBUILD_COMPLEX=1
> -DBUILD_COMPLEX16=1
! -DVERSION=\"0.3.15\"
> -msse3
> -mssse3
> -UASMNAME
> -UASMFNAME
> -UNAME
> -UCNAME
> -UCHAR_NAME
> -UCHAR_CNAME
-DASMNAME=
-DASMFNAME=_
-DNAME=_
-DCNAME=
-DCHAR_NAME=\"_\"
-DCHAR_CNAME=\"\"
-DNO_AFFINITY
-I.
-O2
-DMAX_STACK_ALLOC=2048
-DEXPRECISION
-fopenmp
-Wall
-m64
-DF_INTERFACE_GFORT
-fPIC
-DSMP_SERVER
-DUSE_OPENMP
-DNO_WARMUP
-DMAX_CPU_NUMBER=64
-DMAX_PARALLEL_NUMBER=1
> -DBUILD_SINGLE=1
> -DBUILD_DOUBLE=1
> -DBUILD_COMPLEX=1
> -DBUILD_COMPLEX16=1
! -DVERSION=\"0.3.15\"
> -msse3
> -mssse3
> -UASMNAME
> -UASMFNAME
> -UNAME
> -UCNAME
> -UCHAR_NAME
> -UCHAR_CNAME
-DASMNAME=dcopy_k
-DASMFNAME=dcopy_k_
-DNAME=dcopy_k_
-DCNAME=dcopy_k
-DCHAR_NAME=\"dcopy_k_\"
-DCHAR_CNAME=\"dcopy_k\"
-DNO_AFFINITY
-I..
-DDOUBLE
-UCOMPLEX
-UCOMPLEX
-DDOUBLE
-DC_INTERFACE
../kernel/x86_64/copy_sse2.S
-o
dcopy_k.o
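A flag-by-flag comparison like the one above can also be produced programmatically from two captured command lines. A minimal sketch using `shlex`; both command lines below are heavily abbreviated stand-ins built from flags seen in this thread, and the 0.3.9 `-DVERSION` value is an assumption.

```python
import shlex


def added_flags(old_cmd, new_cmd):
    """Return the arguments of new_cmd that do not occur in old_cmd, in order."""
    old = set(shlex.split(old_cmd))
    return [flag for flag in shlex.split(new_cmd) if flag not in old]


# Abbreviated, illustrative command lines (not the full build logs).
OLD_CMD = (r'cc -c -O2 -m64 -fPIC -DVERSION=\"0.3.9\" -DNO_AFFINITY '
           r'../kernel/x86_64/copy_sse2.S -o dcopy_k.o')
NEW_CMD = (r'cc -c -O2 -m64 -fPIC -DBUILD_DOUBLE=1 -DVERSION=\"0.3.15\" '
           r'-msse3 -mssse3 -DNO_AFFINITY ../kernel/x86_64/copy_sse2.S -o dcopy_k.o')

for flag in added_flags(OLD_CMD, NEW_CMD):
    print(flag)
```

Swapping the two arguments lists flags that were dropped instead of added. Because this compares sets, repeated flags such as the duplicated `-O2` are not reported.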

I tried undefining HAS_SSE3 and HAS_SSSE3 in the Makefiles and got a clean build. I captured the logs; this was the command line for assembling that .S file, and it doesn't have the sse3 business:

cc -c -O2 -DMAX_STACK_ALLOC=2048 -DEXPRECISION -fopenmp -Wall -m64 -DF_INTERFACE_GFORT -fPIC -DSMP_SERVER -DUSE_OPENMP -DNO_WARMUP -DMAX_CPU_NUMBER=64 -DMAX_PARALLEL_NUMBER=1 -DBUILD_SINGLE=1 -DBUILD_DOUBLE=1 -DBUILD_COMPLEX=1 -DBUILD_COMPLEX16=1 -DVERSION=\"0.3.15\" -UASMNAME -UASMFNAME -UNAME -UCNAME -UCHAR_NAME -UCHAR_CNAME -DASMNAME=dcopy_k -DASMFNAME=dcopy_k_ -DNAME=dcopy_k_ -DCNAME=dcopy_k -DCHAR_NAME=\"dcopy_k_\" -DCHAR_CNAME=\"dcopy_k\" -DNO_AFFINITY -I.. -DDOUBLE -UCOMPLEX -UCOMPLEX -DDOUBLE -DC_INTERFACE ../kernel/x86_64/copy_sse2.S -o dcopy_k.o

That said, I'm still getting an illegal instruction signal. Not really sure where to go from here... if you have suggestions I'm happy to help try to track it down. I can also just use 0.3.9 for now.

scottgasch commented 3 years ago

Maybe the -DBUILD_[SINGLE|DOUBLE|COMPLEX|COMPLEX16]=1 business?

martin-frbg commented 3 years ago

Unlikely - at least that is only supposed to allow fine-tuning if you want to build e.g. only the single-precision BLAS and LAPACK functions to get a smaller library.

scottgasch commented 3 years ago

Maybe I should use this as an excuse to tell my wife I'm buying a newer computer. I have a Haswell chipset from ~2013 and an i7 from 2015 that work fine with 0.3.15. This old Core2 is slow anyway. :)

martin-frbg commented 3 years ago

I do not think I have a genuine core2 here that could be resuscitated. I guess what you could try is to comment out just the HAVE_SSSE3/-mssse3 from Makefile.x86_64 (that nobody relies on as far as I can tell) and see if that makes the compiler use some more benign alignment. If that fails I am all for declaring it a hardware problem...

brada4 commented 3 years ago

movups and movaps have identical semantics and were introduced with SSE (i.e. the Pentium III); the unaligned version is 4-5x slower than the aligned one on old CPUs. There are 'MOV ordering issues' in the CPU errata, which are fixed by the BIOS. Please check that you have the latest BIOS, from 2010-2012, and that /proc/cpuinfo and/or the first lines of dmesg report the active ucode to be bf or c0. Probably GCC somehow emits the same movups that triggers the CPU bugs.

scottgasch commented 3 years ago

Thanks, Andrew. One question I have about this is that, with Martin's help, I've been able to provoke this SIGILL on movups instructions, movaps instructions and just plain mov instructions by messing around with the build (copying the .S file from an old version into the 0.3.15 build, modifying KERNEL.CORE2 to use generic code, etc...). This is what led me to suspect code/data alignment issues more so than unsupported opcodes. But, IIUC, you're saying these CPU move ordering issues are relevant to all flavors of mov instructions, so they could be the culprit?

The BIOS on this machine is old. (american-megatrends-0610-06-06-2008). I will look into flashing it if any update is available and report back. Thanks.


brada4 commented 3 years ago

Typically instructions expect an 8-byte double to be aligned at 8 bytes, though some CPUs need better alignment. If you malloc() you actually get it page-aligned and do not notice.

Check dmidecode and lspci to find the manufacturer and model of BIOS. Also check both ways if microcode got updated, install intel-ucode or ucode-intel package if not.
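One way to do the microcode check on a Linux system is to read the `microcode` field that `/proc/cpuinfo` lists per logical CPU. A minimal sketch; the function name and the sample text (including the revision `0xba`) are hypothetical and not taken from the machine in this report.

```python
import re

MICROCODE = re.compile(r"^microcode\s*:\s*(0x[0-9a-fA-F]+)", re.MULTILINE)


def microcode_revisions(cpuinfo_text):
    """Return the set of microcode revisions (as ints) found in cpuinfo text."""
    return {int(match.group(1), 16) for match in MICROCODE.finditer(cpuinfo_text)}


# Hypothetical excerpt in the /proc/cpuinfo layout, two logical CPUs.
SAMPLE = (
    "processor\t: 0\n"
    "model name\t: Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz\n"
    "microcode\t: 0xba\n"
    "\n"
    "processor\t: 1\n"
    "microcode\t: 0xba\n"
)

print(sorted(hex(rev) for rev in microcode_revisions(SAMPLE)))
```

On a real Linux machine the input would be `open("/proc/cpuinfo").read()`; more than one value in the result would mean the cores are not all running the same revision.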

The Core2 errata mention only BIOS fixes without detail, which means that at boot the BIOS vendor programs undocumented config registers, in some NDA way, to disable the faulty pieces of the silicon.